git.ipfire.org Git - thirdparty/haproxy.git/log

MINOR: proto_tcp: tcp_bind_listener: copy errno in errmsg

Let's copy errno in errmsg produced by tcp_bind_listener if it fails in
a syscall(). This is helpful to debug issues, while binding listeners.

BUG/MINOR: proto_tcp: keep error msg if listen() fails

If listen() fails, we need to keep the message about it, which is copied then
in errmsg buffer on the error path. This buffer is properly provided by the
caller (protocol_bind_all()) and reallocated if needed in memprintf(), but
it was deleted without being returned.

This can be backported to all stable versions.

BUG/MINOR: proto_tcp: delete fd from fdtab if listen() fails

If listen() fails, fd should be deleted from fdtab, not just closed. Otherwise,
sock_inet_bind_receiver(), which is called in loop for each receiver, will
obtain the same fd via socket() for the next receiver, registered in the
receivers list. Then, it will bind it again and it will try to re-insert it in
fdtab, and fd_insert() will trigger the BUG_ON(fdtab[fd].owner != NULL) check.

When tcp_bind_listener() code was implemented, the use of fd_delete() was
not generalized and this one remained overlooked.

This can be backported to all stable versions.

[RELEASE] Released version 3.1-dev5

Released version 3.1-dev5 with the following main changes :
    - BUG/MINOR: quic: Lack of precision when computing K (cubic only cc)
    - MEDIUM: ssl/quic: implement quic crypto with EVP_AEAD
    - MINOR: quic: rename confusing wording aes to hp
    - MEDIUM: quic: add key argument to header protection crypto functions
    - MEDIUM: quic: implement CHACHA20_POLY1305 for AWS-LC
    - MEDIUM: sink: assume sft appctx stickiness
    - MINOR: quic: delay Retry emission on quic-force-retry
    - MEDIUM: quic: implement quic-initial rules
    - MINOR: quic: support ACL for quic-initial rules
    - MINOR: quic: pass quic_dgram as obj_type for quic-initial rules
    - MINOR: quic: implement reject quic-initial action
    - MINOR: quic: implement send-retry quic-initial rules
    - BUG/MEDIUM: quic: fix invalid conn reject with CONNECTION_REFUSED
    - MEDIUM: h1: allow to preserve keep-alive on T-E + C-L
    - MINOR: quic: Add information to "show quic" for CUBIC cc.
    - MINOR: quic: Dump TX in flight bytes vs window values ratio.
    - BUG/MEDIUM: jwt: Clear SSL error queue on error when checking the signature
    - BUILD: cfgparse-quic: fix build error on Solaris due to missing netinet/in.h
    - MINOR: queue: add a function to check for TOCTOU after queueing
    - BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()
    - DOC: config: Add documentation about spop mode for backends
    - BUG/MEDIUM: stconn: Report error on SC on send if a previous SE error was set
    - BUG/MEDIUM: mux-pt/mux-h1: Release the pipe on connection error on sending path
    - BUILD: mux-pt: Use the right name for the sedesc variable
    - BUG/MINOR: stconn: bs.id and fs.id had their dependencies incorrect
    - BUG/MEDIUM: ssl: reactivate 0-RTT for AWS-LC
    - BUG/MEDIUM: ssl: 0-RTT initialized at the wrong place for AWS-LC
    - BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC
    - BUG/MEDIUM: quic: prevent conn freeze on 0RTT undeciphered content
    - MINOR: tcp_sample: Move TCP low level sample fetch function to control layer
    - MINOR: quic: Define ->get_info() control layer callback for QUIC
    - MINOR: flags/mux-quic: decode qcc and qcs flags
    - BUG/MINOR: quic: fix fc_rtt/srtt values
    - BUG/MIONR: quic: fix fc_lost
    - BUG/MINOR: h1: do not forward h2c upgrade header token
    - BUG/MINOR: h2: reject extended connect for h2c protocol
    - BUG/MEDIUM: http-ana: Report error on write error waiting for the response
    - BUG/MEDIUM: h2: Only report early HTX EOM for tunneled streams
    - BUG/MEDIUM: mux-h2: Propagate term flags to SE on error in h2s_wake_one_stream
    - BUG/MEDIUM: peer: Notify the applet won't consume data when it waits for sync
    - BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only)
    - CI: add weekly QUIC Interop regression against AWS-LC
    - CI: harden NetBSD builds by ERR=1
    - BUG/MINOR: quic: Too short datagram during packet building failures (aws-lc only)
    - DEV: coccinelle: add a test to detect unchecked strdup()
    - BUG/MINOR: fcgi-app: handle a possible strdup() failure
    - BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once event miss and leak
    - MINOR: quic: convert qc_stream_desc release field to flags
    - MINOR: quic: implement function to check if STREAM is fully acked
    - BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
    - MINOR: quic: enforce ACK reception is handled in order
    - DOC: configuration: fix alphabetical ordering of {bs,fs}.aborted
    - MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str
    - MINOR: mux-h2: implement the debug string for logs
    - MINOR: mux-quic: define dump functions for QCC and QCS
    - MINOR: mux-quic: implement debug string for logs
    - MINOR: quic: dump quic_conn debug string for logs
    - MINOR: time: define tot_time structure
    - MINOR: mux-quic: measure QCS lifetime and its blocking state
    - BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn
    - BUG/MINOR: trace/quic: permit to lock on frontend/connect/session etc
    - BUG/MEDIUM: trace: fix null deref in lockon mechanism since TRACE_ENABLED()
    - BUG/MINOR: trace: automatically start in waiting mode with "start <evt>"
    - BUG/MINOR: trace/quic: make "qconn" selectable as a lockon criterion
    - BUG/MINOR: quic/trace: make quic_conn_enc_level_init() emit NEW not CLOSE
    - MINOR: trace: support setting the sink and level for all sources at once
    - MINOR: session/trace: enable very minimal session tracing
    - MEDIUM: trace: implement a "follow" mechanism
    - MINOR: trace: move the known trace context into a dedicated struct
    - MINOR: trace: add a per-source helper to pre-fill the context
    - MINOR: mux-h2: add a trace context filling helper
    - MINOR: mux-h1: add a trace context filling helper
    - MINOR: mux-quic: don't leave dangling pointer after freeing qcs->sd
    - MINOR: mux-quic: add a trace context filling helper
    - MINOR: mux-h1/trace: add a state trace on stream creation/upgrade
    - MINOR: mux-h2/trace: add a state trace on stream creation/destruction
    - MINOR: mux-h3/trace: add a state trace on stream creation/destruction
    - BUG/MINOR: quic: prevent freeze after early QCS closure
    - MINOR: server: ensure max_events_at_once > 0 in server_atomic_sync()
    - MINOR: cfgparse: add struct cfgfile to represent config in memory
    - REORG: tools: move list_append_word to cfgparse
    - MINOR: startup: adapt list_append_word to use cfgfile
    - MINOR: cfgparse: add load_cfg_in_mem
    - MINOR: cfgparse: load_cfg_in_mem: take in account file size
    - MINOR: tools: add fgets_from_mem
    - MEDIUM: startup: make read_cfg() return immediately on ENOMEM
    - MEDIUM: startup: load and parse configs from memory
    - MINOR: startup: rename readcfgfile in parse_cfg

MINOR: startup: rename readcfgfile in parse_cfg

As readcfgfile no longer opens configuration files and reads them with fgets,
but performs only the parsing of provided data, let's rename it to parse_cfg by
analogy with read_cfg in haproxy.c.

MEDIUM: startup: load and parse configs from memory

Let's call load_cfg_in_ram() helper for each configuration file to load it's
content in some area in memory. Adapt readcfgfile() parser function
respectively. In order to limit changes in its scope we give as an argument a
cfgfile structure, already filled in init_args() and in load_cfg_in_ram() with
file metadata and content.

Parser function (readcfgfile()) uses now fgets_from_mem() instead of standard
fgets from libc implementations.

SPOE filter parses its own configuration file, pointed by 'config' keyword in
the configuration already loaded in memory. So, let's allocate and fill for
this a supplementary cfgfile structure, which is not referenced in cfg_cfgfiles
list. This structure and the memory with content of SPOE filter configuration
are freed immediately in parse_spoe_flt(), when readcfgfile() returns.

HAProxy OpenTracing filter also uses its own configuration file. So, let's
follow the same logic as we do for SPOE filter.

MEDIUM: startup: make read_cfg() return immediately on ENOMEM

This commit prepares read_cfg() to call load_cfg_in_mem() helper in order to
load configuration files in memory. Before, read_cfg() calls the parser for all
files from cfg_cfgfiles list and cumulates parser's errors and memprintf's
errors in for_each loop. memprintf's errors did not stop this loop and were
accounted just after.

Now, as we plan to load configuration files in memory, we stop the loop, if
memprintf() fails, and we show appropraite error message with ha_alert. Then
process terminates. So not all cumulated syntax-related errors will be shown
before exit in this case and we has to stop, because we run out of memory.

If we can't open the current file or we fail to allocate a memory to store
some configuration line, the previous behaviour is kept, process emits
appropriate alert message and exits.

If parser returns some syntax-related error on the current file, the previous
behaviour is kept as well. We cumulate such errors for all parsed files and we
check them just after the loop. All syntax-related errors for all files is
shown then as before in ha_alert messages line by line during the startup.
Then process will exit with 1.

As now cfg_cfgfiles list contains many pointers to some memory areas with
configuration files content and this content could be big, it's better to
free the list explicitly, when parsing was finished. So, let's change
read_cfg() to return some integer value to its caller init(), and let's perform
the free routine at a caller level, as cfg_cfgfiles list was initialized and
initially filled at this level.

MINOR: tools: add fgets_from_mem

Add fgets_from_mem() helper to read lines from configuration files, stored now
as memory chunks. In order to limit changes in the first-level parser code
(readcfgfile()), it is better to reimplement the standard fgets, i.e. to
have a fgets, which can read the serialized data line by line from some memory
area, instead of file stream, and can keep the same behaviour as libc
implementations fgets.

MINOR: cfgparse: load_cfg_in_mem: take in account file size

Let's take in account the given file size, when its reported via stat.

It's very convenient for large configuration files, as this allows to
perform only the one memory allocation call for precisely needeed file size.
This also allows to perform only the one call to fread().

We need to provide to fread() file_stat.st_size + 1 to be able to grab EOF.
Like this it sets feof(f)=1 flag and this allows to exit from the loop
immediately, just after fread call.

If /dev/stdin or /dev/null is provided as a file, we continue to read the
configuration chunk by chunk, stat doesn't report the size.

MINOR: cfgparse: add load_cfg_in_mem

Add load_cfg_in_mem() helper, which allows to store the content of a given file
in memory.

MINOR: startup: adapt list_append_word to use cfgfile

list_append_word() helper was used before only to chain configuration file names
in a list. As now we start to use cfgfile structure which represents entire file
in memory and its metadata, let's adapt this helper to use this structure and
let's rename it to list_append_cfgfile().

Adapt functions, which process configuration files and directories to use
cfgfile structure and list_append_cfgfile() instead of wordlist.

REORG: tools: move list_append_word to cfgparse

Let's move list_append_word to cfgparse.c as it is used only to fill
cfg_cfgfiles list with configuration file names.

MINOR: cfgparse: add struct cfgfile to represent config in memory

This and following commits serve to prepare loading configuration files in
memory, before parsing them, as we may need to parse some parts of
configuration in different moments of the startup sequence. This is a case of
the new master-worker initialization process. Here we need to read at first
only the global and the program sections and only after some steps
(forking worker, etc) the rest of the configuration.

Add a new structure cfgfile to keep configuration files metadata and content,
loaded somewhere in a memory. Instances of filled cfgfile structures could be
chained in a list, as the order in which they were loaded is important.

MINOR: server: ensure max_events_at_once > 0 in server_atomic_sync()

In 8f1fd96 ("BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once
event miss and leak"), we added a comment saying that
tune.events.max-events-at-once is assumed to be strictly positive.

It is so because the keyword parser forces values between 1 and 10000:
we don't want less than 1 because it wouldn't make any sense, and 10k
max because beyond that we could create contention in server_atomic_sync()

Now as the above commit implements a do..while it heavily relies on the
fact that the budget is at least 1. Upon soft-stop, we break away from
the loop without decrementing the budget. With all that in mind, it is
safe to assume that the 'remain' counter will only fall to 0 if the task
runs out of budget while doing work, in which case the task still exists
and must be rescheduled.

As seen in GH #2667 this assumption was ambiguous, so let's make it
official by adding a pair of BUG_ON() that make it explicit that it
works because remain 'cannot' be 0 unless the entire budget was
consumed.

No backport needed.

BUG/MINOR: quic: prevent freeze after early QCS closure

A connection freeze may occur if a QCS is released before transmitting
any data. This can happen when an error is detected early by the stream,
for example during HTTP response headers encoding, forcing the whole
connection closure.

In this case, a connection error is registered by the QUIC MUX to the
lower layer. MUX is then release and xprt layer is notified to prepare
CONNECTION_CLOSE emission. However, this is prevented because quic_conn
streams tree is not empty as it contains the qc_stream_desc previously
attached to the failed QCS instance. The connection will freeze until
QUIC idle timeout.

This situation is caused by an omission during qc_stream_desc release
operation. In the described situation, qc_stream_desc current buffer is
empty and can thus by removed, which is the purpose of this patch. This
unblocks this previously failed situation, with qc_stream_desc removal
from quic_conn tree.

This issue can be reproduced by modifying H3/QPACK code to return an
early error during HEADERS response processing.

This must be backported up to 2.6, after a period of observation.

MINOR: mux-h3/trace: add a state trace on stream creation/destruction

Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.

MINOR: mux-h2/trace: add a state trace on stream creation/destruction

Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.

MINOR: mux-h1/trace: add a state trace on stream creation/upgrade

Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces. Note that
h1s destruction was already logged as closing connection or switching
to idle mode.

MINOR: mux-quic: add a trace context filling helper

This helper is able to find a connection, a session, a stream, or a
frontend from its args.

MINOR: mux-quic: don't leave dangling pointer after freeing qcs->sd

In qcs_free() we're calling a few other functions after releasing
qcs->sd. None of them make use of it for now but with traces that
will change. Make sure to clear qcs->sd after releasing it.

MINOR: mux-h1: add a trace context filling helper

This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.

MINOR: mux-h2: add a trace context filling helper

This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.

Note that this required to always make sure that h2s->sess is reset on
allocation because it's normally initialized later for backend streams,
and producing traces between the two could pre-fill a bad pointer in
the trace_ctx.

MINOR: trace: add a per-source helper to pre-fill the context

Now sources which want to do it can provide a helper that can pre-fill
some fields in the context based on their knowledge (e.g. mux streams).

MINOR: trace: move the known trace context into a dedicated struct

We now have a trace_ctx to hold the sess, conn, qc, stream and so on.
This will allow us to pass it across layers so that other helpers can
help fill them.

Ideally it should be passed as an argument to __trace_enabled() by
__trace() so that it can be passed back to the trace callback. But
it seems that trace callbacks are smart enough to figure all their
info when they need them.

MEDIUM: trace: implement a "follow" mechanism

With "follow" from one source to another, it becomes possible for a
source to automatically follow another source's tracked pointer. The
best example is the session:
  - the "session" source is enabled and has a "lockon session"
    -> its lockon_ptr is equal to the session when valid
  - other sources (h1,h2,h3 etc) are configured for "follow session"
    and will then automatically check if session's lockon_ptr matches
    its own session, in which case tracing will be enabled for that
    trace (no state change).

It's not necessary to start/pause/stop traces when using this, only
"follow" followed by a source with lockon enabled is needed. Some
combinations might work better than others. At the moment the session
is almost never known from the backend, but this may improve.

The meta-source "all" is supported for the follower so that all sources
will follow the tracked one.

MINOR: session/trace: enable very minimal session tracing

By having traces at the session level, it becomes possible to start
traces on session creation and pause them on session end. Doing so
will soon open new possibilties to synchronize multiple traces.

MINOR: trace: support setting the sink and level for all sources at once

It's extremely painful to have to set "trace <src> sink buf1" for all
sources, then to do the same for "level developer" (for example). Let's
have a possibility via a meta-source "all" to apply the change to all
sources at once. This currently supports level and sink, which are not
dependent on the source, this is a good start.

BUG/MINOR: quic/trace: make quic_conn_enc_level_init() emit NEW not CLOSE

The event emitted by this trace was of type CLOSE instead of NEW, which
would somtimes temporarily pause a started trace.

This can be backported to 3.0, probably 2.6.

BUG/MINOR: trace/quic: make "qconn" selectable as a lockon criterion

The test was was performed but there's no way to set the option! Let's
just add "qconn" to select the quic conn when the source supports it.

This can be backported at least to 3.0, probably 2.6.

BUG/MINOR: trace: automatically start in waiting mode with "start <evt>"

The doc clearly says that "start <evt>" should leave the trace in pause
mode until the indicated event appears. However it's not what's happening,
the state is not changed until one command uses "now", so it's typically
needed to configure the events with "start <evt>" then enable the waiting
mode using "pause now". This is counter-intuitive and does not match the
doc, so let's fix it so that "start <evt>" switches from stopped to waiting
as long as at least one event is enabled.

This can be backported to all versions.

BUG/MEDIUM: trace: fix null deref in lockon mechanism since TRACE_ENABLED()

When calling TRACE_ENABLED(), which is called by TRACE_PRINTF(), we pass
a NULL plockptr to __trace_enabled(). This argument is used when lockon
is active, and may update the pointer. This is an overlook which also
broke the lockon mechanism because now for calls from __trace(), it
dereferences a pointer pointing to NULL, and never updates it due to the
broken condition, so that trace() never sets up src->lockon_ptr.

The bug was introduced in 2.8 by commit 8f9a9704bb ("MINOR: trace: add a
TRACE_ENABLED() macro to determine if a trace is active"), so the fix must
be backported there.

BUG/MINOR: trace/quic: permit to lock on frontend/connect/session etc

These ones were not proposed in the list of trackable elements. Note
that this depends on previous commit:

BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn

This should be backported to at least 3.0, maybe even 2.6.

BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn

In __trace_enabled(), a quic_conn was detected, but it was not possible
to derive the connection nor the session from it, which was quite limiting
in terms of ability to track a same instance.

This should be backported to at least 3.0, maybe even 2.6.

MINOR: mux-quic: measure QCS lifetime and its blocking state

Reuse newly defined tot_time structure to measure various values related
to a QCS lifetime.

First, a timer is used to comptabilize the total QCS lifetime. Then, two
other timers are used to account the total time during which Tx from
stream layer to MUX is blocked, either on lack of buffer or due to
flow-control.

These three timers are reported in qmux_dump_qcs_info(). Thus, they are
available in traces and for QUIC MUX debug string sample.

MINOR: time: define tot_time structure

Define a new utility type tot_time. Its purpose is to be able to account
elapsed time accross multiple periods. Functions are defined to easily
start and stop measures, and return the current value.

MINOR: quic: dump quic_conn debug string for logs

Define a new xprt_ops callback named dump_info. This can be used to
extend MUX debug string with infos from the lower layer.

Implement dump_info for QUIC stack. For now, only minimal info are
reported : bytes in flight and size of the sending window. This should
allow to detect if the congestion controller is fine. These info are
reported via QUIC MUX debug string sample.

MINOR: mux-quic: implement debug string for logs

Implement MUX_SCTL_DBG_STR for QUIC MUX. This returns info for the
current QCS and QCC instances, reusing qmux_dump_qc{c,s}_info functions
already used for traces, and the connection flags.

This stream operation is useful for debug string sample support.

MINOR: mux-quic: define dump functions for QCC and QCS

Extract trace code to dump QCC and QCS instances into dedicated
functions named qmux_dump_qc{c,s}_info(). This will allow to easily
print QCC/QCS infos outside of traces.

MINOR: mux-h2: implement the debug string for logs

Now it permits to have this for a front and a back:

<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:64860 [30/Jul/2024:19:32:53.732] test2 test2/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/2.0"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1e03fb0(.flg=0x00034482 .app=0x1e04020) .sd=0x1e03f30(.flg=0x50405601) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x100e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x80000300
<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:65246 [30/Jul/2024:19:32:53.732] test1 test1/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/1.1"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1dfc7b0(.flg=0x0006d01b .app=0x1c65fe0) .sd=0x1dfc820(.flg=0x1040ca01) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x108e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x000300

Just with this in the front and back proxies respectively:
  log-format "$HAPROXY_HTTP_LOG_FMT %[bs.debug_str(15)]"
  log-format "$HAPROXY_HTTP_LOG_FMT %[fs.debug_str(15)]"

For now the mux only implements muxs, muxc, conn. Xprt is ignored.

MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str

These are passed to the underlying mux to retrieve debug information
at the mux level (stream/connection) as a string that's meant to be
added to logs.

The API is quite complex just because we can't pass any info to the
bottom function. So we construct a union and pass the argument as an
int, and expect the callee to fill that with its buffer in return.

Most likely the mux->ctl and ->sctl API should be reworked before
the release to simplify this.

The functions take an optional argument that is a bit mask of the
layers to dump:
  muxs=1
  muxc=2
  xprt=4
  conn=8
  sock=16

The default (0) logs everything available.

DOC: configuration: fix alphabetical ordering of {bs,fs}.aborted

These must be before {bs,fs}.id, not after. Should be backported wherever
068ce2d5d2 ("MINOR: stconn: Add samples to retrieve about stream aborts")
is (normally 3.0).

MINOR: quic: enforce ACK reception is handled in order

Add a new BUG_ON() in qc-stream_desc_ack(). It ensures that
acknowledgement are always notify in-order. This is because out-of-order
ACKs cannot be handled by qc_stream_desc layer which does not support
gap in STREAM sent data.

Prior to this fix, out-of-order ACKs are simply ignored without any
error. This currently cannot happen thanks to careful
qc_stream_desc_ack() invokation. If this assumption is broken in the
future by inatteion, this would cause loss of ACK notification which
will prevent qc_stream_desc release.

BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

STREAM frames have dedicated handling on retransmission. A special check
is done to remove data already acked in case of duplicated frames, thus
only unacked data are retransmitted.

This handling is faulty in case of an empty STREAM frame with FIN set.
On retransmission, this frame does not cover any unacked range as it is
empty and is thus discarded. This may cause the transfer to freeze with
the client waiting indefinitely for the FIN notification.

To handle retransmission of empty FIN STREAM frame, qc_stream_desc layer
have been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by MUX QUIC
when FIN has been transmitted. If set, it prevents qc_stream_desc to be
freed until FIN is acknowledged. On retransmission side,
qc_stream_frm_is_acked() has been updated. It now reports false if
FIN bit is set on the frame and qc_stream_desc has QC_SD_FL_WAIT_FOR_FIN
set.

This must be backported up to 2.6. However, this modifies heavily
critical section for ACK handling and retransmission. As such, it must
be backported only after a period of observation.

This issue can be reproduced by using the following socat command as
server to add delay between the response and connection closure :
$ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;'

On the client side, ngtcp2 can be used to simulate packet drop. Without
this patch, connection will be interrupted on QUIC idle timeout or
haproxy client timeout with ERR_DRAINING on ngtcp2 :
$ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o"

Alternatively to ngtcp2 random loss, an extra haproxy patch can also be
used to force skipping the emission of the empty STREAM frame :

diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h
index efbdfe687..1ff899acd 100644
--- a/include/haproxy/quic_tx-t.h
+++ b/include/haproxy/quic_tx-t.h
@@ -26,6 +26,8 @@ extern struct pool_head *pool_head_quic_cc_buf;
/* Flag a sent packet as being probing with old data */
#define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5)

+#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6)
+
/* Structure to store enough information about TX QUIC packets. */
struct quic_tx_packet {
/* List entry point. */
diff --git a/src/quic_tx.c b/src/quic_tx.c
index 2f199ac3c..2702fc9b9 100644
--- a/src/quic_tx.c
+++ b/src/quic_tx.c
@@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
tmpbuf.size = tmpbuf.data = dglen;

TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc);
- if (!skip_sendto) {
+ if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) {
int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso);
if (ret < 0) {
if (gso && ret == -EIO) {
@@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
qc->cntrs.sent_bytes_gso += ret;
}
}
+ first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO;

b_del(buf, dglen + QUIC_DGRAM_HEADLEN);
qc->bytes.tx += tmpbuf.data;
@@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end,
continue;
}

+ switch (cf->type) {
+ case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F:
+ if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) {
+ TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc);
+ pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO;
+ }
+ break;
+ default:
+ break;
+ }
+
quic_tx_packet_refinc(pkt);
cf->pkt = pkt;
}

MINOR: quic: implement function to check if STREAM is fully acked

When a STREAM frame is retransmitted, a check is performed to remove
range of data already acked from it. This is useful when STREAM frames
are duplicated and splitted to cover different data ranges. The newly
retransmitted frame contains only unacked data.

This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if frame data are already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, frame content is updated to only cover the
unacked data.

This patch does not have any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set from the following
patch :
BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.

MINOR: quic: convert qc_stream_desc release field to flags

qc_stream_desc had a field <release> used as a boolean. Convert it with
a new <flags> field and QC_SD_FL_RELEASE value as equivalent.

The purpose of this patch is to be able to extend qc_stream_desc by
adding newer flags values. This patch is required for the following
patch
BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.

BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once event miss and leak

An issue has been introduced with cd99440 ("BUG/MAJOR: server/addr: fix
a race during server addr:svc_port updates").

Indeed, in the above commit we implemented the atomic_sync task which is
responsible for consuming pending server events to apply the changes
atomically. For now only server's addr updates are concerned.

To prevent the task from causing contention, a budget was assigned to it.
It can be controlled with the global tunable
'tune.events.max-events-at-once': the task may not process more than this
number of events at once.

However, a bug was introduced with this budget logic: each time the task
has to be interrupted because it runs out of budget, we reschedule the
task to finish where it left off, but the current event which was already
removed from the queue wasn't processed yet. This means that this pending
event (each tune.events.max-events-at-once) is effectively lost.

When the atomic_sync task deals with large number of concurrent events,
this bug has 2 known consequences: first a server's addr/port update
will be lost every 'tune.events.max-events-at-once'. This can of course
cause reliability issues because if the event is not republished
periodically, the server could stay in a stale state for indefinite amount
of time. This is the case when the DNS server flaps for instance: some
servers may not come back UP after the incident as described in GH #2666.

Another issue is that the lost event was not cleaned up, resulting in a
small memory leak. So in the end, it means that the bug is likely to
cause more and more degradation over time until haproxy is restarted.

As a workaround, 'tune.events.max-events-at-once' may be set to the
maximum number of events expected per batch. Note however that this value
cannot exceed 10 000, otherwise it could cause the watchdog to trigger due
to the task being busy for too long and preventing other threads from
making any progress. Setting higher values may not be optimal for common
workloads so it should only be used to mitigate the bug while waiting for
this fix.

Since tune.events.max-events-at-once defaults to 100, this bug only
affects configs that involve more than 100 servers whose addr:port
properties are likely to be updated at the same time (batched updates
from cli, lua, dns..)

To fix the bug, we move the budget check after the current event is fully
handled. For that we went from a basic 'while' to 'do..while' loop as we
assume from the config that 'tune.events.max-events-at-once' cannot be 0.
While at it, we reschedule the task once thread isolation ends (it was not
required to perform the reschedule while under isolation) to give the hand
back faster to waiting threads.

This patch should be backported up to 2.9 with cd99440. It should fix
GH #2666.

BUG/MINOR: fcgi-app: handle a possible strdup() failure

This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to 2.2.

DEV: coccinelle: add a test to detect unchecked strdup()

The coccinelle test "unchecked-strdup.cocci" detects various cases of
unchecked strdup().

BUG/MINOR: quic: Too short datagram during packet building failures (aws-lc only)

This issue was reported by Ilya (@Chipitsine) when building haproxy against
aws-lc in GH #2663 where handshakeloss and handshakecorruption interop tests could
lead haproxy to crash after having built too short datagrams:

FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
call trace(13):
| 0x55f4ee4dcc02 [ba d9 00 00 00 48 8d 35]: main-0x195bf2
| 0x55f4ee4e3112 [83 3d 2f 16 35 00 00 0f]: qc_send+0x11f3/0x1b5d
| 0x55f4ee4e9ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
| 0x55f4ee6efa82 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
| 0x55f4ee6f05d3 [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
| 0x55f4ee671bb7 [83 3d 86 72 44 00 01 0f]: run_poll_loop+0x6e/0x57b
| 0x55f4ee672367 [48 8b 1d 22 d4 1d 00 48]: main-0x48d
| 0x55f4ee6755e0 [b8 00 00 00 00 e8 08 61]: main+0x2dec/0x335d

This could happen after Handshake packet building failures which follow a successful
Initial packet into the same datagram. In this case, the datagram could be emitted
with a too short length (<1200 bytes).

To fix this, store the datagram only if the first packet is not an Initial packet
or if its length is big enough (>=1200 bytes).

Must be backported as far as 2.6.

CI: harden NetBSD builds by ERR=1

Add ERR=1 build option to the NetBSD build from github.

CI: add weekly QUIC Interop regression against AWS-LC

currently only quic-go and picoquic clients are enabled.
Tests will be run weekly.

BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only)

By "aws-lc only", one means that this bug was first revealed by aws-lc stack.
This does not mean it will not appeared for new versions of other TLS stacks which
have never revealed this bug.

This bug was reported by Ilya (@chipitsine) in GH #2657 where some QUIC interop
tests (resumption, zerortt) could lead to crash with haproxy compiled against
aws-lc TLS stack. These crashed were triggered by this BUG_ON() which detects
that too short datagrams with at least one ack-eliciting Initial packet inside
could be built.

  <0>2024-07-31T15:13:42.562717+02:00 [01|quic|5|quic_tx.c:739] qc_prep_pkts():
  next encryption level : qc@0x61d000041080 idle_timer_task@0x60d000006b80 flags=0x6000058

  FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
  call trace(12):
  | 0x563ea447bc02 [ba d9 00 00 00 48 8d 35]: main-0x1958ce
  | 0x563ea4482703 [e9 73 fe ff ff ba 03 00]: qc_send+0x17e4/0x1b5d
  | 0x563ea4488ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
  | 0x563ea468e6f9 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
  | 0x563ea468f24a [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
  | 0x563ea4610893 [83 3d aa 65 44 00 01 0f]: run_poll_loop+0x6e/0x57b
  | 0x563ea4611043 [48 8b 1d 46 c7 1d 00 48]: main-0x48d
  | 0x7f64d05fb609 [64 48 89 04 25 30 06 00]: libpthread:+0x8609
  | 0x7f64d0520353 [48 89 c7 b8 3c 00 00 00]: libc:clone+0x43/0x5e

That said everything was correctly done by qc_prep_ptks() to prevent such a case.
But this relied on the hypothesis that the list of encryption levels it used
was always built in the same order as follows for 0-RTT sessions:

    initial, early-data, handshake, application

But this order is determined but the order the TLS stack derives the secrets
for these encryption levels. For aws-lc, this order is not the same but
as follows:

    initial, handshake, application, early-data

During 0-RTT sessions, the server may have to build three ack-eliciting packets
(with CRYPTO data inside) to reply to the first client packet: initial, hanshake,
application. qc_prep_pkts() adds a PADDING frame to the last built packet
for the last encryption level in the list. But after application level encryption,
there is early-data encryption level. This prevented qc_prep_pkts() to build
a padded applicaiton level last packet to send a 1200-bytes datagram.

To fix this, always insert early-data encryption level after the initial
encryption level into the encryption levels list when initializing this encryption
level from quic_conn_enc_level_init().

Must be backported as far as 2.9.

BUG/MEDIUM: peer: Notify the applet won't consume data when it waits for sync

When the peer applet is waiting for a synchronisation with the global sync
task, we must notify it won't consume data. Otherwise, if some data are
already waiting in the input buffer, the applet will be woken up in loop and
this wil trigger the watchdog. Once synchronized, the applet is woken up. In
that case, the peer applet must indicate it is going to consume data again.

This patch should fix the issue #2656. It must be backported to 3.0.

BUG/MEDIUM: mux-h2: Propagate term flags to SE on error in h2s_wake_one_stream

When a stream is explicitly woken up by the H2 conneciton, if an error
condition is detected, the corresponding error flag is set on the SE. So
SE_FL_ERROR or SE_FL_ERR_PENDING, depending if the end of stream was
reported or not.

However, there is no attempt to propagate other termination flags. We must
be sure to properly set SE_FL_EOI and SE_FL_EOS when appropriate to be able
to switch a pending error to a fatal error.

Because of this bug, the SE remains with a pending error and no end of
stream, preventing the applicative stream to trully abort it. It means on
some abort scenario, it is possible to block a stream infinitely.

This patch must be backported at least as far as 2.8. No bug was observed on
older versions while the same code is inuse.

BUG/MEDIUM: h2: Only report early HTX EOM for tunneled streams

For regular H2 messages, the HTX EOM flag is synonymous the end of input. So
SE_FL_EOI flag must also be set on the stream-endpoint descriptor. However,
there is an exception. For tunneled streams, the end of message is reported
on the HTX message just after the headers. But in that case, no end of input
is reported on the SE.

But here, there is a bug. The "early" EOM is also report on the HTX messages
when there is no payload (for instance a content-length set to 0). If there
is no ES flag on the H2 HEADERS frame, it is an unexpected case. Because for
the applicative stream and most probably for the opposite endpoint, the
message is considered as finihsed. It is switched in its DONE state (or the
equivalent on the endpoint). But, if an extra H2 frame with the ES flag is
received, a TRAILERS frame or an emtpy DATA frame, an extra EOT HTX block is
pushed to carry the HTX EOM flag. So an extra HTX block is emitted for a
regular HTX message. It is totally invalid, it must never happen.

Because it is an undefined behavior, it is difficult to predict the result.
But it definitly prevent the applicative stream to properly handle aborts
and errors because data remain blocked in the channel buffer. Indeed, the
end of the message was seen, so no more data are forwarded.

It seems to be an issue for 2.8 and upper. Harder to evaluate for older
versions.

This patch must be backported as far as 2.4.

BUG/MEDIUM: http-ana: Report error on write error waiting for the response

When we are waiting for the server response, if an error is pending on the
frontend side (a write error on client), it is handled as an abort and all
regular response analyzers are removed, except the one responsible to
release the filters, if any. However, while it is handled as an abort, the
error is not reported, as usual, via http_reply_and_close() function. It is
an issue because in that, the channels buffers are not reset.

Because of this bug, it is possible to block a stream infinitely. The
request side is waiting for the response side and the response side is
blocked because filters must be released and this cannot be done because
data remain blocked in channels buffers.

So, in that case, calling http_reply_and_close() with no message is enough
to unblock the stream.

This patch must be backported as far as 2.8.

BUG/MINOR: h2: reject extended connect for h2c protocol

This commit prevents forwarding of an HTTP/2 Extended CONNECT when "h2c"
or "h2" token is set as targetted protocol. Contrary to the previous
commit which deals with HTTP/1 mux, this time the request is rejected
and a RESET_STREAM is reported to the client.

This must be backported up to 2.4 after a period of observation.

BUG/MINOR: h1: do not forward h2c upgrade header token

haproxy supports tunnel establishment through HTTP Upgrade mechanism.
Since the following commit, extended CONNECT is also supported for
HTTP/2 both on frontend and backend side.

commit 9bf957335e2c385b74901481f7a89c9565dfce53
MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade

As specified by HTTP/2 rfc, "h2c" can be used by an HTTP/1.1 client to
request an upgrade to HTTP/2. In haproxy, this is not supported so it
silently ignores this. However, Connection and Upgrade headers are
forwarded as-is on the backend side.

If using HTTP/1 on the backend side and the server supports this upgrade
mechanism, haproxy won't be able to parse the HTTP response. If using
HTTP/2, mux backend tries to incorrectly convert the request to an
Extended CONNECT with h2c protocol, which may also prevent the response
to be transmitted.

To fix this, flag HTTP/1 request with "h2c" or "h2" token in an upgrade
header. On converting the header list to HTX, the upgrade header is
skipped if any of this token is present and the H1_MF_CONN_UPG flag is
removed.

This issue can easily be reproduced using curl --http2 argument to
connect to an HTTP/1 frontend.

This must be backported up to 2.4 after a period of observation.

BUG/MIONR: quic: fix fc_lost

Control layer callback get_info has recently been implemented for QUIC.
However, fc_lost always returned 0. This is because quic_get_info() does
not use the correct input argument value to identify lost value.

This does not need to be backported.

BUG/MINOR: quic: fix fc_rtt/srtt values

QUIC has recently implement get_info callback to return RTT/sRTT values.
However, it uses milliseconds, contrary to TCP which uses microseconds.
This cause smp fetch functions to return invalid values. Fix this by
converting QUIC values to microseconds.

This does not need to be backported.

MINOR: flags/mux-quic: decode qcc and qcs flags

Decode QUIC MUX connection and stream elements via qcc_show_flags() and
qcs_show_flags(). Flags definition have been moved outside of USE_QUIC
to ease compilation of flags binary.

MINOR: quic: Define ->get_info() control layer callback for QUIC

This low level callback may be called by several sample fetches for
frontend connections like "fc_rtt", "fc_rttvar" etc.
Define this callback for QUIC protocol as pointer to quic_get_info().
This latter supports these sample fetches:
"fc_lost", "fc_reordering", "fc_rtt" and "fc_rttvar".

Update the documentation consequently.

MINOR: tcp_sample: Move TCP low level sample fetch function to control layer

Add ->get_info() new control layer callback definition to protocol struct to
retreive statiscal counters information at transport layer (TCPv4/TCPv6) identified by
an integer into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.

BUG/MEDIUM: quic: prevent conn freeze on 0RTT undeciphered content

Received QUIC packets are stored in quic_conn Rx buffer after header
protection removal in qc_rx_pkt_handle(). These packets are then removed
after quic_conn IO handler via qc_treat_rx_pkts().

If HP cannot be removed, packets are still copied into quic_conn Rx
buffer. This can happen if encryption level TLS keys are not yet
available. The packet remains in the buffer until HP can be removed and
its content processed.

An issue occurs if client emits a 0-RTT packet but haproxy does not have
the shared secret, for example after a haproxy process restart. In this
case, the packet is copied in quic_conn Rx buffer but its HP won't ever
be removed. This prevents the buffer to be purged. After some time, if
the client has emitted enough packets, Rx buffer won't have any space
left and received packets are dropped. This will cause the connection to
freeze.

To fix this, remove any 0-RTT buffered packets on handshake completion.
At this stage, 0-RTT packets are unnecessary anymore. The client is
expected to reemit its content in 1-RTT packet which are properly
deciphered.

This can easily reproduce with HTTP/3 POST requests or retrieving a big
enough object, which will fill the Rx buffer with ACK frames. Here is a
picoquic command to provoke the issue on haproxy startup :

$ picoquicdemo -Q -v 00000001 -a h3 <hostname> 20443 "/?s=1g"

Note that allow-0rtt must be present on the bind line to trigger the
issue. Else haproxy will reject any 0-RTT packets.

This must be backported up to 2.6.

This could be one of the reason for github issue #2549 but it's unsure
for now.

BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC

Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the code source, so we
won't need to set USE_OPENSSL_AWSLC in the Makefile on the long term.

BUG/MEDIUM: ssl: 0-RTT initialized at the wrong place for AWS-LC

Revert patch fcc8255 "MINOR: ssl_sock: Early data disabled during
SSL_CTX switching (aws-lc)". The patch was done in the wrong callback
which is never built for AWS-LC, and applies options on the SSL_CTX
instead of the SSL, which should never be done elsewhere than in the
configuration parsing.

This was probably triggered by successfully linking haproxy against
AWS-LC without using USE_OPENSSL_AWSLC.

The patch also reintroduced SSL_CTX_set_early_data_enabled() in the
ssl_quic_initial_ctx() and ssl_sock_initial_ctx(). So the initial_ctx
does have the right setting, but it still needs to be applied to the
selected SSL_CTX in the clienthello, because we need it on the selected
SSL_CTX.

Must be backported to 3.0. (ssl_clienthello.c part was in ssl_sock.c)

BUG/MEDIUM: ssl: reactivate 0-RTT for AWS-LC

Then reactivate HAVE_SSL_0RTT and HAVE_SSL_0RTT_QUIC for AWS-LC, which
were wrongly deactivated in f5353f2c ("MINOR: ssl: add HAVE_SSL_0RTT
constant").

Must be backported to 3.0.

BUG/MINOR: stconn: bs.id and fs.id had their dependencies incorrect

The backend depends on the response and the frontend on the request, not
the other way around. In addition, they used to depend on L6 (hence
contents in the channel buffers) while they should only depend on L5
(permanent info known in the mux).

This came in 2.9 with commit 24059615a7 ("MINOR: Add sample fetches to
get the frontend and backend stream ID") so this can be backported there.

(cherry picked from commit 61dd0156c82ea051779e6524cad403871c31fc5a)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUILD: mux-pt: Use the right name for the sedesc variable

A typo was introduced in 760d26a86 ("BUG/MEDIUM: mux-pt/mux-h1: Release the
pipe on connection error on sending path"). The sedesc variable is 'sd', not
'se'.

This patch must be backported with the commit above.

BUG/MEDIUM: mux-pt/mux-h1: Release the pipe on connection error on sending path

When data are sent using the kernel splicing, if a connection error
occurred, the pipe must be released. Indeed, in that case, no more data can
be sent and there is no reason to not release the pipe. But it is in fact an
issue for the stream because the channel will appear are not empty. This may
prevent the stream to be released. This happens on 2.8 when a filter is also
attached on it. On 2.9 and upper, it seems there is not issue. But it is
hard to be sure and the current patch remains valid is all cases. On 2.6 and
lower, the code is not the same and, AFAIK, there is no issue.

This patch must be backported to 2.8. However, on 2.8, there is no zero-copy
data forwarding. The patch must be adapted. There is no done_ff/resume_ff
callback functions for muxes. The pipe must released in sc_conn_send() when
an error flag is set on the SE, after the call to snd_pipe callback
function.

BUG/MEDIUM: stconn: Report error on SC on send if a previous SE error was set

When a send on a connection is performed, if a SE error (or a pending error)
was already reported earlier, we leave immediately. No send is performed.
However, we must be sure to report the error at the SC level if necessary.
Indeed, the SE error may have been reported during the zero-copy data
forwarding. So during receive on the opposite side. In that case, we may
have missed the opportunity to report it at the SC level.

The patch must be backported as far as 2.8.

DOC: config: Add documentation about spop mode for backends

The SPOE was refactored. Now backends referenced by a SPOE filter must use
the spop mode to be able to use the spop multiplexer for server connections.
The "spop" mode was added in the list of supported mode for backends.

BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()

After checking that a server or backend is full, it remains possible
to call pendconn_add() just after the last pending requests finishes, so
that there's no more connection on the server for very low maxconn (typ 1),
leaving new ones in queue till the timeout.

The approach depends on where the request was queued, though:
  - when queued on a server, we can simply detect that we may dequeue
    pending requests and wake them up, it will wake our request and
    that's fine. This needs to be done in srv_redispatch_connect() when
    the server is set.

  - when queued on a backend, it means that all servers are done with
    their requests. It means that all servers were full before the
    check and all were empty after. In practice this will only concern
    configs with less servers than threads. It's where the issue was
    first spotted, and it's very hard to reproduce with more than one
    server. In this case we need to load-balance again in order to find
    a spare server (or even to fail). For this, we call the newly added
    dedicated function pendconn_must_try_again() that tells whether or
    not a blocked pending request was dequeued and needs to be retried.

This should be backported along with pendconn_must_try_again() to all
stable versions, but with extreme care because over time the queue's
locking evolved.

MINOR: queue: add a function to check for TOCTOU after queueing

There's a rare TOCTOU case that happens from time to time with maxconn 1
and multiple threads. Between the moment we see the queue full and the
moment we queue a request, it's possible that the last request on the
server or proxy ended and that no other one is left to offer it its place.

Given that all this code path is performance-critical and we cannot afford
to increase the lock duration, better recheck for the condition after
queueing. For this we need to be able to check for the condition and
cleanly dequeue a request. That's what this patch provides via the new
function pendconn_must_try_again(). It will catch more requests than
absolutely needed though it will catch them all. It may find that around
1/1000 of requests are at risk, though testing shows that in practice,
it's around 1 per million that really gets stuck (other ones benefit
from timing and finishing late requests). Maybe in the future some
conditions might be refined but it's harmless.

What happens to such requests is that they're dequeued and their pendconn
freed, so that the caller can decide to try to LB or queue them again. For
now the function is not used, it's just added separately for easier tracking.

BUILD: cfgparse-quic: fix build error on Solaris due to missing netinet/in.h

Since commit 35470d518 ("MINOR: quic: activate UDP GSO for QUIC if
supported"), Solaris build fails due to netinet/udp.h being included
without netinet/in.h. Adding it is sufficient to fix the problem. No
backport is needed.

BUG/MEDIUM: jwt: Clear SSL error queue on error when checking the signature

When the signature included in a JWT is verified, if an error occurred, one
or more SSL errors are queued and never cleared. These errors may be then
caught by the SSL stack and a fatal SSL error may be erroneously reported
during a SSL received or send.

So we must take care to clear the SSL error queue when the signature
verification failed.

This patch should fix issue #2643. It must be backported as far as 2.6.

MINOR: quic: Dump TX in flight bytes vs window values ratio.

Display the ratio of the numbers of bytes in flight by packet number spaces
versus the current window values in percent.

MINOR: quic: Add information to "show quic" for CUBIC cc.

Add ->state_cli() new callback to quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some information
about the congestion algorithm internal state currently in use by the QUIC
connections.

Implement this callback for CUBIC algorithm to dump its internal variables:
   - K: (the time to reach the cubic curve inflexion point),
   - last_w_max: the last maximum window value reached before intering
     the last recovery period. This is also the window value at the
     inflexion point of the cubic curve,
   - wdiff: the difference between the current window value and last_w_max.
     So negative before the inflexion point, and positive after.

MEDIUM: h1: allow to preserve keep-alive on T-E + C-L

In 2.5-dev9, commit 631c7e866 ("MEDIUM: h1: Force close mode for invalid
uses of T-E header") enforced a recently arrived new security rule in the
HTTP specification aiming at preventing a class of content-smuggling
attacks involving HTTP/1.0 agents. It consists in handling the very rare
T-E + C-L requests or responses in close mode.

It happens it does have an impact of a rare few and very old clients
(probably running insecure TLS stacks by the way) that continue to send
both with their POST requests. The impact is that for each and every
request they'll have to reconnect, possibly negotiating a full TLS
handshake that becomes harmful to the machine in terms of CPU computation.

This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding"
that does exactly what it says, it just asks not to close on such messages,
even though the message continues to be sanitized and C-L dropped. It means
that the risk is only between the sender and haproxy, which is limited, and
might be the only acceptable solution for such environments having to deal
with broken implementations.

The cases are so rare that it should not need to be backported, or in the
worst case, to the latest LTS if there is any demand.

BUG/MEDIUM: quic: fix invalid conn reject with CONNECTION_REFUSED

quic-initial rules were implemented just recently. For some actions, a
new flags field was added in quic_dgram structure. This is used to
report the result of the rules execution.

However, this flags field was left uninitialized. Depending on its
value, it may close the connection to be wrongly rejected via
CONNECTION_REFUSED. Fix this by properly set flags value to 0.

No need to backport.

MINOR: quic: implement send-retry quic-initial rules

Define a new quic-initial "send-retry" rule. This allows to force the
emission of a Retry packet on an initial without token instead of
instantiating a new QUIC connection.

MINOR: quic: implement reject quic-initial action

Define a new quic-initial action named "reject". Contrary to dgram-drop,
the client is notified of the rejection by a CONNECTION_CLOSE with
CONNECTION_REFUSED error code.

To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is
instantiated, contrary to dgram-drop action. quic_set_connection_close()
is called immediatly after qc_new_conn() which prevents the handshake
startup.

MINOR: quic: pass quic_dgram as obj_type for quic-initial rules

To extend quic-initial rules, pass quic_dgram instance to argument for
the various actions. As such, quic_dgram is now supported as an obj_type
and can be used in session origin field.

MINOR: quic: support ACL for quic-initial rules

Add ACL condition support for quic-initial rules. This requires the
extension of quic_parse_quic_initial() to parse an extra if/unless
block.

Only layer4 client samples are allowed to be used with quic-initial
rules. However, due to the early execution of quic-initial rules prior
to any connection instantiation, some samples are non supported.

To be able to use the 4 described samples, a dummy session is
instantiated before quic-initial rules execution. Its src and dst fields
are set from the received datagram values.

MEDIUM: quic: implement quic-initial rules

Implement a new set of rules labelled as quic-initial.

These rules as specific to QUIC. They are scheduled to be executed early
on Initial packet parsing, prior a new QUIC connection instantiation.
Contrary to tcp-request connection, this allows to reject traffic
earlier, most notably by avoiding unnecessary QUIC SSL handshake
processing.

A new module quic_rules is created. Its main function
quic_init_exec_rules() is called on Initial packet parsing in function
quic_rx_pkt_retrieve_conn().

For the moment, only "accept" and "dgram-drop" are valid actions. Both
are final. The latter drops silently the Initial packet instead of
allocating a new QUIC connection.

MINOR: quic: delay Retry emission on quic-force-retry

Currently, quic Retry packets are emitted for two different reasons
after processing an Initial without token :
- quic-force-retry is set on bind-line
- an abnormal number of half-open connection is currently detected

Previously, these two conditions were checked separately in different
functions during datagram parsing. Uniformize this by moving
quic-force-retry check in quic_rx_pkt_retrieve_conn() along the second
condition check.

The purpose of this patch is to uniformize datagram parsing stages. It
is necessary to implement quic-initial rules in
quic_rx_pkt_retrieve_conn() prior to any Retry emission. This prevents
to emit unnecessary Retry if an Initial is subject to a reject rule.

MEDIUM: sink: assume sft appctx stickiness

As mentioned in b40d804 ("MINOR: sink: add some comments about sft->appctx
usage in applet handlers"), there are few places in the code where it
looks like we assumed that the applet callbacks such as
sink_forward_session_init() or sink_forward_io_handler() could be
executing an appctx whose sft is detached from the appctx
(appctx != sft->appctx).

In practise this should not be happening since an appctx sticks to the
same thread its entire lifetime, and the only times sft->appctx is
effectively assigned is during the session/appctx creation (in
process_sink_forward()) or release.

Thus if sft->appctx wouldn't point to the appctx that the sft was bound
to after appctx creation, it would probably indicate a bug rather than
an expected condition. To further emphasize that and prevent the
confusion, and since 3.1-dev4 was released, let's remove such checks and
instead add a BUG_ON to ensure this never happens.

In _sink_forward_io_handler(), the "hard_close" label was removed since
there are no more uses for it (no hard errors may be caught from the
function for now)

MEDIUM: quic: implement CHACHA20_POLY1305 for AWS-LC

With AWS-LC, the aead part is covered by the EVP_AEAD API which
provides the correct EVP_aead_chacha20_poly1305(), however for header
protection it does not provides an EVP_CIPHER for chacha20.

This patch implements exceptions in the header protection code and use
EVP_CIPHER_CHACHA20 and EVP_CIPHER_CTX_CHACHA20 placeholders so we can
use the CRYPTO_chacha_20() primitive manually instead of the EVP_CIPHER
API.

This requires to check if we are using EVP_CIPHER_CTX_CHACHA20 when
doing EVP_CIPHER_CTX_free().

MEDIUM: quic: add key argument to header protection crypto functions

In order to prepare the code for using Chacha20 with the EVP_AEAD API,
both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key
argument.

Indeed Chacha20 does not exists as an EVP_CIPHER in AWS-LC, so the key
won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter
to use it.

MINOR: quic: rename confusing wording aes to hp

Some of the crypto functions used for headers protection in QUIC are
named with an "aes" name even thought they are not used for AES
encryption only.

This patch renames these "aes" to "hp" so it is clearer.

MEDIUM: ssl/quic: implement quic crypto with EVP_AEAD

The QUIC crypto is using the EVP_CIPHER API in order to achieve
authenticated encryption, this was the API which was used with OpenSSL.
With libraries that inspires from BoringSSL (libreSSL and AWS-LC), the
AEAD algorithms are implemented using the EVP_AEAD API.

This patch converts the call to the EVP_CIPHER API when called in the
contex of AEAD cryptography for QUIC.

The patch defines some QUIC_AEAD macros that can be either EVP_CIPHER or
EVP_AEAD depending on the library.

This was mainly done for AWS-LC but this could be useful for other
libraries. This should finally allow to use CHACHA20_POLY1305 with
AWS-LC.

This patch allows to use the following ciphers with the EVP_AEAD API:
- TLS1_3_CK_AES_128_GCM_SHA256
- TLS1_3_CK_AES_256_GCM_SHA384

AWS-LC does not implement TLS1_3_CK_AES_128_CCM_SHA256 and
TLS1_3_CK_CHACHA20_POLY1305_SHA256 requires some hack for headers
protection which will come in another patch.

BUG/MINOR: quic: Lack of precision when computing K (cubic only cc)

K cubic variable is stored in ms. But it was a formula with the second as unit
for the window difference parameter which was used to compute K without
considering the loss of information. Then the result was converted in ms (K *= 1000).
This leaded to a lack of precision and multiples of 1000 as values.

To fix this, use the same formula but with the window difference in ms as parameter
passed to the cubic function and remove the conversion.

Must be backported as far as 2.6.

[RELEASE] Released version 3.1-dev4

Released version 3.1-dev4 with the following main changes :
    - MINOR: limits: prepare to keep limits in one place
    - REORG: fd: move raise_rlim_nofile to limits
    - CLEANUP: fd: rm struct rlimit definition
    - REORG: global: move rlim_fd_*_at_boot in limits
    - MINOR: haproxy: prepare to move limits-related code
    - REORG: haproxy: move limits handlers to limits
    - MINOR: limits: add is_any_limit_configured
    - CLEANUP: quic: remove obsolete comment on send
    - MINOR: quic: extend detection of UDP API OS features
    - MINOR: quic: activate UDP GSO for QUIC if supported
    - MINOR: quic: define quic_cc_path MTU as constant
    - MINOR: quic: add GSO parameter on quic_sock send API
    - MAJOR: quic: support GSO when encoding datagrams
    - MEDIUM: quic: implement GSO fallback mechanism
    - MINOR: quic: add counters of sent bytes with and without GSO
    - BUG/MEDIUM: bwlim: Be sure to never set the analyze expiration date in past
    - CLEANUP: proto: rename TID affinity callbacks
    - CLEANUP: quic: rename TID affinity elements
    - BUG/MINOR: limits: fix license type in limits.h
    - BUG/MINOR: session: Eval L4/L5 rules defined in the default section
    - CLEANUP: stconn: Fix a typo in comments for SE_ABRT_SRC_*
    - MEDIUM: spoe: Remove fragmentation support
    - MEDIUM: spoe: Remove async mode support
    - MINOR: spoe: Use only a global engine-id per agent
    - MINOR: spoe: Remove debugging
    - MAJOR: spoe: Remove idle applets and pipelining support
    - MINOR: spoe: Remove the dedicated SPOE applet task
    - MEDIUM: proxy/spoe: Add a SPOP mode
    - MEDIUM: applet: Add a .shut callback function for applets
    - MINOR: connection: No longer include stconn type header in connection-t.h
    - MINOR: stconn: Use a dedicated function to get the opposite sedesc
    - MINOR: spoe: Rename some flags and constant to use SPOP prefix
    - MINOR: spoe: Dynamically alloc the message list per event of an agent
    - MINOR: spoe: Move all stuff regarding the filter/applet in the C file
    - MINOR: spoe: Move spoe_str_to_vsn() into the header file
    - MEDIUM: mux-spop: Introduce the SPOP multiplexer
    - MEDIUM: check/spoe: Use SPOP multiplexer to perform SPOP health-checks
    - MAJOR: spoe: Rewrite SPOE applet to use the SPOP mux
    - CLEANUP: spoe: Uniformize function definitions
    - MINOR: spoe: Add internal sample fetch to retrieve the SPOE engine ID
    - MEDIUM: spoe: Set a specific name for the connection pool of SPOP servers
    - MINOR: backend: Remove test on HTX streams to reuse idle connections on connect
    - MEDIUM: spoe: Force the reuse 'always' mode for SPOP backends
    - MINOR: mux-spop: Use a dedicated function to update the SPOP connection timeout
    - MAJOR: mux-spop: Make the SPOP connections reusable
    - MINOR: stats-html: Display reuse ratio for spop connections
    - MEDIUM: spoe: Directly xfer NOTIFY frame when SPOE applet is created
    - MEDIUM: spoe: Directly receive ACK frame in the SPOE context buffer
    - MEDIUM: mux-spop/spoe: Save negociated max-frame-size value in the mux
    - MINOR: spoe: Remove the spop version from the SPOE appctx context
    - MEDIUM: mux-spop: Add checks on received frames
    - MEDIUM: mux-spop: Announce the pipeling support if possible
    - MEDIUM: spoe: Forward SPOE context error to the SPOE applet
    - MEDIUM: spoe: Make the SPOE applet use its own buffers
    - DOC: spoe: Update SPOE documentation to reflect recent refactoring
    - BUILD: mux-spop: fix build failure on gcc 4-10 and clang
    - MINOR: fd: don't scan the full fdtab on all threads
    - MINOR: server: better mt_list usage for node migration (prev_deleted handling)
    - BUG/MINOR: do not close uninit FD in quic_test_socketops()
    - BUG/MEDIUM: debug/cli: fix "show threads" crashing with low thread counts
    - MINOR: debug: prepare feed_post_mortem_late
    - CLEANUP: debug: fix indents in debug_parse_cli_show_dev
    - MINOR: debug: store runtime uid/gid in postmortem
    - MINOR: debug: keep runtime capabilities in post_mortem
    - MINOR: debug: use LIM2A to show limits
    - MINOR: debug: prepare to show runtime limits
    - MINOR: debug: keep runtime limits in postmortem
    - DOC: install: don't reference removed CPU arg
    - BUG/MEDIUM: ssl_sock: fix deadlock in ssl_sock_load_ocsp() on error path
    - BUG/MAJOR: mux-h2: force a hard error upon short read with pending error
    - MEDIUM: sink: start applets asynchronously
    - OPTIM: sink: balance applets accross threads
    - MEDIUM: ocsp: fix ocsp when the chain is loaded from 'issuers-chain-path'
    - MEDIUM: ssl: add extra_chain to ckch_data
    - MINOR: ssl: change issuers-chain for show_cert_detail()
    - REGTESTS: ssl: test the issuers-chain-path keyword
    - DOC: configuration: issuers-chain-path not compatible with OCSP
    - DOC: configuration: issuers-chain-path is compatible with OCSP
    - BUG/MEDIUM: startup: fix zero-warning mode
    - BUILD: tree-wide: cast arguments to tolower/toupper to unsigned char (2)
    - MINOR: cfgparse-global: move mode's keywords in cfg_kw_list
    - MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list
    - DOC: config: improve the http-keep-alive section
    - BUG/MINOR: stick-table: fix crash for src_inc_gpc() without stkcounter
    - BUG/MINOR: server: Don't warn fallback IP is used during init-addr resolution
    - BUG/MINOR: cli: Atomically inc the global request counter between CLI commands
    - MINOR: stream: Add a pointer to set the parent stream
    - MINOR: vars: Fill a description instead of hash and scope when a name is parsed
    - MINOR: vars: Use a description to set/unset a variable instead of its hash and scope
    - MEDIUM: vars: Be able to parse parent scopes for variables
    - MINOR: vars: Use a variable description to get variables of a specific scope
    - MEDIUM: vars: Be able to retrieve variable of the parent stream, if any
    - MEDIUM: spoe: Set the parent stream for SPOE streams
    - BUG/MINOR: quic: Non optimal first datagram.
    - DOC: config: Add a dedicated section about variables
    - DOC: config: Add info about variable scopes referencing the parent stream
    - DOC: config: Explicitly state the SPOE streams have a usable parent stream
    - MINOR: quic: Avoid cc priv buffer overflow.
    - MINOR: spoe: Add a function to validate a version is supported
    - MINOR: spoe: export the list of SPOP error reasons
    - MEDIUM: spoe/tcpcheck: Reintroduce SPOP check as a customized tcp-check
    - REGTESTS: check/spoe: Re-enable the script performing SPOP health-checks
    - BUG/MEDIUM: sink: properly init applet under sft lock
    - MINOR: sink: unify and sink_forward_io_handler() and sink_forward_oc_io_handler()
    - MINOR: sink: Remove useless test on SE_FL_SHR/SHW flags
    - MINOR: sink: merge sink_forward_io_handler() with sink_forward_oc_io_handler()
    - MINOR: sink: add some comments about sft->appctx usage in applet handlers
    - MINOR: sink: distinguish between hard and soft close in _sink_forward_io_handler()
    - MEDIUM: sink: don't set NOLINGER flag on the outgoing stream interface
    - MINOR: ring: count processed messages in ring_dispatch_messages()
    - MINOR: sink: add processed events counter in sft
    - MEDIUM: sink: "max-reuse" support for sink servers
    - OPTIM: sink: consider threads' current load when rebalancing applets

OPTIM: sink: consider threads' current load when rebalancing applets

In c454296f0 ("OPTIM: sink: balance applets accross threads"), we already
made sure to balance applets accross threads by picking a random thread
to spawn the new applet.

Also, thanks to the previous commit, we also have the ability to destroy
the applet when a certain amount of messages were processed to help
distribute the load during runtime.

Let's improve that by trying up to 3 different threads in the hope to pick
a non-overloaded one in the best scenario, and the least over loaded one
in the worst case. This should help to better distribute the load over
multiple threads when high loads are expected.

Logic was greatly inspired from thread migration logic used by server
health checks, but it was simpliflied for sink's use case.

MEDIUM: sink: "max-reuse" support for sink servers

Thanks to the previous commit, it is now possible to know how many events
were processed for a given sft/server sink pair. As mentioned in commit
c454296 ("OPTIM: sink: balance applets accross threads"), let's provide
the ability to restart a server connection when a certain amount of events
were processed to help better balance the load over multiple threads.

For this, we make use the of "max-reuse" server keyword which was only
relevant under "http" context so far. Under sink context, "max-reuse"
corresponds to the number of times the tcp connection can be reused
for sending messages, which in fact means that "max-reuse + 1" is the
number of events (ie: messages) that are allowed to be sent using the
same tcp server connection: when this threshold is met, the connection
will be destroyed and a new one will be created on a random thread.
The value is not strict: it is the minimum value above which the
connection may be destroyed since the value is checked after
ring_dispatch_messages() which may process multiple messages at once.

By default, no limit is enforced (the connection will be reused for as
long as it is available).

The documentation was updated accordingly.

MINOR: sink: add processed events counter in sft

Add a new struct member to sft structure named e_processed in order to
track the total number of events processed by sft applets.

sink_forward_oc_io_handler() and sink_forward_io_handler() now make use
of ring_dispatch_messages() optional value added in the previous commit
in order to increase the number of processed events.

MINOR: ring: count processed messages in ring_dispatch_messages()

ring_dispatch_messages() now takes an optional argument <processed> which
must point to a size_t counter when provided.

When provided, the value is updated to the number of messages processed
by the function.

MEDIUM: sink: don't set NOLINGER flag on the outgoing stream interface

Given that sink applets are responsible for conveying messages from the
ring to the tcp server endpoint, there are no protocol timeout or errors
expected there, it is an unidirectional flow of data over TCP.

As such, NOLINGER flag which was inherited from peers applet, see
dbd026792 ("BUG/MEDIUM: peers: set NOLINGER on the outgoing stream
interface") is not desirable under sink context:

The reason why we have the NOLINGER flag set is to ensure the connection
is closed right away and avoid 60s TIME_WAIT delay on closed sockets.
The downside is that messages sent right before closing the socket are
not guaranteed to make it to the server because closing with NOLINGER
flag set will result in RST packet being emitted right away, which could
prevent in-flight messages from being properly delivered.

Unlike peers applets, the only cases were sink applets are expected to
close the connection are upon unexpected error or upon stopping, which are
relatively rare events. Thanks to previous commit, ERROR flag is already
set in case of error, so the use of NOLINGER is not mandatory for the
RST to be sent. Now for the stopping case, it only happens once in the
process lifetime so it's acceptable to close the socket using EOS+EOI
flags without the NOLINGER option set.

So in our case, it is preferable to ensure messages get properly delivered
knowning that closed sockets should be piling up in TIME_WAIT, this means
removing the NOLINGER flag on the outgoing stream interface for sink
applets. It is a prerequisite for upcoming patches in order to cleanly
shut the applet during runtime without risking to send the RST packet
before all pending messages were sent to the endpoint.

MINOR: sink: distinguish between hard and soft close in _sink_forward_io_handler()

Aborting the socket on soft-stop is not the same as aborting it due to
unexpected error. As such, let's leverage the granularity offered by
sedesc flags to better reflect the situation: abort during soft-stop is
handled as a soft close thanks to EOI+EOS flags, while abort due to
unexpected error is handled as hard error thanks to ERROR+EOS flags.

Thanks to this change, hard error will always emit RST packet even if
the NOLINGER option wasn't set on the socket.

MINOR: sink: add some comments about sft->appctx usage in applet handlers

There seem to be an ambiguity in the code where sft->appctx would differ
from the appctx that was assigned to it upon appctx creation.

In practise, it doesn't seem this could be happening. Adding a few notes
to come back to this later and try to see if we can remove this
ambiguity.