This patch extends the documentation for "limited-quic" global keyword.
It mentions first that it relies on USE_QUIC_OPENSSL_COMPAT=1 build
option.
Compatibility with TLS libraries is now clearly exposed. In particular,
it highlights the fact that it is mostly targeted at OpenSSL versions
prior to 3.5.2, and that it should be disabled if a recent OpenSSL
release is available. It also states that limited-quic does nothing if
USE_QUIC_OPENSSL_COMPAT is not set during compilation.
MINOR: quic: display build warning for compat layer on recent OpenSSL
Build option USE_QUIC_OPENSSL_COMPAT=1 must be set to activate QUIC
support for OpenSSL prior to version 3.5.2. This compiles an internal
compatibility layer, which must be then activated at runtime with global
option limited-quic.
Starting from OpenSSL version 3.5.2, a proper QUIC TLS API is now
exposed. Thus, the compatibility layer is unneeded. However it can still
be compiled against newer OpenSSL releases and activated at runtime,
mostly for test purpose.
As this compatibility layer has some limitations (no support for QUIC
0-RTT), it's important that users notice this situation and disable it
if possible. Thus, this patch adds a notice warning when
USE_QUIC_OPENSSL_COMPAT=1 is set when building against OpenSSL 3.5.2 and
above. This should be sufficient for users and packagers to understand
that this option is not necessary anymore.
Note that USE_QUIC_OPENSSL_COMPAT=1 is incompatible with other TLS
libraries which expose a QUIC API based on the original BoringSSL patch
set. A build error will prevent the compatibility layer from being built.
The limited-quic option is thus silently ignored.
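The version cutoff described above can be pictured as a small predicate. This is only a sketch: quic_compat_is_redundant() is a hypothetical name, not a function from the patch, and the real notice is emitted at build time rather than computed at run time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not HAProxy code: returns true when the OpenSSL
 * version (major.minor.patch) is 3.5.2 or above, i.e. when the native
 * QUIC TLS API is available and the compatibility layer is unneeded.
 */
bool quic_compat_is_redundant(int major, int minor, int patch)
{
	assert(major >= 0 && minor >= 0 && patch >= 0);

	if (major != 3)
		return major > 3;
	if (minor != 5)
		return minor > 5;
	return patch >= 2;
}
```

A build system could use the same comparison to decide whether to print the notice when USE_QUIC_OPENSSL_COMPAT=1 is requested.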
MINOR: quic-be: make SSL/QUIC objects use their own indexes (ssl_qc_app_data_index)
This index is used to retrieve the quic_conn object from its SSL object, the same
way the connection is retrieved from its SSL object for SSL/TCP connections.
This patch implements two helper functions to avoid the ugly code with such blocks:
#ifdef USE_QUIC
else if (qc) { .. }
#endif
Implement ssl_sock_get_listener() to return the listener from an SSL object.
Implement ssl_sock_get_conn() to return the connection from an SSL object
and optionally a pointer to the ssl_sock_ctx struct attached to the connections
or the quic_conns.
Use these functions where applicable:
- ssl_tlsext_ticket_key_cb() calls ssl_sock_get_listener()
- ssl_sock_infocbk() calls ssl_sock_get_conn()
- ssl_sock_msgcbk() calls ssl_sock_get_ssl_conn()
- ssl_sess_new_srv_cb() calls ssl_sock_get_conn()
- ssl_sock_srv_verifycbk() calls ssl_sock_get_conn()
Also modify qc_ssl_sess_init() to initialize the ssl_qc_app_data_index index for
the QUIC backends.
MINOR: quic: get rid of ->target quic_conn struct member
The ->li (struct listener *) member of quic_conn struct was replaced by a
->target (struct obj_type *) member by this commit:
MINOR: quic-be: get rid of ->li quic_conn member
to abstract the connection type (front or back) when implementing QUIC for the
backends. In these cases, ->target was a pointer to the obj_type of a server
struct. This could not work with the dynamic servers contrary to the listeners
which are not dynamic.
This patch almost reverts the one mentioned above. ->target pointer to obj_type member
is replaced by ->li pointer to listener struct member. As listeners are not
dynamic, this is easy to do. All one has to do is to replace the
objt_listener(qc->target) statement by qc->li where applicable.
For the backend connection, when needed, this is always qc->conn->target which is
used only when qc->conn is initialized. The only "problematic" case is for
quic_dgram_parse() which takes a pointer to an obj_type as third argument.
But this obj_type is only used to call quic_rx_pkt_parse(). Inside this function
it is used to access the proxy counters of the connection thanks to qc_counters().
So, this obj_type argument may be null from now on with this patch. This is the
reason why qc_counters() is modified to take this into consideration.
BUG/MAJOR: stream: Force channel analysis on successful synchronous send
This patch reverts commit a498e527b ("BUG/MAJOR: stream: Remove READ/WRITE
events on channels after analysers eval") because of a regression. It was an
attempt to properly detect synchronous sends, even when the stream was woken
up on a write event. However, the fix was wrong because it could mask
shutdowns performed during process_stream() and block the stream.
Indeed, when a shutdown is performed, because an error occurred for
instance, a write event is reported. The commit above could mask this event
while the shutdown prevents any synchronous sends. In such case, the stream
could remain blocked indefinitely because an I/O event was missed.
So to properly fix the original issue (#3070), the write event must not be
masked before a synchronous send. Instead, we now force the channel analysis
by setting explicitly CF_WAKE_ONCE flags on the corresponding channel if a
write event is reported after the synchronous send. CF_WRITE_EVENT flag is
removed explicitly just before, so it is quite easy to detect.
This patch must be backported to all stable versions at the same time as the
commit above.
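The detection described above can be sketched as follows. The flag values, the struct and both function names are illustrative assumptions; only the CF_WRITE_EVENT and CF_WAKE_ONCE names come from the commit message, and the real logic lives in process_stream().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values, not the ones from HAProxy's headers */
#define CF_WRITE_EVENT 0x01u
#define CF_WAKE_ONCE   0x02u

/* Minimal stand-in for a channel */
struct chn_sketch {
	unsigned int flags;
};

/* Simulated synchronous send: <reports_write> tells whether the send
 * (or a shutdown) reports a write event on the channel.
 */
void sync_send_sketch(struct chn_sketch *chn, int reports_write)
{
	assert(chn != NULL);
	if (reports_write)
		chn->flags |= CF_WRITE_EVENT;
}

void send_and_force_analysis(struct chn_sketch *chn, int reports_write)
{
	assert(chn != NULL);

	/* clear the event just before the send, so that an event seen
	 * afterwards was necessarily reported during the send
	 */
	chn->flags &= ~CF_WRITE_EVENT;
	sync_send_sketch(chn, reports_write);

	/* a write event was reported: force the channel analysis */
	if (chn->flags & CF_WRITE_EVENT)
		chn->flags |= CF_WAKE_ONCE;
}
```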
MEDIUM: peers: move process_peer_sync() to a single thread
The remaining half of the task_queue() and task_wakeup() contention
is caused by this function when peers are in use, because just like
process_table_expire(), it's created using task_new_anywhere() and
is woken up for local updates. Let's turn it to single thread by
rotating the assigned threads during initialization so that a table
only runs on one thread at a time.
Here we go backwards to assign the threads, so that on small setups
they don't end up on the same CPUs as the ones used by the stick-tables.
This way this will make an even better use of large machines. The
performance remains the same as with previous patch, even slightly
better (1-3% on avg).
At this point there's almost no multi-threaded task activity anymore
(only srv_cleanup_idle_server once in a while). This should improve
the situation described by Felipe in issues #3084 and #3101.
This should be backported to 3.2 after some extended checks.
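The two rotation directions can be contrasted with a short sketch. The helper names are hypothetical and the exact assignment formula used by the patch is not given in the message, so treat the arithmetic as one plausible reading of "rotating" and "going backwards".

```c
#include <assert.h>

/* Forward rotation: item <idx> runs on thread idx modulo nbthread */
int assign_thread_forward(int idx, int nbthread)
{
	assert(idx >= 0 && nbthread > 0);
	return idx % nbthread;
}

/* Backward rotation: same idea, but starting from the last thread, so
 * that the first items do not share a thread with the forward ones.
 */
int assign_thread_backward(int idx, int nbthread)
{
	assert(idx >= 0 && nbthread > 0);
	return nbthread - 1 - (idx % nbthread);
}
```

With 4 threads, the first table handled forward lands on thread 0 while the first one handled backward lands on thread 3.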
MEDIUM: stick-table: move process_table_expire() to a single thread
A big deal of the task_queue() contention is caused by this function
because it's created using task_new_anywhere() and is subject to
heavy updates. Let's turn it to single thread by rotating the assigned
threads during initialization so that a table only runs on one thread
at a time.
However there's a trick: the function used to call task_queue() to
requeue the task if it had advanced its timer (may only happen when
learning an entry from a peer). We can't do that anymore since we can't
queue another thread's task. Thus if the task needs to be
scheduled earlier than previously planned, we simply perform a wakeup.
It will likely do nothing and will self-adjust its next wakeup timer.
Doing so halves the number of multi-thread task wakeups. In addition
the request rate at saturation increased by 12% with 16 peers and 40
tables on 16 8-thread processes. This should improve the situation
described by Felipe in issues #3084 and #3101.
This should be backported to 3.2 after some extended checks.
BUG/MINOR: stick-table: make sure never to miss a process_table_expire update
In stktable_requeue_exp(), there's a tiny race at the beginning during
which we check the task's expiration date to decide whether or not to
wake process_table_expire() up. During this race, the task might just
have finished running on its owner thread and we can miss a task_queue()
opportunity, which probably explains why during testing it seldom happens
that a few entries are left at the end.
Let's perform a CAS to confirm the value is still the same before
leaving. This way we're certain that our value has been seen at least
once.
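A minimal sketch of the idea with C11 atomics is shown below. It is not the code of stktable_requeue_exp(): the function name is hypothetical, and dates are simplified to plain counters where "earlier" means "smaller", without the wrapping tick arithmetic HAProxy uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the caller advanced the task's expiration date and
 * must therefore wake or requeue the task.
 */
bool requeue_exp_sketch(_Atomic uint32_t *task_expire, uint32_t expire)
{
	uint32_t old_exp, new_exp;

	assert(task_expire != NULL);

	old_exp = atomic_load(task_expire);
	do {
		new_exp = (expire < old_exp) ? expire : old_exp;
		/* Even when new_exp == old_exp, the CAS confirms that the
		 * date did not change between the read and the decision.
		 * On failure, old_exp is reloaded and the loop retries.
		 */
	} while (!atomic_compare_exchange_weak(task_expire, &old_exp, new_exp));

	return new_exp != old_exp;
}
```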
MEDIUM: resolvers: make the process_resolvers() task single-threaded
This task is sometimes caught triggering the watchdog while waiting for
the infamous resolvers lock, or the scheduler's wait queue lock in
task_queue(). Both are caused by its multi-threaded capability. The
task may indeed start on a thread that's different from the one that
is currently receiving a response and that holds the resolvers lock,
and when being queued back, it requires to lock the wait queue. Both
problems disappear when sticking it to a single thread. But for configs
running multiple resolvers sections, it would be suboptimal to run them
all on the same thread. In order to avoid this, we implement a counter
in the resolvers_finalize_config() section that rotates the thread for
each resolvers section.
This was sufficient to further improve the performance here, making the
CPU usage drop to about 7% (from 11 previously or 38 initially) and not
showing any resolvers lock contention anymore in perf top output.
The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
MEDIUM: dns: bind the nameserver sockets to the initiating thread
There's still a big architectural limitation in the dns/resolvers code
regarding threads: resolvers run as a task that is scheduled to run
anywhere, and each NS dgram socket is bound to any thread of the same
thread group as the initiating thread. This becomes a big problem when
dealing with multiple nameservers because responses arrive on any thread,
start by locking the resolvers section, and other threads dealing with
responses are just stuck waiting for the lock to disappear. This means
that most of the time is exclusively spent causing contention. The
process_resolvers() function also suffers from this contention
but apparently less often.
It turns out that the nameserver sockets are created during emission
of the first packet, triggered from the resolvers task. The present
patch exploits this to stick all sockets to the calling thread instead
of any thread. This way there is no longer any contention between
multiple nameservers of a same resolvers section. Tests with a section
having 10 name servers showed that the CPU usage dropped from 38 to
about 10%, or almost by a factor of 4.
Note that TCP resolvers do not offer this possibility because the
tasks that manage the applets are created earlier to run anywhere
during config parsing. This might possibly be refined later, e.g.
by changing the task's affinity when it first runs.
The change was kept fairly minimal to permit a backport once enough
testing is conducted on it. It could address a significant part of
the trouble reported by Felipe in GH issue #3101.
BUG/MEDIUM: ssl: Fix a crash if we failed to create the mux
In ssl_sock_io_cb(), if we failed to create the mux, we may have
destroyed the connection, so only attempt to access it to get the ALPN
if conn_create_mux() was successful.
This fixes crashes that may happen when using ssl.
Commit 5ab9954faa9c815425fa39171ad33e75f4f7d56f introduced a new flag in
ssl_sock_ctx, to know that an ALPN was negotiated, however, the way to
get the ssl_sock_ctx was wrong for QUIC. If we're using QUIC, get it
from the quic_conn.
This should fix crashes when attempting to use QUIC.
DEBUG: stick-tables: export stktable_add_pend_updates() for better reporting
This function is a tasklet handler used to send peers updates, and it can
happen quite a bit in "show tasks" and "show profiling tasks", so let's
export it so that we don't face a cryptic symbol name:
BUG/MEDIUM: stick-tables: don't loop on non-expirable entries
The stick-table expiration of ref-counted entries was insufficiently
addressed by commit 324f0a60ab ("BUG/MINOR: stick-tables: never leave
used entries without expiration"), because now entries are just requeued
where they were, so they're visited over and over for long sessions,
causing process_table_expire() to loop, eating CPU and causing lock
contention.
Here we take care of refreshing their timer when they are met, so
that we don't meet them more than once per stick-table lifetime. It
should address at least a part of the recent degradation that Felipe
noticed in GH #3084.
Since the fix above was marked for backporting to 3.2, this one should
be backported there as well.
MINOR: tools: don't emit "+0" for symbol names which exactly match known ones
resolve_sym_name() knows a number of symbols, but when one exactly matches
(e.g. a task's handler), it systematically displays the offset behind it
("+0"). Let's only show the offset when non-zero. This can be backported
as this is helpful for debugging.
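The formatting rule can be illustrated with a tiny helper. format_sym_sketch() is a hypothetical function written for this example, not resolve_sym_name() itself, and the exact offset format is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "name" when <offset> is zero, "name+0x<offset>" otherwise.
 * Returns the value returned by snprintf().
 */
int format_sym_sketch(char *buf, size_t size, const char *name, size_t offset)
{
	assert(buf != NULL && name != NULL);

	if (offset == 0)
		return snprintf(buf, size, "%s", name);
	return snprintf(buf, size, "%s+%#zx", name, offset);
}
```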
MINOR: activity: indicate the number of calls on "show tasks"
The "show tasks" command can be useful to inspect run queues for active
tasks, but currently it's difficult to distinguish an occasional running
task from a heavily active one. Let's collect the number of calls for
each of them, report them averaged over the number of instances of each task
as well as a percentage of the total used. This way it even becomes
possible to get a hint about how CPU usage is distributed.
BUG/MINOR: activity: fix reporting of task latency
In 2.4, "show tasks" was introduced by commit 7eff06e162 ("MINOR:
activity: add a new "show tasks" command to list currently active tasks")
to expose some info about running tasks. The latency is not correct
because it's a u32 subtracted from a u64. It ought to have been cast
to u32 for the operation, which is what this patch does.
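The effect of the missing cast is easy to reproduce in isolation. The function and parameter names below are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: the u32 wake date is promoted to u64 before subtracting,
 * so the upper 32 bits of the current date leak into the result.
 */
uint64_t latency_buggy(uint64_t now, uint32_t wake_date)
{
	return now - wake_date;
}

/* Fixed form: truncate the current date to u32 first, so that the
 * subtraction is performed modulo 2^32 and survives wrapping.
 */
uint32_t latency_fixed(uint64_t now, uint32_t wake_date)
{
	uint32_t lat = (uint32_t)now - wake_date;

	/* same as truncating the 64-bit difference to its low 32 bits */
	assert(lat == (uint32_t)(now - wake_date));
	return lat;
}
```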
BUILD: ssl: address a recent build warning when QUIC is enabled
Since commit 5ab9954faa ("MINOR: ssl: Add a flag to let it known we have
an ALPN negociated"), when building with QUIC we get this warning:
src/ssl_sock.c: In function 'ssl_sock_advertise_alpn_protos':
src/ssl_sock.c:2189:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
Let's just move the instructions after the optional declaration. No
backport is needed.
MEDIUM: server: Make use of the stored ALPN stored in the server
Now that we store which ALPN gets negotiated for a given server, use that to
decide if we can create the mux right away in connect_server(), and use
it in conn_install_mux_be().
That way, we may create the mux soon enough for early data to be sent,
before the handshake has been completed.
This commit depends on several previous commits, and it has not been
deemed important enough to backport.
Willy Tarreau [Thu, 7 Aug 2025 15:32:24 +0000 (17:32 +0200)]
CLEANUP: backend: clarify the cases where we want to use early data
The conditions to use early data on output are super tricky and
detected later, so that it's difficult to figure how this works. This
patch splits the condition in two parts, the one that can be performed
early that is based on config/client/etc. It is used to clear a variable
that allows early data to be used in case any condition is not satisfied.
It was purposely split into multiple independent and reviewable tests.
The second part remains where it was at the end, and is used to temporarily
clear the handshake flags to let the data layer use early data. This one
being tricky, a large comment explaining the principle was added.
The logic was not changed at all, only the code was made more readable.
Willy Tarreau [Thu, 7 Aug 2025 15:06:45 +0000 (17:06 +0200)]
CLEANUP: backend: simplify the complex ifdef related to 0RTT in connect_server()
Since 3.0 we have HAVE_SSL_0RTT precisely to avoid checking horribly
complicated and unmaintainable conditions to detect support for 0RTT.
Let's just drop the complex condition and use the macro instead.
Willy Tarreau [Thu, 7 Aug 2025 14:30:38 +0000 (16:30 +0200)]
CLEANUP: backend: invert the condition to start the mux in connect_server()
Instead of trying to switch from delayed start to instant start based
on a single condition, let's do the opposite and preset the condition
to instant start and detect what could cause it to be delayed, thus
falling back to the slow mode. The condition remains exactly the
inverted one and better matches the comment about ALPN being the only
cause of such a delay.
Willy Tarreau [Thu, 7 Aug 2025 14:07:37 +0000 (16:07 +0200)]
CLEANUP: backend: clarify the role of the init_mux variable in connect_server()
The init_mux variable is currently used in a way that's not super easy
to grasp. It's set a bit too late and requires to know a lot of info at
once. Let's first rename it to "may_start_mux_now" to clarify its role,
as the purpose is not to *force* the mux to be initialized now but to
permit it to do it.
MEDIUM: server: Introduce the concept of path parameters
Add a new field in struct server, path parameters. It will contain
connection information for the server that is not expected to change.
For now, just store the ALPN negotiated with the server. Each time a
handshake is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, this information is
erased, as we can't be sure those parameters will be the same once the
server is back up.
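As a rough illustration of the life cycle described above, the sketch below stores a negotiated ALPN and erases it when the server goes down. The struct layout, field name, buffer size and all function names are assumptions made for this example; the actual field added to struct server may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PATH_ALPN_MAX 16 /* arbitrary size chosen for the sketch */

/* Hypothetical container for per-server path parameters */
struct path_params_sketch {
	char nego_alpn[PATH_ALPN_MAX];
};

/* Called after each handshake: remember the negotiated ALPN */
void path_params_store_alpn(struct path_params_sketch *pp, const char *alpn)
{
	assert(pp != NULL && alpn != NULL);
	snprintf(pp->nego_alpn, sizeof(pp->nego_alpn), "%s", alpn);
}

/* Called when the server goes down or is disabled: forget everything */
void path_params_reset(struct path_params_sketch *pp)
{
	assert(pp != NULL);
	memset(pp, 0, sizeof(*pp));
}

/* Returns the stored ALPN, or NULL if none is known */
const char *path_params_alpn(const struct path_params_sketch *pp)
{
	assert(pp != NULL);
	return pp->nego_alpn[0] ? pp->nego_alpn : NULL;
}
```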
MINOR: ssl: Use the new flag to know when the ALPN has been set.
Now that we have a flag to let us know the ALPN has been set, we no
longer have to call ssl_sock_get_alpn() to know if the ALPN has been
negotiated already.
Remove the call to conn_create_mux() from ssl_sock_handshake(), and just
reuse the one already present in ssl_sock_io_cb() if we have received
early data, and if the flag is set.
MINOR: ssl: Add a flag to let it known we have an ALPN negociated
Add a new flag to the ssl_sock_ctx, to be set as soon as the ALPN has
been negotiated.
This happens before the handshake has been completed, and that
information will let us know that, when we receive early data, if the
ALPN has been negotiated, then we can immediately create a mux, as the
ALPN will tell us which mux to use.
BUG/MEDIUM: ssl: create the mux immediately on early data
If we received early data, and an ALPN has been negotiated, then
immediately try to create a mux if we did not have one already.
Generally, at this point we would not have one, as the mux is decided by
the ALPN, however at this point, even if the handshake is not done yet,
we have enough to determine the ALPN, so we can immediately create the
mux.
Doing so makes us able to process the request immediately, without waiting
for the handshake to be done.
BUG/MEDIUM: h1: Allow reception if we have early data
In h1_recv_allowed(), do not forbid the reception if we are yet to
complete the connection, if we have received early data on it. That way,
we can deal with them right away, instead of waiting for the handshake
to be done.
MEDIUM: peers: don't even try to process updates under contention
Recent fix 2421c3769a ("BUG/MEDIUM: peers: don't fail twice to grab the
update lock") improved the situation a lot for peers under locking
contention but still not enough for situations with many peers and
many entries to expire fast. It's indeed still possible to trigger
warnings at end of injection sessions for 16 peers at 100k req/s each
doing 10 random track-sc when process_table_expire() runs and holds the
update lock if compiled with a high value of STKTABLE_MAX_UPDATES_AT_ONCE
(1000). Better just not insist in this case and postpone the update.
At this point, under load only ebmb_lookup() consumes CPU, other functions
are in the few percent, indicating reasonable contention, and peers remain
updated.
This should be backported to 3.2 after a bit of testing.
MEDIUM: stick-tables: don't wait indefinitely in stktable_add_pend_updates()
This one doesn't need to wait forever, if it cannot work it can postpone
it. When building with a high value of STKTABLE_MAX_UPDATES_AT_ONCE (1000),
it's still possible to trigger warnings in this function on the write lock
that is contended by peers and expiration. Changing it for a trylock resolves
the issue.
This should be backported to 3.2 after a bit of testing.
MEDIUM: stick-tables: give up on lock contention in process_table_expire()
process_table_expire() can take quite a lot of time running over all
shards. During this time it will hinder track-sc rules and peers, which
will experience an increased latency to do their work, especially peers
where each message will cause a lock, whose cumulated time can exceed
the watchdog's patience.
Here, we proceed just like in stktable_trash_oldest(), which is that
we're using a trylock to detect contention. The first time it happens,
if we hadn't purged anything, we switch to a regular lock to perform
the operation, and next time it happens we abort. This guarantees that
some entries will be expired and that contention will be reduced
when detected.
With this change, various tests didn't manage to produce any warning,
including at the end of the load generation session.
This should be backported to 3.2 after a bit more testing.
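The give-up policy can be simulated without real locks. In the sketch below, contended[i] stands for "a trylock on shard i would fail" and expirable[i] for the number of entries that would be purged once the shard is locked; both arrays and the function name are assumptions for the example, and aborting on a first contention after some entries were already purged is one reading of the description, not a confirmed detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated expiration pass over <nb> shards. Returns the number of
 * entries purged before finishing or giving up.
 */
int expire_pass_sketch(const bool *contended, const int *expirable, size_t nb)
{
	bool forced_once = false;
	int purged = 0;
	size_t i;

	assert(contended != NULL && expirable != NULL);

	for (i = 0; i < nb; i++) {
		if (contended[i]) {
			/* trylock failed: give up unless this is the first
			 * contention and nothing was purged yet
			 */
			if (forced_once || purged > 0)
				break;
			/* here a regular (blocking) lock would be taken */
			forced_once = true;
		}
		purged += expirable[i];
	}
	return purged;
}
```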
MEDIUM: stick-tables: relax stktable_trash_oldest() to only purge what is needed
stktable_trash_oldest() does insist a lot on purging what was requested,
only limited by STKTABLE_MAX_UPDATES_AT_ONCE. This is called in two
conditions, one to allocate a new stksess, and the other one to purge
entries of a stopping process. The cost of iterating over all shards
is huge, and a shard lock is taken each time before looking up entries.
Moreover, multiple threads can end up doing the same and looking hard for
many entries to purge when only one is needed. Furthermore, all threads
start from the same shard, hence synchronize their locks. All of this
costs a lot to other operations such as access from peers.
This commit simplifies the approach by ignoring the budget, starting
from a random shard number, and using a trylock so as to be able to
give up early in case of contention. The approach chosen here consists
in trying hard to flush at least one entry, but once at least one is
evicted or at least one trylock failed, then a failure on the trylock
will result in finishing.
The function now returns a success as long as one entry was freed.
With this, tests no longer show watchdog warnings while running, though
a few still remain when stopping the tests (which are not related to
this function but to the contention from process_table_expire()).
With this change, under high contention some entries' purge might be
postponed and the table may occasionally contain slightly more entries
than their size (though this already happens since stksess_new() first
increments ->current before decrementing it).
Measures were made on a 64-core system with 8 peers
of 16 threads each, at CPU saturation (350k req/s each doing 10
track-sc) for 10M req, with 3 different approaches:
- this one resulted in 1500 failures to find an entry (0.015%
size overhead), with the lowest contention and the fairest
    peers distribution.
- leaving only after a success resulted in 229 failures (0.0029%
size overhead) but doubled the time spent in the function (on
the write lock precisely).
- leaving only when both a success and a failed lock were met
resulted in 31 failures (0.00031% overhead) but the contention
was high enough again so that peers were not all up to date.
Considering that a saturated machine might exceed its entries by
0.015% is pretty minimal, the mechanism is kept.
This should be backported to 3.2 after a bit more testing as it
resolves some watchdog warnings and panics. It requires precedent
commit "MINOR: stick-table: permit stksess_new() to temporarily
allocate more entries" to over-allocate instead of failing in case
of contention.
MINOR: stick-table: permit stksess_new() to temporarily allocate more entries
stksess_new() calls stktable_trash_oldest() to release some entries.
If it fails however, it will fail to allocate an entry. This is a problem
because it doesn't permit stktable_trash_oldest() to be used in best effort
mode, which forces it to impose high contention. There's no problem with
allocating slightly more in practice. In the worst case if all entries are
in use, it's not shocking to temporarily exceed the number of entries by a
few units.
Let's relax this problematic rule. This patch might need to be backported
to 3.2 after a bit more testing in order to support locking relaxation.
The following functions take locks and are often involved in warnings
but are currently not resolved, so let's export them so that they are
properly decoded:
MINOR: debug: report the time since last wakeup and call
When task profiling is enabled, the current thread knows when the
currently running task was woken up and called, so we can calculate
how long ago it was woken up and called. This is convenient to figure
whether or not a warning or panic is caused by this task or by a
previous one, so let's report this info in thread outputs when known.
MINOR: debug: report the number of loops and ctxsw for each thread
When multiple similar warnings are emitted, it can be difficult to know
whether only one task is looping slowly or if many are sharing the CPU.
Let's report the number of context switches and polling loop turns in
thread dumps so that warnings are easier to understand.
DEBUG: stream: count the number of passes in the connect loop
Normally the connect loop cannot loop, but some recent traces can easily
convince one of the opposite. Let's add a counter, including in panic
dumps, in order to avoid the repeated long head scratching sessions
starting with "and what if...". In addition, if it's found to loop, this
time it will be certain and will indicate what to zoom in. This should
be backported to 3.2.
MINOR: debug: report the process id in warnings and panics
Warning and panic messages currently do not report the PID. This is
annoying when trying to reproduce problems because warnings do not
allow one to know which process to attach to in order to debug, and panics
do not permit to know which core dump corresponds to which dump.
Let's add them in both messages. This should probably be backported
at least to 3.2.
MINOR: check: reject invalid check config on a QUIC server
QUIC is now supported on the backend side. The previous commit ensures
that simple checks can be activated on QUIC servers without any issue.
The current patch ensures that check server settings remain compatible
with a QUIC server. Thus, configuration is now invalid if check
specifies an explicit MUX proto other than QUIC, disables SSL or tries to
use PROXY protocol.
BUG/MINOR: check: ensure checks are compatible with QUIC servers
Previously, checks were only performed on TCP. However, QUIC is now
supported on backend. Prior to this patch, check activation for QUIC
servers would result in a crash.
To ensure compatibility between QUIC servers and checks, adjust
protocol_lookup() performed during check connect step. Instead of using
a hardcoded PROTO_TYPE_STREAM, the value is now derived from server
settings.
BUG/MEDIUM: checks: fix ALPN inheritance from server
If no specific check settings are defined on a server line, it is
expected that these checks will be performed with the same parameters as
normal connections on the same server.
ALPN must be carefully taken into account for checks. Most notably, MUX
initialization is delayed so that it is performed only after SSL
handshake.
Prior to this patch, MUX init delay was only performed if ALPN was
defined via check settings. Thus, with the following settings, checks
would be performed on HTTP/1.1 without consulting ALPN negotiation
result from the server :
server s1 127.0.0.1:443 ssl crt <...> alpn h2 check
This bug may result in checks reporting failure, for example in case of
a server answering HTTP/2 to ALPN negotiation to the configuration
above. Besides, there is an inconsistency between normal and check
connections, which is not what the documentation specifies.
This patch fixes this code. Now server parameters are also taken into
account. This ensures that checks and normal connections by default
use the same connection method.
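The resulting decision can be summarized as a predicate. This is a simplified sketch: the struct, its fields and the function name are hypothetical, and the SSL test reflects the statement that the delay only makes sense when an SSL handshake takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the settings involved in the decision */
struct check_params_sketch {
	bool use_ssl;            /* SSL is active for the check connection */
	const char *check_alpn;  /* ALPN from check settings, or NULL */
	const char *srv_alpn;    /* ALPN from server settings, or NULL */
};

/* Returns true when MUX initialization must wait for the handshake */
bool check_delays_mux(const struct check_params_sketch *p)
{
	assert(p != NULL);

	if (!p->use_ssl)
		return false;
	/* before the fix, only check_alpn was considered here */
	return p->check_alpn != NULL || p->srv_alpn != NULL;
}
```

With the server line shown above, srv_alpn would be "h2" and check_alpn would be unset, so the MUX is now chosen after the ALPN negotiation.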
OPTIM: check: do not delay MUX for ALPN if SSL not active
To ensure ALPN is properly applied on checks, MUX initialization is
delayed so that it is created on SSL handshake completion. However, this
does not check if SSL is really active for the connection.
This patch adjusts the condition so that MUX init is not delayed if SSL
is not active for the check connection. A similar process is already
conducted for normal connections via connect_server().
This must be backported up to 2.4. Despite not being a bug, it must be
backported for the following patch which fixes check ALPN inheritance
from server settings.
BUG/MINOR: hq-interop: adjust parsing/encoding on backend side
HTTP/0.9 is available on top of QUIC. This protocol is reserved for
internal use, mostly interop purpose.
This patch adjusts HTTP/0.9 layer with the following changes :
* version is not emitted anymore on the status line. This is performed
as some servers do not parse it correctly.
* status line is set explicitly on HTX status-line. This ensures the
correct HTTP status code is reported to the upper stream layer.
BUG/MEDIUM: mux-h2: Reinforce conditions to report an error to app-layer stream
This patch relies on the previous one ("BUG/MEDIUM: mux-h2: Report RST/error to
app-layer stream during 0-copy fwding").
When the end of the connection is detected, so when the H2_CF_END_REACHED
flag is set after the shutdown was received and all incoming data were
processed, if a stream is blocked by the flow control (the stream one or the
connection one), an error must be reported to the app-layer stream.
Otherwise, outgoing data won't be sent and the opposite side will handle
this as a lack of room. So the stream will be blocked until the write
timeout is triggered. By reporting the error early, the stream can be
immediately closed.
This patch should be backported to 3.2. For older versions, it is probably a
good idea to wait for bug report.
BUG/MEDIUM: mux-h2: Report RST/error to app-layer stream during 0-copy fwding
In h2_nego_ff(), it is important to report reset and error to app-layer
stream and to send the RST-STREAM frame accordingly. It is not clear if it
is an issue or not. But it is clearly a difference with the classical
forwarding via h2_snd_buf. And it is mandatory for the next fix.
This patch should be backported to 3.2. But it is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
BUG/MINOR: mux-h2: Remove H2_CF_DEM_DFULL flags when the demux buffer is reset
This only happens when a connection error is detected or when the H2
connection is in ERR/ERR2 state. The demux buffer is explicitly reset. In
that case, it is important to remove the flag reporting this buffer as full.
It is probably worth to backport this patch to 3.2. But it is not mandatory
on older versions because it does not fix any known issue.
BUG/MEDIUM: mux-h2: Restart reading when mbuf ring is no longer full
When the mbuf ring buffer is full, the flag H2_CF_DEM_MROOM is set on the H2
connection to block any demux. It is important to properly handle ACK
frames. However, we must take care to restart reading when some data were
removed from the mbuf. Otherwise, we may block the demux for no reason. It
is especially an issue if the demux buffer is full. In that case, the H2
connection is blocked, waiting for the timeout.
This patch should be backported to 3.2. But it is probably a good idea to
not backport it on older versions, except if a bug is reported in this area.
BUG/MEDIUM: mux-h2; Don't block reveives in H2_CS_ERROR and H2_CS_ERROR2 states
The H2 connection is switched to ERR when a GOAWAY must be sent and in ERR2
when it is sent. In these states, no more data can be emitted by the
mux. But there is no reason to not try to process incoming data or to not
try to receive data. It is especially important to be able to get the
shutdown from the TCP connection when an SSL connection was previously
detected. Otherwise, it is possible to block a H2 connection until its
timeout expiration to be able to close it.
This patch should be backported to 3.2. But is is probably a good idea to
not backport it on older versions, except if a bug is reported in this
area.
BUG/MEDIUM: mux-h2: Reset MUX blocking flags when a send error is caught
When a send error is detected on the underlying connection, a pending error
is reported to the H2 connection by setting the H2_CF_ERR_PENDING flag. When
this happens, the tail of the mux ring buffer is reset. However, some blocking
flags remain set and have no chance to be removed later because of the
pending error. This is especially true of the H2_CF_DEM_MROOM flag, which
blocks data demultiplexing. Thus, it is possible to block an H2 connection
with unparsed incoming data.
Worse, if a read event is received, it could lead to a wakeup loop between
the H2 connection and the underlying SSL connection. The H2 connection is
unable to convert the pending error to a fatal error because the
demultiplexing is blocked. In the meantime, it tries to receive more data
because of the not-consumed read event. On the underlying connection side,
the error detected earlier blocks the read, but the H2 connection is woken
up to handle the error.
To fix the issue, the blocking flags H2_CF_MUX_MFULL and H2_CF_DEM_MROOM must
be removed when a send error is caught. In addition, there is no reason to
release only the tail of the mbuf ring: when a send error is detected, all
outgoing data can be flushed. So, now, in h2_send(), the h2_release_mbuf()
function is called on pending error. The mbuf ring is fully released and the
H2_CF_MUX_MFULL and H2_CF_DEM_MROOM flags are removed.
Many thanks to Krzysztof Kozłowski for his help to spot this issue.
This patch could be backported at least as far as 2.8. But it is a bit
sensitive. So, it is probably a good idea to backport it to 3.2 for now and
wait for bug reports on older versions.
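The flag handling described in this fix can be pictured with the minimal sketch below. It is not the mux-h2 code: the structure, the function name and the flag values are hypothetical and only mirror the roles of H2_CF_MUX_MFULL, H2_CF_DEM_MROOM and H2_CF_ERR_PENDING.

```c
#include <stdint.h>

/* Hypothetical flag values, chosen for this sketch only. */
#define CF_MUX_MFULL   0x01u  /* mux blocked: output ring is full */
#define CF_DEM_MROOM   0x02u  /* demux blocked: waiting for room in the ring */
#define CF_ERR_PENDING 0x04u  /* send error reported, not yet fatal */

/* Hypothetical connection state: flags plus the number of pending
 * output buffers in the ring.
 */
struct conn_state {
	uint32_t flags;
	unsigned int ring_used;
};

/* On a send error, flag the pending error, drop all pending output and
 * clear the blocking flags so that the demux can run again and turn the
 * pending error into a fatal one.
 */
static void on_send_error(struct conn_state *c)
{
	c->flags |= CF_ERR_PENDING;
	c->ring_used = 0;
	c->flags &= ~(CF_MUX_MFULL | CF_DEM_MROOM);
}
```

If only the tail of the ring were released, the two blocking flags could stay set with nothing left to clear them, which is the situation described above.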
BUG/MINOR: quic: properly support GSO on backend side
Previously, GSO emission was explicitly disabled on backend side. This
is not true since the following patch, thus GSO can be used, for example
when transferring large POST requests to an HTTP/3 backend.
However, GSO on the backend side may cause a crash when handling EIO. In
this case, GSO must be completely disabled. Previously, this was
performed by flagging the listener instance. On the backend side, this would
cause a crash as the listener is NULL.
This patch fixes it by supporting GSO disable flag for servers. Thus, in
qc_send_ppkts(), EIO can be converted either to a listener or server
flag depending on the quic_conn proxy side. On backend side, server
instance is retrieved via <qc.conn.target>. This is enough to guarantee
that server is not deleted.
MINOR: pools: Don't dump anymore info about pools when purge is forced
Historically, when the purge of pools was forced by sending a SIGQUIT to
haproxy, information about the pools was first dumped. It is now totally
pointless because this info can be retrieved via the CLI. It is even less
relevant now because the purge is typically forced when there are memory
issues and to dump pools information, data must be allocated.
dump_pools_info() function was simplified because it is now called only from
an applet. No reason to still try to dump info on stderr.
BUG/MINOR: pools: Fix the dump of pools info to deal with buffers limitations
The "show pools" CLI command was not designed to dump information exceeding
the size of a buffer. But there are now many more pools than a few years ago
and when detailed information is dumped, we exceed the buffer limit and
the output is truncated.
To fix the issue, the command must be refactored to be able to stream the
result. To do so, the array containing pools info is now part of the command
context and it is dynamically allocated. A dedicated function was created to
fill all info. In addition, the index of the next pool to dump is saved in
the command context too to properly handle resumption cases. Finally, global
information about pools is also stored in the command context for
convenience.
This patch should fix the issue #3067. It must be backported to 3.2. On
older releases, the buffer limit is never reached.
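The resumption mechanism can be sketched as follows. This is only an illustration of the idea of saving the next index in a command context. The structure and function names are hypothetical and do not match the real CLI applet code. The sketch assumes that each entry fits in an empty buffer.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical command context: the entries to dump and the resume index. */
struct dump_ctx {
	const char **names;  /* one line of text per entry */
	size_t nb;           /* number of entries */
	size_t next;         /* index of the next entry to dump */
};

/* Append as many entries as fit into <buf> (of <size> bytes), starting at
 * ctx->next. Returns 1 when everything was dumped, 0 when the caller must
 * flush <buf> and call again to resume from ctx->next.
 */
static int dump_some(struct dump_ctx *ctx, char *buf, size_t size)
{
	size_t used = 0;

	buf[0] = '\0';
	while (ctx->next < ctx->nb) {
		size_t need = strlen(ctx->names[ctx->next]) + 1; /* text + '\n' */

		if (used + need >= size)
			return 0; /* buffer full: resume later */
		used += (size_t)snprintf(buf + used, size - used, "%s\n",
		                         ctx->names[ctx->next]);
		ctx->next++;
	}
	return 1;
}
```

The caller loops on dump_some() until it returns 1, sending the buffer content between two calls, so the output size is no longer bounded by a single buffer.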
REGTESTS: ssl: Fix the script about automatic SNI selection
First, the barrier to delay the client execution was moved before the client
definition. Otherwise, the connection is established too early and with
short timeouts it could be closed before the requests are sent.
The main purpose of the barrier was to work around slow health-checks. This
is also the reason why the script was flagged as slow. But it can be
significantly sped up by setting a low "inter" value. It is now set to
100ms and the script is no longer slow.
The below patch fixes padding emission for small packets, which is
required to ensure that header protection removal can be performed by
the recipient.
In addition to the proper fix, constant QUIC_HP_SAMPLE_LEN was removed
and replaced by QUIC_TLS_TAG_LEN. However, it still makes sense to have
a dedicated constant which represents the size of the sample used for
header protection. Thus, this patch restores it.
Special instructions for backport : above patch mentions that no
backport is needed. However, this is incorrect, as bug is introduced by
another patch scheduled for backport up to 2.6. Thus, it is first
mandatory to schedule d7dea408c64c327cab6aebf4ccad93405b675565 after it.
Then, this patch can also be used for the sake of code clarity.
Define a new "quic_tx" unit-test which is used to test QUIC TX module.
For the moment, a single test is performed on qc_do_build_pkt(). It
checks that PADDING is correctly added for HP sampling in case of a
small packet.
MINOR: stats-file: use explicit unsigned integer bitshift for user slots
As reported in GH #3104, there remained a place where (1 << shift) was
used to set or remove bits from the uint64_t users bitfield. It is incorrect
and could lead to bugs for values > 32 bits.
Instead, let's use 1ULL to ensure the operation remains 64bits consistent.
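To illustrate why the type of the constant matters, here is a minimal sketch. It is not the stats-file code: the helper names slot_set() and slot_clear() are hypothetical, only the use of 1ULL on a uint64_t bitfield comes from the description above.

```c
#include <stdint.h>

/* Hypothetical helpers: set or clear one user slot in a 64-bit bitfield.
 * With a plain "1 << shift", the constant is an int, so the shift is
 * undefined once <shift> reaches the width of int (usually 32 bits).
 * 1ULL is at least 64 bits wide, so every slot from 0 to 63 is valid.
 */
static inline void slot_set(uint64_t *users, unsigned int shift)
{
	*users |= (1ULL << shift);
}

static inline void slot_clear(uint64_t *users, unsigned int shift)
{
	*users &= ~(1ULL << shift);
}
```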
BUG/MEDIUM: proxy: fix crash with stop_proxy() called during init
Willy reported that the following config would segfault right after the
"removing incomplete section 'peer'" is emitted:
    peers peers
        bind :2300
        server n10 127.0.0.1:2310
    listen dummy
        bind localhost:9999
This is caused by the fact that stop_proxy(), which tries to read shared
counters, is called during early init while shared counters are not yet
initialized. To fix the crash, let's check if we're still in the starting
phase, in which case we assume the counters are not initialized and we
assume 0 value instead.
No backport needed unless 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") is.
Mimic the same behavior as the one for SSL/TCP connections to implement the
SSL session reuse.
Extract the code which tries to reuse the SSL session for SSL/TCP connections
to implement ssl_sock_srv_try_reuse_sess().
Call this function from the QUIC ->init() xprt callback (qc_conn_init()) as this
is done for SSL/TCP connections.
When kTLS is compiled in, make sure msg_controllen is initialized to 0.
If we're not actually kTLS, then it won't be set, but we'll check that
it is non-zero later to check if we have ancillary data.
This does not need to be backported.
This should fix CID 1620865, as reported in github issue #3106.
BUG/MINOR: cpu_topo: work around a small bug in musl's CPU_ISSET()
As found in GH issue #3103, CPU_ISSET() on musl 1.25 doesn't match the man
page which says it's returning an int. The reason is pretty simple, it's
a macro that operates on the bits directly and returns the result of the
bit field applied to the mask as an unsigned long. Bits above 31 will
simply be dropped if returned as an int, which causes CPUs 32..63 to
appear as absent from cpu_sets.
The fix is trivial, it consists in just comparing the result against zero
(i.e. turning it to a boolean), but before it's merged and deployed we'll
have to face such deployments, so better implement the same workaround
in the code here since we have access to the raw long value.
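The truncation and its workaround can be reproduced with the minimal sketch below. RAW_ISSET() is a hypothetical stand-in for a CPU_ISSET()-like macro returning the masked word, it is not musl's definition, and a fixed 64-bit word is used to keep the example portable.

```c
#include <stdint.h>

/* Hypothetical stand-in for a CPU_ISSET()-like macro which returns the
 * masked word itself (a 64-bit value) instead of a normalized 0/1 int.
 */
#define RAW_ISSET(cpu, word) ((word) & (1ULL << (cpu)))

/* Fragile: converting the 64-bit result to int typically keeps only the
 * low 32 bits, so CPUs 32..63 look absent.
 */
static inline int cpu_present_truncated(unsigned int cpu, uint64_t word)
{
	return (int)RAW_ISSET(cpu, word);
}

/* Workaround: compare against zero before the value is narrowed. */
static inline int cpu_present(unsigned int cpu, uint64_t word)
{
	return RAW_ISSET(cpu, word) != 0;
}
```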
BUG/MINOR: quic: too short PADDING frame for too short packets
This bug arrived with this commit:
MINOR: quic: centralize padding for HP sampling on packet building
What was missed is the fact that at the centralization point for the
PADDING frame to add for too short packet, <len> payload length already includes
<*pn_len> the packet number field length value.
So when computing the length of the PADDING frame, the packet field length must
not be considered and added to the payload length (<len>).
This bug led to too short PADDING frames being added to too short packets. This
was the case, most of the time, with Application level packets with a 1-byte
packet number field followed by a 1-byte PING frame. A 1-byte PADDING frame was
added in this case in place of a correct 2-byte PADDING frame. The header
protection of such packets could not be removed by clients, for instance by
ngtcp2, which emitted such traces:
I00001828 0x5a135c81e803f092c74bac64a85513b657 pkt could not decrypt packet number
As the header protection could not be removed, the header keyupdate bit could also
not be read by packet analyzers such as pyshark used during the keyupdate tests.
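The arithmetic behind this bug can be checked with the sketch below. It is not the qc_do_build_pkt() code: the function and macro names are hypothetical. The 4-byte minimum is taken from RFC 9001 section 5.4.2, which places the header protection sample 4 bytes after the start of the packet number field.

```c
#include <stddef.h>

/* Per RFC 9001 section 5.4.2, the packet number field plus the plaintext
 * payload must span at least 4 bytes so that a full sample can be taken
 * once the AEAD tag is appended. The name of this macro is hypothetical.
 */
#define HP_MIN_PN_AND_PAYLOAD 4

/* Correct computation: <len> already includes the packet number length. */
static inline size_t hp_padding_len(size_t len)
{
	return len < HP_MIN_PN_AND_PAYLOAD ? HP_MIN_PN_AND_PAYLOAD - len : 0;
}

/* The miscalculation described above: <pn_len> is counted a second time,
 * which makes the packet look longer than it is.
 */
static inline size_t hp_padding_len_buggy(size_t len, size_t pn_len)
{
	size_t total = len + pn_len;

	return total < HP_MIN_PN_AND_PAYLOAD ? HP_MIN_PN_AND_PAYLOAD - total : 0;
}
```

With a 1-byte packet number and a 1-byte PING frame, <len> is 2: the correct computation gives 2 bytes of padding while the double counting gives only 1, which matches the case reported above.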
REGTESTS: ssl: Add a script to test the automatic SNI selection
The script reg-tests/ssl/ssl_sni_auto.vtc tests the automatic SNI selection
for regular server connections and for health-check ones. It relies on a
3.3-dev8 feature (in fact, it was pushed just after the dev8).
MEDIUM: httpcheck/ssl: Base the SNI value on the HTTP host header by default
Similarly to the automatic SNI selection for regular SSL traffic, the SNI of
health-check HTTPS connections is now automatically set by default by using
the host header value. "check-sni-auto" and "no-check-sni-auto" server
settings were added to change this behavior.
Only implicit HTTPS health-checks can take advantage of this feature. In
this case, the host header value from the "option httpchk" directive is used
to extract the SNI. It is disabled if http-check rules are used. So, the SNI
must still be explicitly specified via a "http-check connect" rule.
This patch should partially fix the issue #3081.
MEDIUM: server/ssl: Base the SNI value to the HTTP host header by default
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). It is now the default behavior. It could be disabled with the
"no-sni-auto" server keyword. And eventually "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via "pool-conn-name" setting, the selected value is used.
The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".
This patch should partially fix the issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
BUG/MINOR: tcpcheck: Don't use sni as pool-conn-name for non-SSL connections
When we try to reuse a connection to perform a healthcheck, the SNI, from the
tcpcheck connection or the healthcheck itself, must not be used as
connection name for non-SSL connections.
OPTIM: tcpcheck: Don't set SNI and ALPN for non-ssl connections
There is no reason to set the SNI and ALPN for non-ssl connections. It is
not really an issue because ssl_sock_set_servername() and
ssl_sock_set_alpn() functions will do nothing. But it is cleaner this way
and this could avoid bugs in future.
OPTIM: proto_rhttp: Don't set SNI for non-ssl connections
There is no reason to set the SNI for non-ssl connections. It is not really
an issue because ssl_sock_set_servername() function will do nothing. But
there is no reason to uselessly evaluate an expression.
OPTIM: backend: Don't set SNI for non-ssl connections
There is no reason to set the SNI for non-ssl connections. It is not really
an issue because ssl_sock_set_servername() function will do nothing. But
there is no reason to uselessly evaluate an expression.
BUG/MINOR: server: Update healthcheck when server settings are changed via CLI
Not all changes are concerned. But when the SSL is enabled or disabled for a
server, the healthcheck xprt must eventually be updated too. This happens
when the healthcheck relies on the server settings.
In the same spirit, when the healthcheck address and port are updated, we
must fallback on the raw xprt if the SSL is not explicitly enabled for the
healthcheck with a "check-ssl" parameter.
This patch should be backported to all stable versions.
BUG/MEDIUM: server: Use sni as pool connection name for SSL server only
By default, for a given server, when no pool-conn-name is specified, the
configured sni is used. However, this must only be done when SSL is in-use
for the server. Of course, it is uncommon to have a sni expression for
a non-ssl server. But this may happen.
In addition, the SSL may be disabled via the CLI. In that case, the
pool-conn-name must be discarded if it was copied from the sni. And, we must
of course take care to set it if the ssl is enabled.
Finally, when the attach-srv action is checked, we now check the
pool-conn-name expression.
This patch should be backported as far as 3.0. It relies on "MINOR: server:
Parse sni and pool-conn-name expressions in a dedicated function" which
should be backported too.
MINOR: server: Parse sni and pool-conn-name expressions in a dedicated function
This change is mandatory to fix an issue. The parsing of sni and
pool-conn-name expressions (from string to expression) is now handled in a
dedicated function. This will avoid to duplicate the same code at different
places.
BUG/MINOR: acl: Fix error message about several '-m' parameters
There is a typo in the commit c51ddd5c3 ("MINOR: acl: Only allow one '-m'
matching method"). '*m' was reported in the error message instead of '-m'.
In addition, it is now mentioned that only the last one should be kept if
an old config triggers the error.
No backport needed, except if the commit above is backported.
Released version 3.3-dev8 with the following main changes :
- BUG/MEDIUM: mux-h2: fix crash on idle-ping due to unwanted ABORT_NOW
- BUG/MINOR: quic-be: missing Initial packet number space discarding
- BUG/MEDIUM: quic-be: crash after backend CID allocation failures
- BUG/MEDIUM: ssl: apply ssl-f-use on every "ssl" bind
- BUG/MAJOR: stream: Remove READ/WRITE events on channels after analysers eval
- MINOR: dns: dns_connect_nameserver: fix fd leak at error path
- BUG/MEDIUM: quic: reset padding when building GSO datagrams
- BUG/MINOR: quic: do not emit probe data if CONNECTION_CLOSE requested
- BUG/MAJOR: quic: fix INITIAL padding with probing packet only
- BUG/MINOR: quic: don't coalesce probing and ACK packet of same type
- MINOR: quic: centralize padding for HP sampling on packet building
- MINOR: http_ana: fix typo in http_res_get_intercept_rule
- BUG/MEDIUM: http_ana: handle yield for "stats http-request" evaluation
- MINOR: applet: Rely on applet flag to detect the new api
- MINOR: applet: Add function to test applet flags from the appctx
- MINOR: applet: Add a flag to know an applet is using HTX buffers
- MINOR: applet: Make some applet functions HTX aware
- MEDIUM: applet: Set .rcv_buf and .snd_buf functions on default ones if not set
- BUG/MEDIUM: mux-spop: Reject connection attempts from a non-spop frontend
- REGTESTS: jwt: create dynamically "cert.ecdsa.pem"
- BUG/MEDIUM: spoe: Improve error detection in SPOE applet on client abort
- MINOR: haproxy: abort config parsing on fatal errors for post parsing hooks
- MEDIUM: server: split srv_init() in srv_preinit() + srv_postinit()
- MINOR: proxy: handle shared listener counters preparation from proxy_postcheck()
- DOC: configuration: reword 'generate-certificates'
- BUG/MEDIUM: quic-be: avoid crashes when releasing Initial pktns
- BUG/MINOR: quic: reorder fragmented RX CRYPTO frames by their offsets
- MINOR: ssl: diagnostic warning when both 'default-crt' and 'strict-sni' are used
- MEDIUM: ssl: convert diag to warning for strict-sni + default-crt
- DOC: configuration: clarify 'default-crt' and implicit default certificates
- MINOR: quic: remove ->offset qf_crypto struct field
- BUG/MINOR: mux-quic: trace with non initialized qcc
- BUG/MINOR: acl: set arg_list->kw to aclkw->kw string literal if aclkw is found
- BUG/MEDIUM: mworker: fix startup and reload on macOS
- BUG/MINOR: connection: rearrange union list members
- BUG/MINOR: connection: remove extra session_unown_conn() on reverse
- MINOR: cli: display failure reason on wait command
- BUG/MINOR: server: decrement session idle_conns on del server
- BUG/MINOR: mux-quic: do not access conn after idle list insert
- MINOR: session: document explicitely that session_add_conn() is safe
- MINOR: session: uninline functions related to BE conns management
- MINOR: session: refactor alloc/lookup of sess_conns elements
- MEDIUM: session: protect sess conns list by idle_conns_lock
- MINOR: server: shard by thread sess_conns member
- MEDIUM: server: close new idle conns if server in maintenance
- MEDIUM: session: close new idle conns if server in maintenance
- MINOR: server: cleanup idle conns for server in maint already stopped
- MINOR: muxes: enforce thread-safety for private idle conns
- MEDIUM: conn/muxes/ssl: reinsert BE priv conn into sess on IO completion
- MEDIUM: conn/muxes/ssl: remove BE priv idle conn from sess on IO
- MEDIUM: mux-quic: enforce thread-safety of backend idle conns
- MAJOR: server: implement purging of private idle connections
- MEDIUM: session: account on server idle conns attached to session
- MAJOR: server: do not remove idle conns in del server
- BUILD: mworker: fix ignoring return value of ‘read’
- DOC: unreliable sockpair@ on macOS
- MINOR: muxes: adjust takeover with buf_wait interaction
- OPTIM: backend: set release on takeover for strict maxconn
- DOC: configuration: confuse "strict-mode" with "zero-warning"
- MINOR: doc: add missing statistics column
- MINOR: doc: add missing statistics column
- MINOR: stats: display new curr_sess_idle_conns server counter
- MINOR: proxy: extend "show servers conn" output
- MEDIUM: proxy: Reject some header names for 'http-send-name-header' directive
- BUG/BUILD: stats: fix build due to missing stat enum definition
- DOC: proxy-protocol: Make example for PP2_SUBTYPE_SSL_SIG_ALG accurate
- CLEANUP: quic: remove a useless CRYPTO frame variable assignment
- BUG/MEDIUM: quic: CRYPTO frame freeing without eb_delete()
- BUG/MAJOR: mux-quic: fix crash on reload during emission
- MINOR: conn/muxes/ssl: add ASSUME_NONNULL() prior to _srv_add_idle
- REG-TESTS: map_redirect: Don't use hdr_dom in ACLs with "-m end" matching method
- MINOR: acl: Only allow one '-m' matching method
- MINOR: acl; Warn when matching method based on a suffix is overwritten
- BUG/MEDIUM: server: Duplicate healthcheck's alpn inherited from default server
- BUG/MINOR: server: Duplicate healthcheck's sni inherited from default server
- BUG/MINOR: acl: Properly detect overwritten matching method
- BUG/MINOR: halog: Add OOM checks for calloc() in filter_count_srv_status() and filter_count_url()
- BUG/MINOR: log: Add OOM checks for calloc() and malloc() in logformat parser and dup_logger()
- BUG/MINOR: acl: Add OOM check for calloc() in smp_fetch_acl_parse()
- BUG/MINOR: cfgparse: Add OOM check for calloc() in cfg_parse_listen()
- BUG/MINOR: compression: Add OOM check for calloc() in parse_compression_options()
- BUG/MINOR: tools: Add OOM check for malloc() in indent_msg()
- BUG/MINOR: quic: ignore AGAIN ncbuf err when parsing CRYPTO frames
- MINOR: quic/flags: complete missing flags
- BUG/MINOR: quic: fix room check if padding requested
- BUG/MINOR: quic: fix padding issue on INITIAL retransmit
- BUG/MINOR: quic: pad Initial pkt with CONNECTION_CLOSE on client
- MEDIUM: quic: strengthen BUG_ON() for unpad Initial packet on client
- DOC: configuration: rework the jwt_verify keyword documentation
- BUG/MINOR: haproxy: be sure not to quit too early on soft stop
- BUILD: acl: silence a possible null deref warning in parse_acl_expr()
- MINOR: quic: Add more information about RX packets
- CI: fix syntax of Quic Interop pipelines
- MEDIUM: cfgparse: warn when using user/group when built statically
- BUG/MEDIUM: stick-tables: don't leave the expire loop with elements deleted
- BUG/MINOR: stick-tables: never leave used entries without expiration
- BUG/MEDIUM: peers: don't fail twice to grab the update lock
- MINOR: stick-tables: limit the number of visited nodes during expiration
- OPTIM: stick-tables: exit expiry faster when the update lock is held
- MINOR: counters: retrieve detailed errmsg upon failure with counters_{fe,be}_shared_prepare()
- MINOR: stats-file: introduce shm-stats-file directive
- MEDIUM: stats-file: processes share the same clock source from shm-stats-file
- MINOR: stats-file: add process slot management for shm stats file
- MEDIUM: stats-file/counters: store and preload stats counters as shm file objects
- DOC: config: document "shm-stats-file" directive
- OPTIM: stats-file: don't unnecessarily die hard on shm_stats_file_reuse_object()
- MINOR: compiler: add ALWAYS_PAD() macro
- BUILD: stats-file: fix aligment issues
- MINOR: stats-file: reserve some bytes in exported structs
- MEDIUM: stats-file: add some BUG_ON() guards to ensure exported structs are not changed by accident
- BUG/MINOR: check: ensure check-reuse is compatible with SSL
- BUG/MINOR: check: fix dst address when reusing a connection
- REGTESTS: explicitly use "balance roundrobin" where RR is needed
- MAJOR: backend: switch the default balancing algo to "random"
- BUG/MEDIUM: conn: fix UAF on connection after reversal on edge
- BUG/MINOR: connection: streamline conn detach from lists
- BUG/MEDIUM: quic-be: too early SSL_SESSION initialization
- BUG/MINOR: log: fix potential memory leak upon error in add_to_logformat_list()
- MEDIUM: init: always warn when running as root without being asked to
- MINOR: sample: Add base2 converter
- MINOR: version: add -vq, -vqb, and -vqs flags for concise version output
- BUILD: trace: silence a bogus build warning at -Og
- MINOR: trace: accept trace spec right after "-dt" on the command line
- BUILD: makefile: bump the default minimum linux version to 4.17
BUILD: makefile: bump the default minimum linux version to 4.17
As explained during the 3.3-dev7 announcement below:
https://www.mail-archive.com/haproxy@formilux.org/msg46073.html
no regularly maintained distro supports a kernel older than 4.18 anymore,
and KTLS is supported since 4.17. So it's about the right moment to bump
the default minimum kernel version supported by glibc and musl to
automatically cover new features. The linux-glibc-legacy target still
supports 2.6.28 and above.
MINOR: trace: accept trace spec right after "-dt" on the command line
I continue to mistakenly set the traces using "-dtXXX" and to have to
refer to the doc to figure that it requires a separate argument and
differs from some other options. Worse, "-dthelp" doesn't say anything
and silently ignores the argument.
Let's make the parser take whatever follows "-dt" as the argument if
present, otherwise take the next one (as it currently does). Doing
this even allows to simplify the code, and is easier to figure the
syntax since "-dthelp" now works.
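The parsing rule can be sketched as below. This is a simplified illustration and not the actual command line parser: dt_arg() is a hypothetical helper, and the real code may apply extra rules to decide whether the next word is consumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the argument of "-dt" found at argv[*i].
 * If text follows "-dt" in the same word, use it. Otherwise consume the
 * next word when there is one. Returns NULL when no argument is available
 * or when argv[*i] is not a "-dt" option.
 */
static const char *dt_arg(int argc, char **argv, int *i)
{
	const char *opt = argv[*i];

	if (strncmp(opt, "-dt", 3) != 0)
		return NULL;
	if (opt[3] != '\0')
		return opt + 3;       /* "-dthelp" => "help" */
	if (*i + 1 < argc)
		return argv[++(*i)];  /* "-dt help" => "help" */
	return NULL;
}
```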
BUILD: trace: silence a bogus build warning at -Og
gcc-13.3 at -Og emits an incorrect build warning in trace.c about a
possibly uninitialized variable:
In file included from include/haproxy/api.h:35,
from src/trace.c:22:
src/trace.c: In function 'trace_parse_cmd':
include/haproxy/bug.h:431:17: warning: 'arg' may be used uninitialized [-Wmaybe-uninitialized]
431 | free(*__x); \
| ^~~~~~~~~~
src/trace.c:1136:9: note: in expansion of macro 'ha_free'
1136 | ha_free(&oarg);
| ^~~~~~~
src/trace.c:1008:15: note: 'arg' was declared here
1008 | char *arg, *oarg;
| ^~~
The warning is obviously wrong since the field is initialized in one of
the two branches of an "if" whose complementary one returns. But the
compiler doesn't seem to see this because the if is in fact two ifs each
with an opposite condition: "if (arg_src)" then "if (!arg_src)". Let's
just move upwards the default one that returns and eliminate the other
one. Reading the diff with "git diff -b" better shows the tiny change.
MINOR: version: add -vq, -vqb, and -vqs flags for concise version output
This patch introduces three new command line flags to display HAProxy version
info more flexibly:
- `-vqs` outputs the short version string without commit info (e.g., "3.3.1").
- `-vqb` outputs only the branch (major.minor) part of the version (e.g., "3.3").
- `-vq` outputs the full version string with suffixes (e.g., "3.3.1-dev5-1bb975-71").
This allows easier parsing of version info in automation while keeping existing -v and -vv behaviors.
The command line argument parsing now calls `display_version_plain()` with a
display_mode parameter to select the desired output format. The function handles
stripping of commit or patch info as needed, depending on the mode.
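The stripping performed for the short and branch modes can be approximated as follows. This is only a sketch based on the examples above. The helper names are hypothetical and this is not the display_version_plain() implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first <n> bytes of <full> into <out>
 * (of <size> bytes), truncating if needed, and return <out>.
 */
static char *copy_prefix(char *out, size_t size, const char *full, size_t n)
{
	if (size == 0)
		return out;
	if (n >= size)
		n = size - 1;
	memcpy(out, full, n);
	out[n] = '\0';
	return out;
}

/* Short version: everything before the first '-' (e.g. "3.3.1"). */
static char *version_short(char *out, size_t size, const char *full)
{
	return copy_prefix(out, size, full, strcspn(full, "-"));
}

/* Branch: major.minor, i.e. everything before the second '.' or the
 * first '-', whichever comes first (e.g. "3.3").
 */
static char *version_branch(char *out, size_t size, const char *full)
{
	size_t n = strcspn(full, ".-");

	if (full[n] == '.')
		n += 1 + strcspn(full + n + 1, ".-");
	return copy_prefix(out, size, full, n);
}
```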
This commit adds the base2 converter to turn binary input into its
string representation. Each input byte is converted into a series of
eight characters which are either 0s or 1s by bit-wise comparison.
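A minimal sketch of such a conversion is shown below. It is not the converter's code: to_base2() is a hypothetical helper, and emitting the most significant bit first is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: write the base-2 text form of <len> bytes from <in>
 * into <out>, which must hold at least len * 8 + 1 bytes. Bits are emitted
 * most significant first, which is an assumption of this sketch.
 */
static void to_base2(const unsigned char *in, size_t len, char *out)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++)
		for (bit = 7; bit >= 0; bit--)
			*out++ = (in[i] & (1u << bit)) ? '1' : '0';
	*out = '\0';
}
```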
MEDIUM: init: always warn when running as root without being asked to
Like many exposed network daemons, haproxy does normally not need to run
as root and strongly recommends against this, unless strictly necessary.
On some operating systems, capabilities even totally alleviate this need.
Lately, maybe due to a rise in containerization or automated config
generation or a bit of both, we've observed a resurgence of this bad
practice, possibly due to the fact that users are just not aware of the
conditions under which they're running their daemon.
Let's add a warning at boot when starting as root without having requested
it using "uid" or "user". And take this opportunity for warning the user
about the existence of capabilities when supported, and encouraging the
use of a chroot.
This is achieved by leaving global.uid set to -1 by default, allowing us
to detect if it was explicitly set or not.
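The detection relies on a sentinel value and can be pictured with the sketch below. The function name is hypothetical and the real check may involve more conditions. Only the -1 default for an unset uid comes from the description above.

```c
#include <sys/types.h>

/* Hypothetical sketch: <cfg_uid> stands for the configured uid, left at -1
 * when neither "uid" nor "user" was given. <euid> is the effective uid of
 * the process (for example the value returned by geteuid()).
 * Returns non-zero when a warning should be emitted.
 */
static inline int should_warn_running_as_root(int cfg_uid, uid_t euid)
{
	return euid == 0 && cfg_uid == -1;
}
```

An explicit "uid 0" in the configuration sets the value to 0 instead of -1, so the warning is not emitted for users who deliberately asked to run as root.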
BUG/MINOR: log: fix potential memory leak upon error in add_to_logformat_list()
As reported on GH #3099, upon memory error add_to_logformat_list() will
return an error but it fails to properly free the memory which was allocated
within the function, which could result in a memory leak.
Let's free all relevant variables allocated by the function before returning.
No backport needed unless 22ac1f5ee ("BUG/MINOR: log: Add OOM checks for
calloc() and malloc() in logformat parser and dup_logger()") is.
BUG/MEDIUM: quic-be: too early SSL_SESSION initialization
When an SNI is set on a QUIC server line, ssl_sock_set_servername() is called
from connect_server() (backend.c). This leads some BUG_ON() to be triggered
because the CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS were not set. This must
be done in the ->init() xprt callback. This patch moves the flag settings
from the ->start() to the ->init() callback.
Indeed, connect_server() calls these functions in this order:
->init(),
ssl_sock_set_servername() # => crash if CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS not set
->start()
Furthermore ssl_sock_set_servername() has a side effect to reset the SSL_SESSION
object (attached to SSL object) calling SSL_set_session(), leading to crashes as follows:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./haproxy -f quic_srv.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
at ssl/statem/statem_clnt.c:1624
1624 if (s->session->session_id_length > 0) {
[Current thread is 1 (Thread 0x7fc364e53dc0 (LWP 35514))]
(gdb) bt
#0 tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
at ssl/statem/statem_clnt.c:1624
#1 0x00007fc36540fba4 in ossl_statem_client_process_message (s=0x560c259733b0,
pkt=0x7fffac239f20) at ssl/statem/statem_clnt.c:1042
#2 0x00007fc36540d028 in read_state_machine (s=0x560c259733b0) at ssl/statem/statem.c:646
#3 0x00007fc36540ca70 in state_machine (s=0x560c259733b0, server=0)
at ssl/statem/statem.c:439
#4 0x00007fc36540c576 in ossl_statem_connect (s=0x560c259733b0) at ssl/statem/statem.c:250
#5 0x00007fc3653f1698 in SSL_do_handshake (s=0x560c259733b0) at ssl/ssl_lib.c:3835
#6 0x0000560c22620327 in qc_ssl_do_hanshake (qc=qc@entry=0x560c25961f60,
ctx=ctx@entry=0x560c25963020) at src/quic_ssl.c:863
#7 0x0000560c226210be in qc_ssl_provide_quic_data (len=90, data=<optimized out>,
ctx=0x560c25963020, level=ssl_encryption_initial, ncbuf=0x560c2588bb18)
at src/quic_ssl.c:1071
#8 qc_ssl_provide_all_quic_data (qc=qc@entry=0x560c25961f60, ctx=0x560c25963020)
at src/quic_ssl.c:1123
#9 0x0000560c2260ca5f in quic_conn_io_cb (t=0x560c25962f80, context=0x560c25961f60,
state=<optimized out>) at src/quic_conn.c:791
#10 0x0000560c228255ed in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:648
#11 0x0000560c22825f7a in process_runnable_tasks () at src/task.c:889
#12 0x0000560c22793dc7 in run_poll_loop () at src/haproxy.c:2836
#13 0x0000560c22794481 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3056
#14 0x0000560c2259082d in main (argc=<optimized out>, argv=<optimized out>)
at src/haproxy.c:3667
<s> is the SSL object, and <s->session> is the SSL_SESSION object.
For the client, this is the first call to SSL_do_handshake() which initializes this
SSL_SESSION object from the ->init() xprt callback. Then it is reset by
ssl_sock_set_servername(), then the tls_process_server_hello() TLS stack function
is called with a NULL value for s->session when receiving the ServerHello TLS message.
To fix this, simply move the first call to SSL_do_handshake() to the ->start() xprt
callback (qc_xprt_start()).
BUG/MINOR: connection: streamline conn detach from lists
Over their lifetime, connections are attached to different lists. These
lists depend on whether the connection is on the frontend or backend side.
Attach point members are stored via a union in struct connection. The
next commit reorganizes them so that a proper frontend/backend
separation is performed :
On conn_free(), connection instance must be removed from these lists to
ensure there is no use-after-free case. However code was still shaky
there, despite no real issue. Indeed, <toremove_list> was detached for
all connections, despite being used on backend side only.
This patch streamlines the freeing of connection. Now, <toremove_list>
detach is performed in conn_backend_deinit(). Moreover, a new helper
conn_frontend_deinit() is defined. It ensures that <stopping_list>
detach is done. Previously, it was performed individually by muxes.
Note that a similar procedure is performed when the connection is
reversed. Hence, conn_frontend_deinit() is now used here as well,
rendering reversal from FE to BE or vice versa symmetrical.
As mentioned above, no crash occurred prior to this patch, but the code
was fragile, in particular access to <toremove_list> for frontend
connections. Thus this patch is considered as a bug fix worthy of a
backport along with the above mentioned patch, currently up to 3.0.
BUG/MEDIUM: conn: fix UAF on connection after reversal on edge
When a connection is reversed, some elements must be reset prior to
reusing it. Most notably, connection must be removed from lists specific
to frontend/backend sides.
When reverse was performed from frontend to backend side, connection was
not removed via its <stopping_list> attach point. On previous releases,
this did not cause any issue. However, crashes started to occur recently,
probably due to the recent reorganization of connection list attach
points from the following patch.
MAJOR: backend: switch the default balancing algo to "random"
For many years, an unset load balancing algorithm would use "roundrobin".
It was shown several times that "random" with at least 2 draws (the
default) generally provides better performance and fairness in that
it will automatically adapt to the server's load and capacity. This
was further described with numbers in this discussion:
BTW there was no objection and only support for the change.
The goal of this patch is to change the default algo when none is
specified, from "roundrobin" to "random". This way, users who don't
care and don't set the load balancing algorithm will benefit from a
better one in most cases, while those who have good reasons to prefer
roundrobin (for session affinity or for reproducible sequences like used
in regtests) can continue to specify it.
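To illustrate why drawing twice adapts to load, here is a minimal sketch
of a "pick the least loaded of N random draws" selection. It is not
HAProxy's implementation: struct srv_sketch, its fields and the use of
rand() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical server: only the fields needed for the illustration. */
struct srv_sketch {
	unsigned int weight;  /* relative capacity, must be > 0 */
	unsigned int load;    /* current number of in-flight requests */
};

/* Return the index of the less loaded of two candidates. Loads are
 * compared relative to weights, cross-multiplied to avoid divisions.
 * On a tie, <a> is kept.
 */
static size_t less_loaded(const struct srv_sketch *s, size_t a, size_t b)
{
	unsigned long long la = (unsigned long long)s[a].load * s[b].weight;
	unsigned long long lb = (unsigned long long)s[b].load * s[a].weight;

	return (lb < la) ? b : a;
}

/* Draw <draws> servers at random among <n> and keep the least loaded. */
static size_t pick_random(const struct srv_sketch *s, size_t n, int draws)
{
	size_t best;

	assert(n > 0 && draws > 0);
	best = (size_t)rand() % n;
	while (--draws > 0)
		best = less_loaded(s, best, (size_t)rand() % n);
	return best;
}
```

With a single draw this degenerates to a purely random choice; the second
draw is what steers traffic away from a server that is already busy.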
The vast majority of users should not notice a difference.
REGTESTS: explicitly use "balance roundrobin" where RR is needed
A few tests explicitly rely on the server ordering granted by
"balance roundrobin", but didn't specify the balance algorithm.
As it will change soon, let's make it explicit.
BUG/MINOR: check: fix dst address when reusing a connection
The keyword check-reuse-pool allows reusing an idle connection to
perform a health check instead of opening a new one. It is implemented
similarly to HTTP transfer reuse : a hash is calculated with a subset of
properties to lookup a connection with the same characteristics.
One of these properties is the destination address. Initially it was
always set to NULL prior to reuse check, as this is necessary to match
connections on a reverse-HTTP server. However, this prevents reuse on
other servers with a proper address configured. Indeed, in this case
destination address is always used as key for connections inserted in
idle pool.
This patch fixes this by properly setting destination address for check
reuse. By default, it reuses the address from the server. The only
exception is if the server is using reverse-HTTP, in which case address
remains NULL.
A new test is also performed prior to attempting check reuse to ensure
this is not performed on a transparent server. Indeed, in this case the
server address would be unset. Anyway, check cannot reuse a connection in
this case so this is OK. Note that this does not prevent the check from
continuing with a new connection with a NULL address : this should be
handled more properly in another patch.
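The address selection described in this message can be summarized by the
following sketch. All names (struct check_srv_sketch, check_reuse_dst()
and the flags) are hypothetical and only model the three cases: regular
server, reverse-HTTP server and transparent server.

```c
#include <assert.h>
#include <stddef.h>

struct addr_sketch {
	unsigned int   ip;
	unsigned short port;
};

struct check_srv_sketch {
	int is_rhttp;            /* reverse-HTTP: idle conns keyed without address */
	int is_transparent;      /* transparent: no address configured */
	struct addr_sketch addr;
};

/* Selects the destination address used to look up an idle connection.
 * Returns 1 if reuse may be attempted (with *dst possibly NULL), or 0
 * if reuse must be skipped.
 */
static int check_reuse_dst(const struct check_srv_sketch *srv,
                           const struct addr_sketch **dst)
{
	assert(srv && dst);
	*dst = NULL;
	if (srv->is_transparent)
		return 0;                /* no address to match against */
	if (!srv->is_rhttp)
		*dst = &srv->addr;       /* default: server address */
	return 1;
}
```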
BUG/MINOR: check: ensure check-reuse is compatible with SSL
SSL may be activated implicitly if a server relies on SSL, even without
check-ssl keyword. This is performed by init_srv_check() function. The
main operation is to change xprt layer for check to SSL.
Prior to this patch, <use_ssl> check member was also set, despite not
being strictly necessary. This has a negative side-effect of rendering
check-reuse-pool ineffective. Indeed, reuse on check is only performed
if no specific check configuration has been specified (see
tcpcheck_use_nondefault_connect()).
This patch fixes check reuse with SSL : <use_ssl> is not set in case SSL
is inherited implicitly from server configuration. Thus, <use_ssl> is
now only set if an explicit check-ssl keyword is set, which disables
connection reuse for check.
MEDIUM: stats-file: add some BUG_ON() guards to ensure exported structs are not changed by accident
Add two BUG_ON() in shm_stats_file_prepare() which will trigger if
exported structures (shm_stats_file_hdr and shm_stats_file_object) change
in size, because it means that they will become incompatible with older
versions and thus precautions should be taken by the developer to ensure
compatibility with older versions, or at least detect incompatible
versions by changing the version number to prevent bugs resulting
from inconsistent mapping between versions. The BUG_ON() may be
safely adjusted then.
Please note that it doesn't protect against accidental struct member
re-ordering if the resulting struct size is equal.
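The limitation mentioned above can be shown with a small sketch. The
structs and the EXPORTED_V1_SIZE constant are hypothetical, and assert()
stands in for BUG_ON(): the size guard passes for a struct whose members
were swapped, while the member offsets differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exported struct, fixed-width members only. */
struct exported_v1 {
	uint32_t version;
	uint32_t flags;
	uint64_t counter;
};

/* Same members with two of them swapped: same size, different mapping. */
struct exported_reordered {
	uint32_t flags;
	uint32_t version;
	uint64_t counter;
};

/* Size recorded when the format was frozen. */
#define EXPORTED_V1_SIZE 16

/* Stand-in for the BUG_ON() guard: abort if the exported size changed. */
static void exported_prepare_check(void)
{
	assert(sizeof(struct exported_v1) == EXPORTED_V1_SIZE);
}
```

Checking offsetof() on selected members in addition to the size would
catch such a swap, at the cost of more guards to maintain.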
MINOR: stats-file: reserve some bytes in exported structs
We may need additional struct members in shm_stats_file_object and
shm_stats_file_hdr, yet since these structs are exported they should
not change in size nor ordering else it would require a version change
to break compatibility on purpose since mapping would differ.
Here we reserve 64 additional bytes in shm_stats_file_object, and
128 bytes in shm_stats_file_hdr for future usage.
Document some byte holes and fix some potential alignment issues
between 32 and 64 bits architectures to ensure the shm_stats_file memory
mapping is consistent between operating systems.
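As an illustration of what documenting holes and reserving bytes may look
like, here is a sketch with a hypothetical layout. Only the 64 reserved
bytes figure comes from this message; the member names and order are
assumptions and do not reflect the real shm_stats_file_object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct shm_obj_sketch {
	uint8_t  type;
	uint8_t  pad0[3];       /* documented 3-byte hole before <tgid> */
	uint32_t tgid;
	uint64_t users;         /* at offset 8 on both 32 and 64-bit ABIs */
	uint8_t  reserved[64];  /* spare room for future members */
};

/* Compile-time layout checks: any accidental hole or resize fails here. */
static_assert(offsetof(struct shm_obj_sketch, tgid) == 4, "unexpected hole");
static_assert(offsetof(struct shm_obj_sketch, users) == 8, "unexpected hole");
static_assert(offsetof(struct shm_obj_sketch, reserved) == 16, "unexpected hole");
static_assert(sizeof(struct shm_obj_sketch) == 80, "size must not change");
```

A future member can then be carved out of <reserved> by shrinking the
array by the member's size, which leaves the total size and the offsets
of existing members unchanged.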
same as THREAD_PAD() but doesn't depend on haproxy being compiled with
thread support. It may be useful for memory (or files) that may be
shared between multiple processes.
OPTIM: stats-file: don't unnecessarily die hard on shm_stats_file_reuse_object()
shm_stats_file_reuse_object() has a non negligible cost, especially if
the shm file contains a lot of objects because the function scans the
whole shm file to find available slots.
During startup, if no existing objects could be mapped in the shm
file, shm_stats_file_add_object() is called for each object (server, fe,
be or listener) with a GUID set. On large configs it means
shm_stats_file_add_object() could be called a lot of times in a row.
With current implementation, each shm_stats_file_add_object() call
leverages shm_stats_file_reuse_object(), so the more objects are defined
in the config, the slower the startup will be.
To try to optimize startup time a bit with large configs, we don't
systematically call shm_stats_file_reuse_object(), especially when we
know that the previous attempt to reuse objects failed. In this case
we add a small tempo between failed attempts to reuse objects because
we assume the new attempt will probably fail anyway. (For slots to
become available, either an old process has to clean its entries,
or they have to time out which implies that the clock needs to be updated)
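The throttling idea can be sketched as below. This is not the code from
the patch: struct reuse_state, the helper names and the 50ms delay are
all hypothetical, the message does not state the actual delay.

```c
#include <assert.h>
#include <stdint.h>

#define REUSE_RETRY_DELAY_MS 50  /* hypothetical value */

struct reuse_state {
	int      last_failed;      /* previous reuse attempt found no slot */
	uint64_t last_attempt_ms;  /* date of the previous attempt */
};

/* Returns 1 if scanning the file for a reusable slot is worth trying. */
static int reuse_should_try(const struct reuse_state *st, uint64_t now_ms)
{
	assert(now_ms >= st->last_attempt_ms);  /* clock must not go backwards */
	if (!st->last_failed)
		return 1;
	return now_ms - st->last_attempt_ms >= REUSE_RETRY_DELAY_MS;
}

/* Records the outcome of a reuse attempt made at <now_ms>. */
static void reuse_record(struct reuse_state *st, uint64_t now_ms, int found)
{
	st->last_failed = !found;
	st->last_attempt_ms = now_ms;
}
```

When reuse_should_try() returns 0, the caller would directly append a new
object instead of paying for a full scan that is likely to fail again.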
Add some documentation for "shm-stats-file" and
"shm-stats-file-max-objects" experimental directives related to the use
of shared memory for storing stats counters (see previous commits for
implementation details)
MEDIUM: stats-file/counters: store and preload stats counters as shm file objects
This is the last patch of the shm stats file series, in this patch we
implement the logic to store and fetch shm stats objects and associate
them to existing shared counters on the current process.
Shm objects are stored in the same memory location as the shm stats file
header. In fact they are stored right after it. All objects (struct
shm_stats_file_object) have the same size (no matter their type), which
allows for easy object traversal without having to check the object's
type, and could permit the use of external tools to scan the SHM in the
future. Each object stores a guid (of GUID_MAX_LEN+1 size) and tgid
which allows to match corresponding shared counters indexes. Also,
as stated before, each object stores the list of users making use of
it. Objects are never released (the map can only grow), but unused
objects (when no more users or active users are found in objects->users)
are automatically recycled. Also, each object stores its
type which defines how the object generic data member should be handled.
Upon startup (or reload), haproxy first tries to scan existing shm to
find objects that could be associated to frontends, backends, listeners
or servers in the current config based on GUID. For associations that
couldn't be made, haproxy will automatically create missing objects in
the SHM during late startup. When haproxy matches with an existing object,
it means the counter from an older process is preserved in the new
process, so multiple processes temporarily share the same counter for as
long as required for older processes to eventually exit.
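The lookup, recycle and create sequence described above can be sketched
as follows. The struct is a simplified stand-in for struct
shm_stats_file_object, GUID_MAX_LEN_SKETCH is an assumed value, and the
helpers are hypothetical. The sketch ignores concurrency, which real
shared-memory code would have to handle with atomic operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUID_MAX_LEN_SKETCH 127  /* assumed value for the illustration */

struct obj_sketch {
	char     guid[GUID_MAX_LEN_SKETCH + 1];
	uint32_t tgid;
	uint8_t  type;
	uint64_t users;  /* bitmask of process slots using the object */
};

/* Linear scan: all objects have the same size, whatever their type. */
static struct obj_sketch *obj_find(struct obj_sketch *objs, size_t count,
                                   const char *guid, uint32_t tgid)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (objs[i].tgid == tgid && strcmp(objs[i].guid, guid) == 0)
			return &objs[i];
	}
	return NULL;
}

/* Match an existing object, else recycle an unused one, else append.
 * Returns NULL when the map is full and nothing can be recycled.
 */
static struct obj_sketch *obj_get(struct obj_sketch *objs, size_t *count,
                                  size_t max, const char *guid,
                                  uint32_t tgid, unsigned int slot)
{
	struct obj_sketch *o;
	size_t i;

	assert(slot < 64 && strlen(guid) <= GUID_MAX_LEN_SKETCH);

	o = obj_find(objs, *count, guid, tgid);
	if (!o) {
		for (i = 0; i < *count && !o; i++) {
			if (!objs[i].users)
				o = &objs[i];       /* recycle an unused object */
		}
		if (!o && *count < max)
			o = &objs[(*count)++];      /* the map can only grow */
		if (!o)
			return NULL;
		memset(o, 0, sizeof(*o));
		strcpy(o->guid, guid);
		o->tgid = tgid;
	}
	o->users |= 1ULL << slot;                   /* register as a user */
	return o;
}
```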
MINOR: stats-file: add process slot management for shm stats file
Now that all processes tied to the same shm stats file share a
common clock source, we introduce the process slot notion in this
patch.
Each living process registers itself in a map at a free index: each slot
stores information about the process' PID and heartbeat. Each process is
responsible for updating its heartbeat, a slot is considered as "free" if
the heartbeat was never set or if the heartbeat is expired (60 seconds of
inactivity). The total number of slots is set to 64, this is on purpose
because it allows to easily store the "users" of a given shm object using
a 64-bit bitmask. Given that when haproxy is reloaded older processes
are supposed to die eventually, it should be large enough (64 simultaneous
processes) to be safe. If we manage to reach this limit someday, more
slots could be added by splitting the "users" bitmask across multiple
64-bit variables.
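A minimal sketch of the slot logic follows. The 64 slots and the 60
seconds timeout come from the description above; the struct, the helper
names and the convention that a zero heartbeat means "never set" are
assumptions. The sketch is single-process and does not use the atomic
operations a real shared map would need.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_MAX        64
#define SLOT_TIMEOUT_MS  60000

struct slot_sketch {
	int      pid;
	uint64_t heartbeat_ms;  /* 0 means "never set" */
};

/* A slot is free if its heartbeat was never set or has expired. */
static int slot_is_free(const struct slot_sketch *s, uint64_t now_ms)
{
	if (!s->heartbeat_ms)
		return 1;
	return now_ms >= s->heartbeat_ms + SLOT_TIMEOUT_MS;
}

/* Registers <pid> in the first free slot. Returns its index or -1. */
static int slot_acquire(struct slot_sketch *slots, int pid, uint64_t now_ms)
{
	int i;

	assert(now_ms > 0);
	for (i = 0; i < SLOTS_MAX; i++) {
		if (slot_is_free(&slots[i], now_ms)) {
			slots[i].pid = pid;
			slots[i].heartbeat_ms = now_ms;
			return i;
		}
	}
	return -1;
}

/* One bit per slot: 64 slots fit exactly in a 64-bit "users" mask. */
static uint64_t users_add(uint64_t users, int slot)
{
	assert(slot >= 0 && slot < SLOTS_MAX);
	return users | (1ULL << slot);
}

static int users_has(uint64_t users, int slot)
{
	assert(slot >= 0 && slot < SLOTS_MAX);
	return (users >> slot) & 1;
}
```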
MEDIUM: stats-file: processes share the same clock source from shm-stats-file
The use of the "shm-stats-file" directive now implies that all processes
using the same file share a common clock source. This is required
for consistency regarding time-related operations.
The clock source is stored in the shm stats file header.
When the directive is set, all processes share the same clock
(global_now_ms and global_now_ns both point to variables in the map),
this is required for time-based counters such as freq counters to work
consistently. Since all processes manipulate global clock with atomic
operations exclusively during runtime, and don't systematically rely
on it (thanks to local now_ms and now_ns), it is pretty much transparent.
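The kind of atomic update this relies on can be sketched with C11
atomics. This is not HAProxy's clock code: struct clock_sketch and
clock_update() are hypothetical, and the sketch assumes that 64-bit
atomics are lock-free on the target so that they work across processes
sharing the mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: in a real setup this would live in the shared mapping. */
struct clock_sketch {
	_Atomic uint64_t global_now_ns;
};

/* Publishes the local time if it is newer than the global one. Returns
 * the resulting global value, which never goes backwards.
 */
static uint64_t clock_update(struct clock_sketch *clk, uint64_t local_now_ns)
{
	uint64_t old;

	assert(clk);
	old = atomic_load(&clk->global_now_ns);
	while (old < local_now_ns) {
		if (atomic_compare_exchange_weak(&clk->global_now_ns, &old,
		                                 local_now_ns))
			return local_now_ns;
		/* <old> was refreshed by the failed CAS, check it again */
	}
	return old;
}
```

A process whose local clock is late simply gets the global value back,
so concurrent updaters can only move the shared clock forward.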
Add initial support for the "shm-stats-file" directive and
associated "shm-stats-file-max-objects" directive. For now they are
flagged as experimental directives.
The shared memory file is automatically created by the first process.
The file is created using open() so it is up to the user to provide a
relevant path (either on regular filesystem or ramfs for performance
reasons). The directive takes only one argument which is the path of the
shared memory file. It is passed as-is to open().
The maximum number of objects per thread-group (hard limit) that can be
stored in the shm is defined by the "shm-stats-file-max-objects" directive
and defaults to 2k, which means approximately 1MB max per thread group
and should cover most setups. Upon initial creation, the main shm stats
file header is provisioned with the version, which must remain the same
to be compatible between processes. When the limit is reached (during
startup) an error is reported by haproxy which invites the user to
increase "shm-stats-file-max-objects" if desired, but this means more
memory will be allocated. Actual memory usage is low at start, because
only the mmap (mapping) is provisioned with the maximum number of objects
to avoid relocating the memory area during runtime, but the actual shared
memory file is dynamically resized when objects are added (resized by
following half power of 2 curve when new objects are added, see upcoming
commits).
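The "map the maximum, grow the file on demand" approach can be sketched
with plain POSIX calls. This is an illustration under assumptions, not
the code from the series: struct shm_map_sketch and both helpers are
hypothetical, and the growth policy is left to the caller. Touching
memory beyond the current file size would raise SIGBUS, hence the file
must be grown before new objects are written.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct shm_map_sketch {
	int    fd;
	void  *base;       /* start of the mapping, never relocated */
	size_t map_size;   /* reserved address space (maximum) */
	size_t file_size;  /* bytes currently backed by the file */
};

/* Opens (or creates) <path> and maps <max_size> bytes of it at once. */
static int shm_map_open(struct shm_map_sketch *m, const char *path,
                        size_t max_size)
{
	struct stat st;

	m->fd = open(path, O_RDWR | O_CREAT, 0600);
	if (m->fd < 0)
		return -1;
	if (fstat(m->fd, &st) < 0)
		goto fail;
	m->base = mmap(NULL, max_size, PROT_READ | PROT_WRITE, MAP_SHARED,
	               m->fd, 0);
	if (m->base == MAP_FAILED)
		goto fail;
	m->map_size = max_size;
	m->file_size = (size_t)st.st_size;
	return 0;
fail:
	close(m->fd);
	return -1;
}

/* Grows the backing file up to <new_size>; the mapping stays in place. */
static int shm_map_grow(struct shm_map_sketch *m, size_t new_size)
{
	assert(new_size <= m->map_size);
	if (new_size <= m->file_size)
		return 0;
	if (ftruncate(m->fd, (off_t)new_size) < 0)
		return -1;
	m->file_size = new_size;
	return 0;
}
```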
For now only the file is created, further logic will be implemented in
upcoming commits.