DEBUG: pollers/fd: add thread id suffix to per-thread memory areas name hints
Willy reported that since abb8412d2 ("DEBUG: pollers: add name hint for
large memory areas used by pollers") and 22ec2ad8b ("DEBUG: fd: add name
hint for large memory areas") multiple maps with the same name could be
found in /proc/<pid>/maps when haproxy process is started with multiple
threads, which can be annoying.
In fact this happens because some poller and fd-created memory areas are
being created for each available thread, and since the naming was done
using vma_set_name() with the same <type> and <name> inputs, the resulting
name was the same for all threads.
Thanks to the previous commit, we now use vma_set_name_id() for naming
per-thread memory areas so that a "-id" suffix is appended after the
name, where "id" equals 'tid+1' (to match the thread numbering logic
found in the config file or in ha_panic() reports), making it easy to
identify which haproxy thread owns a given map in /proc/<pid>/maps.
The new helper is just like vma_set_name() from 51a8f134e ("DEBUG: tools:
add vma_set_name() helper"), but also takes <id> as a parameter to append
a "-$id" suffix after the name in order to differentiate 2 areas that were
named using the same <type> and <name> combination.
For example, using mmap + MAP_SHARED|MAP_ANONYMOUS:
  7364c4fff000-736508000000 rw-s 00000000 00:01 3540   [anon_shmem:type:name-id]
Another example, using mmap + MAP_PRIVATE|MAP_ANONYMOUS or using
glibc/malloc() above MMAP_THRESHOLD:
  7364c4fff000-736508000000 rw-p 00000000 00:00 0      [anon:type:name-id]
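As an illustration, a minimal self-contained sketch of how such naming can
be done with the Linux prctl() API (the function name and buffer size are
illustrative, not HAProxy's exact code; PR_SET_VMA_ANON_NAME requires
Linux >= 5.17 and errors are deliberately ignored):

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_VMA
  #define PR_SET_VMA 0x53564d41
  #define PR_SET_VMA_ANON_NAME 0
  #endif

  /* Name an anonymous memory area as "type:name-id" so that per-thread
   * areas can be told apart in /proc/<pid>/maps. Failures are ignored
   * (older kernels simply lack PR_SET_VMA_ANON_NAME).
   */
  static void vma_set_name_id_sketch(void *area, size_t size,
                                     const char *type, const char *name,
                                     unsigned int id)
  {
      char fullname[80];

      snprintf(fullname, sizeof(fullname), "%s:%s-%u", type, name, id);
      prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
            (unsigned long)area, size, (unsigned long)fullname);
  }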
CLEANUP: tools: fix vma_set_name() function comment
There was a typo in the example provided in vma_set_name(): maps named
using the function will show up as "type:name", not "type.name". Update
the comment to reflect the actual behavior.
Willy Tarreau [Thu, 23 May 2024 18:31:45 +0000 (20:31 +0200)]
MINOR: stick-tables: remove the unneeded read lock in stksess_free()
During changes made in 2.7 by commits 8d3c3336f9 ("MEDIUM: stick-table:
make stksess_kill_if_expired() avoid the exclusive lock") and 996f1a5124
("MEDIUM: stick-table: do not take a lock to update t->current anymore."),
the operation was done cautiously one baby step at a time and the final
cleanup was not done, as we're keeping a read lock under an atomic dec.
Furthermore there's a pool_free() call under that lock, and we try to
avoid pool_alloc() and pool_free() under locks for their nasty side
effects (e.g. when memory gets recompacted), so let's really drop it
now.
Note that the performance gain is not really perceptible here; it's
essentially for code clarity reasons that this has to be done.
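A minimal model of the resulting code path (illustrative only, not the
actual HAProxy sources): the counter update is a single atomic operation,
and the free happens with no lock held:

  #include <stdatomic.h>
  #include <stdlib.h>

  struct table_model { atomic_int current; };

  static void sess_free_model(struct table_model *t, void *sess)
  {
      atomic_fetch_sub(&t->current, 1); /* lockless counter update */
      free(sess);                       /* freed outside any lock */
  }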
Willy Tarreau [Thu, 23 May 2024 18:24:35 +0000 (20:24 +0200)]
CLEANUP: stick-tables: remove a few unneeded tests for use_wrlock
The code in stktable_touch_with_exp() used to be the same as in other
functions, built around a loop that first tried to upgrade a read lock
and then fell back to a direct write lock. As a result, there remains a
confusing construct with multiple tests on use_wrlock, which is obviously
zero when tested. Let's remove them since the value is known and the loop
does not exist anymore.
Willy Tarreau [Thu, 23 May 2024 18:06:25 +0000 (20:06 +0200)]
BUG/MEDIUM: stick-tables: make sure never to create two same remote entries
In GH issue #2552, Christian Ruppert reported an increase in crashes
with recent 3.0-dev versions, always related to stick-tables and peers.
One particularity of his config is that it has a lot of peers.
While trying to reproduce, it was empirically found that firing 10 load
generators at 10 different haproxy instances tracking a random key among
100k against a table of max 5k entries, on 8 threads and with a total
of 50 parallel peers, managed to reproduce the crashes in seconds, very
often in ebtree deletion or insertion code, but not only.
The debugging revealed that the crashes are often caused by a parent node
being corrupted while delete/insert tries to update it regarding a recently
inserted/removed node, and that the corrupted node had always been proven
to be deleted, then immediately freed, so it ought not to be visited in the
tree from functions enclosed between a pair of lock/unlock. As such the
only possibility was that it had experienced unexpected inserts. Also,
running with pool integrity checking would 90% of the time cause crashes
during allocation based on corrupted contents in the node, likely because
it was found at two places in the same tree and still present as a parent
of a node being deleted or inserted (hence the __stksess_free and
stktable_trash_oldest callers being visible on these items).
Indeed the issue is in fact related to the test set (occasionally redundant
keys, many peers). What happens is that sometimes, the same key is learned
from two different peers. When it is learned for the first time, we end up
in stktable_touch_with_exp() in the "else" branch, where the test for
existence is made before taking the lock (since commit cfeca3a3a3
("MEDIUM: stick-table: touch updates under an upgradable read lock") that
was merged in 2.9), and from there the entry is added. But if one of the
threads manages to insert it before the other thread takes the lock, then
the second thread will try to insert this node again. And inserting an
already inserted node will corrupt the tree (note that we never switched
to enforcing a check in insertion code on this due to API history that
would break various code parts).
Here the solution is simple: recheck leaf_p after getting the lock, to
avoid touching anything if the entry has already been inserted in the
meantime.
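A minimal model of the recheck pattern (illustrative, not the actual
patch; a boolean stands in for ebtree's leaf_p test, which tells whether
a node is attached to a tree):

  #include <pthread.h>
  #include <stdbool.h>

  struct node_model { bool in_tree; /* stands in for the leaf_p test */ };

  static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

  static void touch_model(struct node_model *n)
  {
      pthread_rwlock_wrlock(&updt_lock);
      if (!n->in_tree)        /* recheck under the lock: another thread
                               * may have inserted it in the meantime */
          n->in_tree = true;  /* the real code inserts into the tree */
      pthread_rwlock_unlock(&updt_lock);
  }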
Many thanks to Christian Ruppert for testing this and for his invaluable
help on this hard-to-trigger issue.
BUG/MEDIUM: stick-tables: Fix race with peers when killing a sticky session
When a sticky session is killed, we must be sure no other entity is still
referencing it. The session's ref_cnt must be 0. However, there is a race
with peers, as described in 21447b1dd4 ("BUG/MAJOR: stick-tables: fix race
with peers in entry expiration"). When the update lock is acquired, we must
recheck the ref_cnt value.
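The pattern, as a minimal model (illustrative names, not the actual code):
the ref_cnt is tested again once the lock is held, because a peer may have
taken a reference in between:

  #include <pthread.h>
  #include <stdatomic.h>

  struct sess_model { atomic_int ref_cnt; };

  static pthread_rwlock_t kill_lock = PTHREAD_RWLOCK_INITIALIZER;

  static int kill_if_unused(struct sess_model *s)
  {
      pthread_rwlock_wrlock(&kill_lock);
      if (atomic_load(&s->ref_cnt) != 0) {  /* recheck under the lock */
          pthread_rwlock_unlock(&kill_lock);
          return 0;                         /* still referenced: keep it */
      }
      /* safe to unlink and free the session here */
      pthread_rwlock_unlock(&kill_lock);
      return 1;
  }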
This patch is part of a debugging session about issue #2552. It must be
backported to 2.9.
BUG/MEDIUM: stick-tables: Fix race with peers when trashing oldest entries
It is the same as the one fixed in process_table_expire() (21447b1dd4
["BUG/MAJOR: stick-tables: fix race with peers in entry expiration"]). In
stktable_trash_oldest(), when the update lock is acquired, we must take care
to check the ref_cnt again because some peers may increment it (see the
commit above for details).
This patch fixes a crash mentioned in 2552#issuecomment-2110532706. It must
be backported to 2.9.
Willy Tarreau [Fri, 24 May 2024 09:04:30 +0000 (11:04 +0200)]
BUILD: quic: fix unused variable warning when threads are disabled
The tree variable was introduced in 3.0 by commit dd58dff1e6
("BUG/MEDIUM: quic: QUIC CID removed from tree without locking") which
was marked for backport. The variable is only used for locks.
Let's just mark the variable __maybe_unused for when the code is
built without threads.
The patch above was marked for backport to 2.7 so this should be
backported wherever the fix was backported.
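For illustration (assuming __maybe_unused expands to
__attribute__((unused)), as is common), the pattern looks like this:

  #define __maybe_unused __attribute__((unused))

  static void remove_cid_model(void)
  {
      int tree_lock __maybe_unused = 0; /* consumed only by lock macros */
  #ifdef USE_THREAD
      (void)tree_lock;                  /* locking code would use it here */
  #endif
  }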
Willy Tarreau [Fri, 24 May 2024 07:46:49 +0000 (09:46 +0200)]
MINOR: config: add thread-hard-limit to set an upper bound to nbthread
On today's large systems, it's not always desired to run on all threads
for light loads, and users usually enforce nbthread to a lower value
(e.g. 8). The problem is that this is a fixed value, and moving such
configs to smaller machines continues to enforce the value, which
becomes extremely unproductive due to having more threads than CPUs.
This also happens quite a bit in VMs, containers, or cloud instances
of various sizes.
This commit introduces the thread-hard-limit setting that only sets an
upper bound on the number of threads, without raising a lower value.
This means that using "thread-hard-limit 8" will make sure that no more
than 8 threads will be used when available, but it will remain at two
when run on a dual-core machine.
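The intended semantics can be sketched as follows (illustrative code, not
the actual parser):

  /* thread-hard-limit caps the automatic thread count, never raises it */
  static int compute_nbthread(int cpus_available, int hard_limit)
  {
      int nbthread = cpus_available;  /* default: one thread per CPU */

      if (hard_limit > 0 && nbthread > hard_limit)
          nbthread = hard_limit;      /* cap on large machines */
      return nbthread;                /* stays low on small machines */
  }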
MINOR: mux-quic: Set abort info for SC-less QCS on STOP_SENDING frame
It is a revert of cc9827bb09 ("BUG/MEDIUM: mux-quic: fix crash on
STOP_SENDING received without SD"). This fix was based on the wrong
assumption that QUIC streams may have no stream-endpoint descriptor.
However, this must never happen, and it has since been fixed. So we can
now safely revert the commit above. It is not a bugfix though because,
for now, the abort info is only used by the upper layer. So it is not a
big deal to not set it when there is no SC.
BUG/MEDIUM: mux-quic: Create sedesc in same time of the QUIC stream
Recent changes to save abort reason revealed an issue during the QUIC stream
creation. Indeed, by design, when a mux stream is created, it must always
have a valid stream-endpoint descriptor and it must remain valid till the
mux stream destruction. On frontend side, it is the multiplexer
responsibility to create it and set it as orphan. On the backend side, the
sedesc is provided by the upper layer. It is the sedesc of the back
stream-connector.
For the QUIC multiplexer, the stream-endpoint descriptor was only created
when the stream-connector was created and attached to it. This is
unexpected, and some bugs may be introduced because there is no valid
sedesc on a QUIC stream. Indeed, a recent bug was introduced for this
very reason.
Willy Tarreau [Wed, 22 May 2024 13:56:27 +0000 (15:56 +0200)]
BUG/MEDIUM: quic_tls: prevent LibreSSL < 4.0 from negotiating CHACHA20_POLY1305
As diagnosed in GH issue #2569, there's currently an issue in LibreSSL's
CHACHA20 in-place implementation that makes haproxy discard incoming QUIC
packets encrypted with it. It's not very easy to observe the issue because:
- QUIC recommends that CHACHA20 is used in priority
- on x86 with AES-NI, LibreSSL prefers AES-GCM for performance
reasons, so the problem is only observed there if a client
explicitly forces TLS_CHACHA20_POLY1305_SHA256 only.
- discarded packets cause retransmits showing some apparent activity,
and the handshake succeeds so it's not easy to analyze from the
client which thinks that the server is slow to respond.
Thus in practice, on non-x86 machines running LibreSSL, requests made over
QUIC freeze for a long time, unless the client explicitly forces algos
excluding TLS_CHACHA20_POLY1305_SHA256. That's typically the case by
default on modern OpenBSD systems, and was reported in the issue above
for an arm64 machine running OpenBSD -current, and was also observed on a
mips64 one running OpenBSD 7.5.
There is no simple solution to this problem due to some of the protocol's
constraints without digging too low into the stack (and risking to break
more). Here we're taking a pragmatic approach consisting in making the
connection fail hard when TLS_CHACHA20_POLY1305_SHA256 is selected,
regardless of the availability of other ciphers. This means that every
time a connection would have hung, instead it will fail fast, allowing
the client to retry over TLS/TCP.
Theo Buehler recommends that we limit this protection to all LibreSSL
versions before 4.0 since it's where the fix will be implemented. Older
stable versions will just see TLS_CHACHA20_POLY1305_SHA256 disabled,
which should be sufficient to make QUIC work there again as well.
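The spirit of the workaround can be sketched like this (illustrative, not
the actual patch; the version check and abort path are simplified):

  #include <openssl/ssl.h>

  /* Return non-zero when the negotiated cipher is known to be broken
   * (TLS_CHACHA20_POLY1305_SHA256 on LibreSSL < 4.0), so the caller can
   * fail the handshake hard and let the client retry over TLS/TCP.
   */
  static int quic_cipher_is_broken(SSL *ssl)
  {
  #if defined(LIBRESSL_VERSION_NUMBER) && (LIBRESSL_VERSION_NUMBER < 0x4000000fL)
      const SSL_CIPHER *cipher = SSL_get_current_cipher(ssl);

      if (cipher &&
          SSL_CIPHER_get_id(cipher) == TLS1_3_CK_CHACHA20_POLY1305_SHA256)
          return 1;
  #else
      (void)ssl;
  #endif
      return 0;
  }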
The following config is sufficient to reproduce the issue (on a non-x86
machine, both arm64 & mips64 were confirmed to reproduce it):
And the following commands will trigger the problem on affected LibreSSL
versions:
curl --tls13-ciphers TLS_CHACHA20_POLY1305_SHA256 -v --http3 -k https://127.0.0.1:8443/
curl -v --http3 -k https://127.0.0.1:8443/
while these ones must work:
curl --tls13-ciphers TLS_AES_128_GCM_SHA256 -v --http3 -k https://127.0.0.1:8443/
curl --tls13-ciphers TLS_AES_256_GCM_SHA384 -v --http3 -k https://127.0.0.1:8443/
Normally all of them will work with LibreSSL 4, and only the first one
should fail with stable LibreSSL versions higher than 3.9.2. An haproxy
version without this workaround will show an unresponsive command after
the GET is sent, while a version with the workaround will close the
connection on error. On a version with this workaround, if TCP listeners
are uncommented, curl will automatically fall back to TCP and attempt
the request again over HTTP/2. Finally, on OpenSSL 1.1.1 in compat mode
(hence the limited-quic option above) all of them must work.
Many thanks to github user @lgv5 for the detailed report, tests, and
for spotting the issue, and to @botovq (Theo Buehler) for the quick
analysis, patch and help on this workaround.
This needs to be backported to versions 2.6 and above.
BUG/MAJOR: quic: Crash with TLS_AES_128_CCM_SHA256 (libressl only)
LibreSSL's TLS stack (at least version 3.9.0) does not behave like other
stacks such as quictls, which make SSL_do_handshake() return an error when
no cipher could be negotiated, in addition to emitting a TLS alert (0x28).
This is the case when TLS_AES_128_CCM_SHA256 is forced as the TLS 1.3
cipher from the client side. This makes haproxy enter a code path which
leads to a crash as follows:
[Switching to Thread 0x7ffff76b9640 (LWP 23902)]
0x0000000000487627 in quic_tls_key_update (qc=qc@entry=0x7ffff00371f0) at src/quic_tls.c:910
910 struct quic_kp_trace kp_trace = {
(gdb) list
905 {
906 struct quic_tls_ctx *tls_ctx = &qc->ael->tls_ctx;
907 struct quic_tls_secrets *rx = &tls_ctx->rx;
908 struct quic_tls_secrets *tx = &tls_ctx->tx;
909 /* Used only for the traces */
910 struct quic_kp_trace kp_trace = {
911 .rx_sec = rx->secret,
912 .rx_seclen = rx->secretlen,
913 .tx_sec = tx->secret,
914 .tx_seclen = tx->secretlen,
(gdb) p qc
$1 = (struct quic_conn *) 0x7ffff00371f0
(gdb) p qc->ael
$2 = (struct quic_enc_level *) 0x0
(gdb) bt
#0 0x0000000000487627 in quic_tls_key_update (qc=qc@entry=0x7ffff00371f0) at src/quic_tls.c:910
#1 0x000000000049bca9 in qc_ssl_provide_quic_data (len=268, data=<optimized out>, ctx=0x7ffff0047f80, level=<optimized out>, ncbuf=<optimized out>) at src/quic_ssl.c:617
#2 qc_ssl_provide_all_quic_data (qc=qc@entry=0x7ffff00371f0, ctx=0x7ffff0047f80) at src/quic_ssl.c:688
#3 0x00000000004683a7 in quic_conn_io_cb (t=0x7ffff0047f30, context=0x7ffff00371f0, state=<optimized out>) at src/quic_conn.c:760
#4 0x000000000063cd9c in run_tasks_from_lists (budgets=budgets@entry=0x7ffff76961f0) at src/task.c:596
#5 0x000000000063d934 in process_runnable_tasks () at src/task.c:876
#6 0x0000000000600508 in run_poll_loop () at src/haproxy.c:3073
#7 0x0000000000600b67 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3287
#8 0x00007ffff7f6ae45 in start_thread () from /lib64/libpthread.so.0
#9 0x00007ffff78254af in clone () from /lib64/libc.so.6
When a TLS alert is emitted, haproxy calls quic_set_connection_close(),
which sets the QUIC_FL_CONN_IMMEDIATE_CLOSE connection flag. It is this
flag that is tested by this patch to make the handshake fail even if
SSL_do_handshake() does not return an error. This test is specific to
LibreSSL and is never run with other TLS stacks.
Thank you to @lgv5 and @botovq for having reported this issue in GH #2569.
Set the stream_err value to SF_ERR_NONE if the obtained socket fd has
passed all common runtime and configuration-related checks.
The '.connect()' method implementations in higher protocol layers require
the Stream Error Flag as the return value. So, at the socket layer, we need
to pass sock_create_server_socket() a variable in which to set this flag,
because syscall and socket option checks are convenient to perform at the
socket layer.
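A minimal model of the resulting API shape (names follow the log, the
logic is illustrative):

  #include <sys/socket.h>

  enum stream_err_model { SF_ERR_NONE = 0, SF_ERR_RESOURCE };

  static int create_server_socket_model(int family,
                                        enum stream_err_model *serr)
  {
      int fd = socket(family, SOCK_STREAM, 0);

      if (fd < 0) {
          *serr = SF_ERR_RESOURCE; /* syscall failed at the socket layer */
          return -1;
      }
      /* common runtime and configuration checks would go here */
      *serr = SF_ERR_NONE;         /* fd passed all common checks */
      return fd;
  }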
Willy Tarreau [Wed, 22 May 2024 09:12:32 +0000 (11:12 +0200)]
MINOR: traces: enumerate the list of levels/verbosities when not found
It's quite frustrating, particularly on the command line, not to have
access to the list of available levels and verbosities when one does
not exist for a given source, because there's no easy way to find them
except by starting without and connecting to the CLI. Let's enumerate
the list of supported levels and verbosities when a name does not match.
For example:
$ ./haproxy -db -f quic-repro.cfg -dt h2:help
[NOTICE] (9602) : haproxy version is 3.0-dev12-60496e-27
[NOTICE] (9602) : path to executable is ./haproxy
[ALERT] (9602) : -dt: no such trace level 'help', available levels are 'error', 'user', 'proto', 'state', 'data', and 'developer'.
$ ./haproxy -db -f quic-repro.cfg -dt h2:user:help
[NOTICE] (9604) : haproxy version is 3.0-dev12-60496e-27
[NOTICE] (9604) : path to executable is ./haproxy
[ALERT] (9604) : -dt: no such trace verbosity 'help' for source 'h2', available verbosities for this source are: 'quiet', 'clean', 'minimal', 'simple', 'advanced', and 'complete'.
The same is done for the CLI where the existing help message is always
displayed when entering an invalid verbosity or level.
Amaury Denoyelle [Thu, 16 May 2024 15:46:36 +0000 (17:46 +0200)]
MINOR: connection: support PROXY v2 TLV emission without stream
Update the API for PROXY protocol header encoding. Previously, it required
the stream parameter to be set. Change make_proxy_line() and associated
functions to add an extra session parameter. This is useful in context
where no stream is instantiated. For example, this is the case for rhttp
preconnect.
This change allows extending PROXY v2 TLV encoding. Replace
build_logline(), which requires a stream instance, with a direct call to
sess_build_logline().
Note that stream parameter is kept as it is necessary for unique ID
encoding.
This patch brings no functional change for standard connections. However,
it is necessary to support TLV encoding on rhttp preconnect.
Amaury Denoyelle [Tue, 14 May 2024 14:34:49 +0000 (16:34 +0200)]
MINOR: rhttp: support PROXY emission on preconnect
Extend preconnect to support PROXY protocol emission. Code is duplicated
from connect_server() into new_reverse_conn(). This is necessary to
support send-proxy on a server line used for rhttp.
MEDIUM: rhttp: create session for active preconnect
Modify rhttp preconnect by instantiating a new session for each
connection attempt. The connection is thus linked to a session directly
at instantiation time, contrary to before, where no session existed until
listener_accept().
This patch will allow extending rhttp usage. Most notably, it will be
useful to use various sample fetches on the server line and extend
logging capabilities.
Changes are minimal, yet the consequences are considered non-trivial, as
for the first time a FE connection session is instantiated before
listener_accept(). This requires an extra explicit check in
session_accept_fd() to not overwrite an existing session. Also, flag
SESS_FL_RELEASE_LI is not set immediately, as listener counters must not
be decremented if the connection and its session are freed before reversal
is completed, or else listener counters will be invalid.
conn_session_free() is used as connection destroy callback to ensure the
session will be freed automatically on connection release.
Amaury Denoyelle [Tue, 21 May 2024 14:44:26 +0000 (16:44 +0200)]
MINOR: session: define flag to explicitly release listener on free
When a session is allocated for a FE connection, session_free() is
responsible to call listener_release() to decrement listener connection
counters and resume listening.
Until now, the <listener> member of the session was tested inside
session_free() before invoking listener_release(). To highlight more
explicitly the relation between sessions and listeners, introduce a new
flag SESS_FL_RELEASE_LI. Only sessions with this flag set will invoke
listener_release() on their cleanup. The flag is set inside
session_accept_fd() on success.
This patch has no functional change. However, it will be useful to
implement session creation for rHTTP preconnect.
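A minimal model of the flag's use (illustrative values and names):

  #define SESS_FL_RELEASE_LI 0x0001 /* illustrative flag value */

  struct listener_model { int nbconn; };
  struct session_model  { unsigned int flags; struct listener_model *li; };

  static void listener_release_model(struct listener_model *l)
  {
      l->nbconn--; /* decrement counters and resume listening */
  }

  static void session_free_model(struct session_model *sess)
  {
      /* only flagged sessions touch their listener; rhttp preconnect
       * sessions leave the flag unset until reversal completes */
      if (sess->flags & SESS_FL_RELEASE_LI)
          listener_release_model(sess->li);
  }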
Amaury Denoyelle [Thu, 16 May 2024 15:37:46 +0000 (17:37 +0200)]
BUG/MINOR: rhttp: fix task_wakeup state
TASK_WOKEN_ANY was incorrectly used as an argument to task_wakeup() for
the rhttp preconnect task. This value is a mask of all wakeup flags, not
an individual one. Replace it with proper individual values. This is
labelled as a bug but it has no known impact.
Amaury Denoyelle [Thu, 16 May 2024 15:35:31 +0000 (17:35 +0200)]
BUG/MINOR: rhttp: prevent listener suspend
Ensure "disable frontend" on a reverse HTTP listener is forbidden by
returing -1 on suspend callback. Suspending such a listener has unknown
effect and so is not properly implemented for now.
Amaury Denoyelle [Tue, 21 May 2024 14:35:28 +0000 (16:35 +0200)]
BUG/MEDIUM: rhttp: fix preconnect on single-thread
On initialization of an rhttp bind, the first thread available on the
listener is selected to execute the first occurrence of the preconnect
task.
This thread selection was incorrect as it used my_ffsl(), which returns
values indexed from 1, contrary to tids, which are indexed from 0. This
caused the first listener thread to be skipped in favor of the second
one. Worse, if haproxy runs in single-thread mode, the calculated thread
ID will be invalid and the task will never run, which prevents any
preconnect execution.
Fix this by subtracting 1 from the result of my_ffsl() to have a value
indexed from 0.
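The off-by-one in a nutshell (a model using the GCC builtin; my_ffsl()
behaves like ffs(), numbering bits from 1):

  static int first_thread_id(unsigned long thread_mask)
  {
      /* mask 0x1 -> ffsl() == 1 -> tid 0; without the -1, single-thread
       * mode computed tid 1, which does not exist */
      return __builtin_ffsl((long)thread_mask) - 1;
  }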
Amaury Denoyelle [Tue, 21 May 2024 09:00:37 +0000 (11:00 +0200)]
BUG/MINOR: server: free PROXY v2 TLVs on srv drop
Dynamically allocated server PROXY TLVs were not freed on server
release. This patch fixes the leak by extending srv_free_params().
Every server line with the set-proxy-v2-tlv-fmt keyword is impacted.
For static servers, the issue is minimal as it will only cause a leak on
deinit(). However, it can be aggravated when performing multiple
removals of dynamic servers.
Amaury Denoyelle [Tue, 14 May 2024 14:36:59 +0000 (16:36 +0200)]
BUG/MINOR: connection: parse PROXY TLV for LOCAL mode
conn_recv_proxy() is responsible for parsing the PROXY protocol header.
For v2 of the protocol, TLV parsing is implemented. However, this step
was only done inside the 'PROXY' command label. TLVs were never extracted
in 'LOCAL' command mode.
Fix this by moving the TLV parsing loop outside of the switch case. Of
notable importance, tlv_offset is updated in the LOCAL label to point to
the first TLV location.
This bug should be backported up to 2.9 at least. It should even
probably be backported to every stable versions. Note however that this
code has changed much over time. It may be useful to use option
'--ignore-all-space' to have a clearer overview of the git diff.
We decided to spend some time to refactor and rationalize the SPOE for
3.1. Thus there is no reason to still consider it deprecated for
3.0. Compatibility between both versions will be maintained.
BUG/MINOR: http-ana: Don't crush stream termination condition on internal error
When an internal error is reported from an HTTP analyzer, we must take care
not to set the stream termination condition if it was already set. For
instance, this happens when a message rewrite fails. In this case
SF_ERR_PXCOND is set by the rule. The HTTP analyzer must not crush it with
SF_ERR_INTERNAL.
The regression was introduced with the commit 0fd25514d6 ("MEDIUM: http-ana:
Set termination state before returning haproxy response").
The bug was discovered while working on issue #2568. It must be backported
to 2.9.
To improve the readability of sock_handle_system_err(), let's explicitly
set conn->err_code to CO_ER_SOCK_ERR in case of EPERM (which could be
returned by the setns syscall).
BUG/MEDIUM: proto: fix fd leak in <proto>_connect_server
This fixes the fd leak introduced in commit d3fc982cd788
("MEDIUM: proto: make common fd checks in sock_create_server_socket").
Initially sock_create_server_socket() was designed to return only the
created socket FD or -1. Its callers in upper protocol layers were required
to test the returned errno and then to apply different configuration-related
checks to the obtained positive sock_fd. A lot of this code was duplicated
among protocol implementations.
The new refactored version of sock_create_server_socket() gathers all the
duplicated checks in one place, but in order to be compliant with upper
protocol layers, it needs a 3rd parameter, 'stream_err', in which it sets
the Stream Error Flag for upper levels if the obtained sock_fd has passed
all additional checks.
No backport needed since this was introduced in 3.0-dev10.
MEDIUM: ssl: don't load file by discovering them in crt-store
In commit 55e9e9591 ("MEDIUM: ssl: temporarily load files by detecting
their presence in crt-store"), ssl_sock_load_pem_into_ckch() was
replaced by ssl_sock_load_files_into_ckch() in the crt-store loading.
But the side effect was that we always try to autodetect, and this is
not what we want. This patch reverses this, and adds specific code to the
crt-list loading, so we can autodetect in a crt-list as was done before,
but still try to load files when a crt-store filename keyword is
specified.
Thanks to ("MINOR: tools: add vma_set_name() helper"), set a name hint
for large arrays created by fd api (fdtab arrays and so on) so that
that they can be easily identified in /proc/<pid>/maps.
Depending on malloc() implementation, such memory areas will normally be
merged on the heap under MMAP_THRESHOLD (128 kB by default) and will
have a dedicated memory area once the threshold is exceeded. As such, when
large enough, they will appear like this in /proc/<pid>/maps:
DEBUG: errors: add name hint for startup-logs memory area
Thanks to ("MINOR: tools: add vma_set_name() helper"), set a name hint
for startup-logs ring's memory area created using mmap() so it can be
easily indentified in /proc/<pid>/maps.
DEBUG: pollers: add name hint for large memory areas used by pollers
Thanks to ("MINOR: tools: add vma_set_name() helper"), set a name hint
for large memory areas allocated by pollers upon init so that they can
be easily indentified in /proc/<pid>/maps.
For now, only linux-compatible pollers are considered since vma_set_name()
requires a recent linux kernel (>= 5.17).
Depending on malloc() implementation, such memory areas will normally be
merged on the heap under MMAP_THRESHOLD (128 kB by default) and will
have a dedicated memory area once the threshold is exceeded. As such, when
large enough, they will appear like this in /proc/<pid>/maps:
DEBUG: sink: add name hint for memory area used by memory-backed sinks
Thanks to ("MINOR: tools: add vma_set_name() helper"), set a name hint
for user created memory-backed sinks (ring sections without backing-file)
so that they can be easily indentified in /proc/<pid>/maps.
Depending on malloc() implementation, such memory areas will normally be
merged on the heap under MMAP_THRESHOLD (128 kB by default) and will
have a dedicated memory area once the threshold is exceeded. As such, when
large enough, they will appear like this in /proc/<pid>/maps:
DEBUG: shctx: name shared memory using vma_set_name()
In 98d22f212 ("MEDIUM: shctx: Naming shared memory context"), David
implemented prctl/PR_SET_VMA support to give a name to shctx maps when
supported. Maps were named after "HAProxy $name". It turns out that it
is not relevant to include "HAProxy" in the map name, given that we're
already looking at maps for a given PID (and here it's HAProxy's pid).
Instead, let's name shctx maps by making use of the new vma_set_name()
helper introduced by the previous commit. Resulting maps will be named
"shctx:$name" (e.g. "shctx:globalCache"); they will appear like this in
/proc/<pid>/maps:
MINOR: tools: add vma_set_name() helper
Following David Carlier's work in 98d22f21 ("MEDIUM: shctx: Naming shared
memory context"), let's provide a helper function to set a name hint on
a virtual memory area (i.e. an anonymous map created using mmap(), or a
memory area returned by malloc()).
Naming will only occur if available, and naming errors will be ignored.
The function takes mandatory <type> and <name> parameters to build the
map name as follows: "type:name". When looking at /proc/<pid>/maps, VMAs
named using this helper function will show up this way (provided that
the kernel has prctl support for PR_SET_VMA_ANON_NAME):
For example, using mmap + MAP_SHARED|MAP_ANONYMOUS:
  7364c4fff000-736508000000 rw-s 00000000 00:01 3540   [anon_shmem:type:name]
Another example, using mmap + MAP_PRIVATE|MAP_ANONYMOUS or using
glibc/malloc() above MMAP_THRESHOLD:
  7364c4fff000-736508000000 rw-p 00000000 00:00 0      [anon:type:name]
BUG/MINOR: ring: free ring's allocated area not ring's usable area when using maps
Since 40d1c84bf0 ("BUG/MAJOR: ring: free the ring storage not the ring
itself when using maps"), the munmap() call for the startup_logs ring and
file-backed rings fails (EINVAL) and causes memory leaks during
process cleanup.
munmap() fails because it is called with the ring's usable-area pointer,
which is an offset from the underlying original memory block allocated
using mmap(). Indeed, the ring_area() helper function was misused because
it didn't explicitly mention that the returned address corresponds to
the usable storage's area, not the allocated one.
To fix the issue, we add an explicit ring_allocated_area() helper to
return the allocated area for the ring, just like we already have
ring_allocated_size() for the allocated size, and we properly use both
the allocated size and allocated area to manipulate them using munmap()
and msync().
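A minimal model of the distinction (illustrative struct, not the actual
ring layout):

  #include <sys/mman.h>

  struct ring_model {
      void  *alloc_area;  /* address returned by mmap() */
      size_t alloc_size;  /* size passed to mmap() */
      void  *usable_area; /* alloc_area + header offset: NOT for munmap() */
  };

  static void ring_free_model(struct ring_model *r)
  {
      /* passing usable_area here fails with EINVAL and leaks the map */
      munmap(r->alloc_area, r->alloc_size);
  }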
Willy Tarreau [Sat, 18 May 2024 14:51:23 +0000 (16:51 +0200)]
[RELEASE] Released version 3.0-dev12
Released version 3.0-dev12 with the following main changes :
- CI: drop asan.log umbrella completely
- BUG/MINOR: log: fix leak in add_sample_to_logformat_list() error path
- BUG/MINOR: log: smp_rgs array issues with inherited global log directives
- MINOR: rhttp: Don't require SSL when attach-srv name parsing
- REGTESTS: ssl: be more verbose with ocsp_compat_check.vtc
- DOC: Update UUID references to RFC 9562
- MINOR: hlua: add hlua_nb_instruction getter
- MEDIUM: hlua: take nbthread into account in hlua_get_nb_instruction()
- BUG/MEDIUM: server: clear purgeable conns before server deletion
- BUG/MINOR: mux-quic: fix error code on shutdown for non HTTP/3
- BUG/MINOR: qpack: fix error code reported on QPACK decoding failure
- BUG/MEDIUM: htx: mark htx_sl as packed since it may be realigned
- BUG/MEDIUM: stick-tables: properly mark stktable_data as packed
- SCRIPTS: run-regtests: fix a few occurrences of extended regexes
- BUG/MINOR: ssl_sock: fix xprt_set_used() to properly clear the TASK_F_USR1 bit
- MINOR: dynbuf: provide a b_dequeue() variant for multi-thread
- BUG/MEDIUM: muxes: enforce buf_wait check in takeover()
- BUG/MINOR: h1: Check authority for non-CONNECT methods only if a scheme is found
- BUG/MEDIUM: h1: Reject CONNECT request if the target has a scheme
- BUG/MAJOR: h1: Be stricter on request target validation during message parsing
- MINOR: qpack: prepare error renaming
- MINOR: h3/qpack: adjust naming for errors
- MINOR: h3: adjust error reporting on sending
- MINOR: h3: adjust error reporting on receive
- MINOR: mux-quic: support glitches
- MINOR: h3: report glitch on RFC violation
- BUILD: stick-tables: better mark the stktable_data as 32-bit aligned
- MINOR: ssl: rename tune.ssl.ocsp-update.mode in ocsp-update.mode
- REGTESTS: update the ocsp-update tests
- BUILD: stats: remove non portable getline() usage
- MEDIUM: ssl: add ocsp-update.mindelay and ocsp-update.maxdelay
- BUILD: log: get rid of non-portable strnlen() func
- BUG/MEDIUM: fd: prevent memory waste in fdtab array
- CLEANUP: compat: make the MIN/MAX macros more reliable
- Revert "MEDIUM: evports: permit to report multiple events at once"
- BUG/MINOR: stats: Don't state the 303 redirect response is chunked
- MINOR: mux-h1: Add a flag to ignore the request payload
- REORG: mux-h1: Group H1S_F_BODYLESS_* flags
- CLEANUP: mux-h1: Remove unused H1S_F_ERROR_MASK mask value
- MEDIUM: mux-h1: Support C-L/T-E header suppressions when sending messages
- MINOR: ssl: ckch_store_new_load_files_conf() loads filenames from ckch_conf
- MEDIUM: ssl/crtlist: loading crt-store keywords from a crt-list
- CLEANUP: ssl/ocsp: remove the deprecated parsing code for "ocsp-update"
- MINOR: ssl: pass ckch_store instead of ckch_data to ssl_sock_load_ocsp()
- MEDIUM: ssl: ckch_conf_parse() uses -1/0/1 for off/default/on
- MINOR: ssl: handle PARSE_TYPE_INT and PARSE_TYPE_ONOFF in ckch_store_load_files()
- MINOR: ssl/ocsp: use 'ocsp-update' in crt-store
- MINOR: ssl: ckch_conf_clean() utility function for ckch_conf
- MEDIUM: ssl: add ocsp-update.disable global option
- MEDIUM: ssl/cli: handle crt-store keywords in crt-list over the CLI
- MINOR: ssl: ckch_conf_cmp() compare multiple ckch_conf structures
- MEDIUM: ssl: temporarily load files by detecting their presence in crt-store
- REGTESTS: ocsp-update: change the reg-test to support the new crt-store mode
- DOC: capabilities: fix chapter header rendering
The header of a new management guide chapter, "13.1. Linux capabilities
support", is not rendered in HTML format in a proper way, because of missing
dots at the end of this chapter's number.
REGTESTS: ocsp-update: change the reg-test to support the new crt-store mode
Update the ocsp-update tests for the recent changes:
- Incompatibilities check string changed to match the crt-store one
- The "good configurations" are not good anymore because the
ckch_conf_cmp() does not compare anymore with a global value.
MEDIUM: ssl: temporarily load files by detecting their presence in crt-store
crt-store is meant to be stricter than your common crt argument on a
bind line, and is supposed to be a declarative format.
However, since 'ocsp-update' was migrated from ssl_conf to
ckch_conf, the .issuer file is not autodetected anymore when adding an
ocsp-update keyword in a crt-list file, which breaks backward
compatibility.
This patch is a quick fix that will disappear once we are able to be
strict on a crt-store and autodetect on a crt-list.
The ckch_conf_cmp() function allows comparing multiple ckch_conf
structures in order to check that multiple usages of the same crt in the
configuration use the same ckch_conf definition.
A crt-list allows the use of "crt-store" keywords that define a ckch_store,
which can lead to inconsistencies when a crt is called multiple times with
different parameters.
This function compares and dumps a list of differences in the err variable
to be output as an error.
The variant ckch_conf_cmp_empty() compares the ckch_conf structure to an
empty one, which is useful for bind lines, which are not able to have
crt-store keywords.
These functions are used when a crt-store is already initialized and we
need to verify whether the parameters are compatible.
ckch_conf_cmp() handles multiple cases:
- When the previous ckch_conf was declared with CKCH_CONF_SET_EMPTY, we
can't define any new keyword in the next initialisation
- When the previous ckch_conf was declared with keywords in a crtlist
(CKCH_CONF_SET_CRTLIST), the next initialisation must have the exact
same keywords.
- When the previous ckch_conf was declared in a "crt-store"
(CKCH_CONF_SET_CRTSTORE), the next initialisation could use no keyword
at all or the exact same keywords.
MEDIUM: ssl/cli: handle crt-store keywords in crt-list over the CLI
This patch adds crt-store keywords from the crt-list on the CLI.
- keywords from crt-store can be used over the CLI when inserting a
certificate in a crt-list
- keywords from crt-store are dumped when showing a crt-list content
over the CLI
The ckch_conf_kws.func function pointer needed a new "cli" parameter, in
order to differentiate loading that comes from the CLI from loading at
startup, as they don't behave the same. For example it must not try to
load a file from the filesystem when loading a crt-list line from the CLI.
dump_crtlist_sslconf() was renamed to dump_crtlist_conf() and takes a
new ckch_conf parameter in order to dump the relevant crt-store keywords.
MEDIUM: ssl: add ocsp-update.disable global option
This option allows disabling the ocsp-update completely.
To achieve this, the ocsp-update.mode global keyword no longer relies
on SSL_SOCK_OCSP_UPDATE_OFF during parsing to call
ssl_create_ocsp_update_task().
Instead, we will inherit the SSL_SOCK_OCSP_UPDATE_* value from
ocsp-update.mode for each certificate which does not specify its own
mode.
To disable the ocsp completely without editing all crt entries,
ocsp-update.disable is used instead of "ocsp-update.mode", which is now
only used as the default value for crt.
MINOR: ssl: pass ckch_store instead of ckch_data to ssl_sock_load_ocsp()
ssl_sock_put_ckch_into_ctx() and ssl_sock_load_ocsp() need to take a
ckch_store as argument. Indeed the ocsp_update_mode is no longer stored
in ckch_data, but in ckch_conf, which is part of the ckch_store.
This is a minor change, but the function definition had to change.
CLEANUP: ssl/ocsp: remove the deprecated parsing code for "ocsp-update"
Remove the "ocsp-update" keyword handling from the crt-list.
The code was made as an exception everywhere so we could activate the
ocsp-update for an individual certificate.
The feature will still exist but will be parsed as a "crt-store"
keyword which will still be usable in a "crt-list". This will appear in
future commits.
MEDIUM: ssl/crtlist: loading crt-store keywords from a crt-list
This patch allows the usage of "crt-store" keywords from a "crt-list".
The crtstore_parse_load() function was split into 2 functions, so the
keyword parsing is done in ckch_conf_parse().
With this patch, crts are loaded with ckch_store_new_load_files_conf() or
ckch_store_new_load_files_path() depending on whether or not there is a
"crt-store" keyword.
More checks need to be done on "crt" bind keywords to ensure that
keywords are compatible.
This patch does not introduce the feature on the CLI.
MINOR: ssl: ckch_store_new_load_files_conf() loads filenames from ckch_conf
ckch_store_new_load_files_conf() is the equivalent of
new_ckch_store_load_files_path() but instead of trying to find the files
using a base filename, it will load them from a list of files.
MEDIUM: mux-h1: Support C-L/T-E header suppressions when sending messages
During the 2.9 dev cycle, to be able to support zero-copy data forwarding, a
change in the H1 mux was performed to ignore header modifications about
payload representation (Content-Length and Transfer-Encoding headers).
It appears there are some use-cases where it could be handy to change the
values of these headers or just remove them. For instance, we can imagine
removing these headers from a server response to force the old HTTP/1.0
close-mode behavior. So thanks to this patch, the rules are relaxed. It is
now possible to remove these headers. When this happens, the following
rules are applied:
* If "Content-Length" header is removed but a "Transfer-Encoding: chunked"
header is found, no special processing is performed. The message remains
chunked. However the close mode is not forced.
* If "Transfer-Encoding" header is removed but a "Content-Length" header is
found, no special processing is performed. The payload length must comply
with the specified content length.
* If one of them is removed and the other one is not found, a response is
switched to close mode, and a "Content-Length: 0" header is forced on a
request.
With these rules, we best fit user expectations.
This patch depends on the following commit:
* MINOR: mux-h1: Add a flag to ignore the request payload
This patch should fix the issue #2536. It should be backported to 2.9
with the commit above.
CLEANUP: mux-h1: Remove unused H1S_F_ERROR_MASK mask value
This mask value is unused, so we can safely remove it. That's fortunate,
because its value was wrong. But there is no bug here, even in stable
versions, because it is not used in any version.
MINOR: mux-h1: Add a flag to ignore the request payload
There was a flag to skip the response payload on output, if any, by stating
it is bodyless. It is used for responses to HEAD requests or for 204/304
responses. This allows rewrites during analysis. For instance a HEAD
request can be rewritten into a GET request for any reason (e.g. a server
not supporting HEAD requests). In this case, the server will send a
response with a payload. On the frontend side, the payload will be skipped
and a valid response (without payload) will be sent to the client.
With this patch we introduce the corresponding flag for the request. It
will be used to skip the request payload. In addition, when the payload
must be skipped for a request or a response, the zero-copy data forwarding
is now disabled.
BUG/MINOR: stats: Don't state the 303 redirect response is chunked
Start-line flags for the 303-See-Other response returned by the stats
applet are not properly set. Indeed, the response has a "content-length"
header but both HTX_SL_F_CHNK and HTX_SL_F_CLEN flags are set. Because of
this bug, the response is considered as chunked. So, let's remove the
HTX_SL_F_CHNK flag. And also add the HTX_SL_F_BODYLESS flag because there
is no payload ("content-length" header is always set to 0).
This patch must be backported to all stable versions. On the 2.8 and lower
versions, the commit d0b04920d1 ("BUG/MINOR: htpp-ana/stats: Specify that
HTX redirect messages have a C-L header") must be backported first.
Willy Tarreau [Fri, 17 May 2024 13:53:40 +0000 (15:53 +0200)]
Revert "MEDIUM: evports: permit to report multiple events at once"
Tests have shown that switching nevlist to global.tune.maxpollevents
is totally unreliable when using evports, and that events seem to be
missed. A good reproducer seems to be QUIC. There are not enough
users of Solaris to warrant spending more time trying to get down to
this, and even the few that remain are by definition not interested
in performance, so let's just revert the commit that tried to lift the
value: e6662bf706 ("MEDIUM: evports: permit to report multiple events
at once").
Willy Tarreau [Fri, 17 May 2024 13:25:26 +0000 (15:25 +0200)]
CLEANUP: compat: make the MIN/MAX macros more reliable
After every release we say that MIN/MAX should be changed to be an
expression that only evaluates each operand once, and before every
version we forget to change it and we recheck that the code doesn't
misuse them. Let's fix them now.
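One common shape of the fix (a sketch, not necessarily the exact macro
retained): GNU statement expressions evaluate each operand exactly once,
so MIN(x++, y) no longer increments x twice:

  #undef MIN
  #define MIN(a, b) ({ typeof(a) _a = (a); \
                       typeof(b) _b = (b); \
                       _a < _b ? _a : _b; })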
BUG/MEDIUM: fd: prevent memory waste in fdtab array
In 97ea9c49f1 ("BUG/MEDIUM: fd: always align fdtab[] to 64 bytes"), the
patch doesn't do what the message says. The intent was only to align the
base fdtab address on 64 bytes so that all fdtab entries are aligned and
thus don't share the same cache line. For that, the fdtab pointer is
adjusted from the fdtab_addr (unaligned) address after it is allocated.
Thus, all we need is an extra 64 bytes in the fdtab_addr array for the
alignment. Because we use calloc() to perform the allocation, a dumb
mistake was made: the '+64' was added to the <size> calloc argument, which
means EACH fdtab entry is allocated with 64 extra bytes.
Given that a single fdtab entry is 64 bytes, since 97ea9c49f1 each fdtab
entry now takes 128 bytes! We doubled fdtab memory consumption.
To give you an idea, on my laptop, when looking at memory consumption
using 'ps -p `pidof haproxy` -o size' right after starting the haproxy
process with default settings (no maxsock enforced):
To fix this, use calloc() with an <nmemb> of 1 and manually provide the
size with <size> as we would do if we used malloc(). With this patch,
we're back to pre-97ea9c49f1 fdtab memory consumption (with 64 extra bytes
for the whole array, which is insignificant).
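The bug and the fix, reduced to a model (the struct is a stand-in for a
64-byte fdtab entry):

  #include <stdlib.h>

  struct entry_model { char data[64]; };

  static void *alloc_fdtab_model(size_t n)
  {
      /* buggy form: calloc(n, sizeof(struct entry_model) + 64) multiplies,
       * so EACH entry silently grew by 64 bytes (128 bytes per entry).
       * Fixed form: one block carrying the 64 alignment bytes once: */
      return calloc(1, n * sizeof(struct entry_model) + 64);
  }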
BUILD: log: get rid of non-portable strnlen() func
In c614fd3b9 ("MINOR: log: add +cbor encoding option"), I wrongly used
strnlen() without noticing that the function is not portable (requires
_POSIX_C_SOURCE >= 2008) and that it was the first occurrence in the
entire project. In fact it is not a hard requirement since it's a pretty
simple function. Thus to restore build compatibility with minimal/older
build systems, let's actually get rid of it and use an equivalent portable
code where needed (we cannot simply rely on strlen() because the string
might not be NULL terminated, we must take upstream len into account).
No backport needed (unless c614fd3b9 gets backported)
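The portable equivalent is indeed short; a sketch of what such a
replacement can look like:

  #include <stddef.h>

  /* like strnlen(): scan at most <maxlen> bytes, as the input string is
   * not guaranteed to be NUL-terminated */
  static size_t my_strnlen(const char *s, size_t maxlen)
  {
      size_t len = 0;

      while (len < maxlen && s[len])
          len++;
      return len;
  }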
Amaury Denoyelle [Fri, 17 May 2024 12:50:12 +0000 (14:50 +0200)]
BUILD: stats: remove non portable getline() usage
getline() was used to read the stats-file. However, this function is not
portable and may cause build issues on some systems. Replace it with the
standard fgets().
Amaury Denoyelle [Mon, 13 May 2024 15:45:38 +0000 (17:45 +0200)]
MINOR: h3: report glitch on RFC violation
Increment the glitch connection counter on every HTTP/3 or QPACK error
which is a violation of the specification. This could be useful to get
rid of bogus clients early.
Amaury Denoyelle [Mon, 13 May 2024 07:05:27 +0000 (09:05 +0200)]
MINOR: mux-quic: support glitches
Implement basic support for glitches on the QUIC multiplexer. This is
mostly identical to glitches for HTTP/2.
A new configuration option named tune.quic.frontend.glitches-threshold
is defined to limit the number of glitches on a connection before
closing it.
The glitch counter is incremented via qcc_report_glitch(). A new
qcc_app_ops callback <report_susp> is defined. When the threshold is
reached, it allows setting an application error code to close the
connection. For HTTP/3, the value H3_EXCESSIVE_LOAD is returned. If not
defined, the default code INTERNAL_ERROR is used.
For the moment, no glitches are reported for QUIC or HTTP/3 usage. This
will be added in future patches as needed.
Amaury Denoyelle [Mon, 13 May 2024 15:44:54 +0000 (17:44 +0200)]
MINOR: h3: adjust error reporting on receive
This commit is the second step to simplify HTTP/3 error management. This
time it deals with the receive side, in h3_rcv_buf().
The various internal HTTP/3 to HTX conversion functions do not set
H3_INTERNAL_ERROR on h3c err anymore. Only standard error codes are set.
For every error, both internal and protocol ones, a negative value is
returned. This ensures that the h3_rcv_buf() loop is interrupted. This
function will then set H3_INTERNAL_ERROR only if no standard error is
registered via h3c or h3s.
Along with the previous commit, this should better distinguish internal
errors from protocol ones caused by a faulty client.
Amaury Denoyelle [Mon, 13 May 2024 15:27:26 +0000 (17:27 +0200)]
MINOR: h3: adjust error reporting on sending
It's currently difficult to differentiate HTTP/3 standard protocol
violations from internal issues, which solely use the H3_INTERNAL_ERROR
code. This patch is the first step to simplify this. The objective is to
reduce the use of H3_INTERNAL_ERROR; the <err> field of h3c should be
reserved exclusively for other values.
Simplify error management in sending via h3_snd_buf(). The sending side is
straightforward as only internal errors can be encountered. Do not
manually set h3c.err to H3_INTERNAL_ERROR in the various HTX to HTTP/3
conversion functions. Instead, just return a negative value, which is
enough to break the h3_snd_buf() loop. H3_INTERNAL_ERROR is thus
positioned in a single location in this function for all sending
operations.
Amaury Denoyelle [Fri, 10 May 2024 16:12:15 +0000 (18:12 +0200)]
MINOR: h3/qpack: adjust naming for errors
Rename the enum values used for HTTP/3 and QPACK RFC-defined codes. The
first now uses the prefix H3_ERR_*, which serves as an identifier between
them. Also separate QPACK values into a new dedicated enum, qpack_err.
This is deemed cleaner.
Amaury Denoyelle [Fri, 10 May 2024 16:17:41 +0000 (18:17 +0200)]
MINOR: qpack: prepare error renaming
There are two distinct enums, both related to QPACK error management. The
first one is dedicated to RFC-defined codes. The other one is a set of
internal values returned by qpack_decode_fs(). There have been issues
discovered recently due to the confusion between them.
Rename the internal values with the prefix QPACK_RET_*. The older name
QPACK_ERR_* will be used in a future commit for the first enum.
BUG/MAJOR: h1: Be stricter on request target validation during message parsing
As stated in issue #2565, checks on the request target during H1 message
parsing are not good enough. Invalid paths, not starting with a slash, are
in fact parsed as authorities. The same error is repeated at the sample
fetch level. This last point is annoying because routing rules may be
fooled. It is also an issue when the URI or the Host header is updated.
Because the error is repeated at different places, it must be fixed. We
cannot be lax by arguing it is the server's job to accept or reject invalid
request targets. With this patch, we strengthen the checks performed on the
request target during H1 parsing. The idea is to reject invalid requests
at this step to be sure it is safe to manipulate the path or the authority
at other
places.
So now, the asterisk-form is only allowed for OPTIONS and OTHER methods.
This last point was added to not reject the H2 preface. In addition, we take
care to have only one asterisk and nothing more. For the CONNECT method, we
take care to have a valid authority-form. All other forms are rejected. The
authority-form is now only supported for the CONNECT method. No specific
check
is performed on the origin-form (except for the CONNECT method). For the
absolute-form, we take care to have a scheme and a valid authority.
These checks are not perfect but should be good enough to properly identify
each part of the request target for a relatively small cost. But, it is a
breaking change. Some requests are now rejected while they were not on
older versions. However, nowadays, it is most probably not an issue. If it
turns out it's really an issue for legitimate use-cases, an option would be
to support these kinds of requests when the "accept-invalid-http-request"
option is set, with the consequence of seeing some sample fetches having an
unexpected behavior.
This patch should fix the issue #2565. It MUST NOT be backported. First
because it is a breaking change. And then because by avoiding backporting
it, it remains possible to relax the parsing with the
"accept-invalid-http-request" option.
BUG/MINOR: h1: Check authority for non-CONNECT methods only if a scheme is found
When a non-CONNECT H1 request is parsed, the authority is compared to the
host header value, to validate that they are the same. However there is an
issue here when a relative path is used (not beginning with a '/'). In this
case, the path is considered as the authority and will be erroneously
compared to the host header value. It is observable with this kind of
request:
GET admin HTTP/1.1
Host: www.mysite.com
In this case "admin" is parsed as an authority while it is in fact a path.
At this step, it is not a big deal because it just happens on the very first
checks on the message during the parsing. However, the same happens when the
authority is updated. This will be fixed in another commit.
Note this kind of request is invalid because the path does not start with a
'/'. But, till now, HAProxy does not reject it.
This patch is related to issue #2565. It must be backported as far as 2.4.
Willy Tarreau [Tue, 14 May 2024 17:26:44 +0000 (19:26 +0200)]
BUG/MEDIUM: muxes: enforce buf_wait check in takeover()
The ->takeover() is quite tricky. It didn't take care of the possibility
that the original thread's connection handler had been woken up to handle
an event (e.g. read0), failed to get a buffer, registered against its own
thread's buffer_wait queue and left the connection in an idle state.
A new thread could then come by, perform a takeover(), and when a buffer
was available, the new thread's tasklet would be woken up by the old one
via *_buf_available(), causing all sort of problems. These problems are
easy to reproduce, by running with shared backend connections and few
buffers (tune.buffers.limit=20, 8 threads, 500 connections, transfer
64kB objects and wait 2-5s for a crash to appear).
A first estimated solution consisted in removing the connection from the
idle list but it turns out that it would be worse for the delete stuff
(the connection no longer appearing as idle, making it impossible to find
it in order to close it). Also, idle counts wouldn't match anymore the
list's state, and the special case of private connections could be
difficult to handle as the connection could be forcefully re-added to the
idle list after allocation despite being private.
After multiple attempts to address the problem in various ways, it appears
that the only reliable solution for now (without starting to turn many
lists to mt_lists) is to have the takeover() function handle the buf_wait
detection or unregistration itself:
- when doing a regular takeover aiming at finding an idle connection
for a new request, connections that are blocked in a buffer_wait
queue are quite rare and not interesting at all (since not immediately
usable), so skipping them is sufficient. For this we detect that the
desired connection belongs to a buffer_wait list by checking its
buf_wait.list element. Note that this check is *not* thread-safe! The
LIST_DEL_INIT() is performed by __offer_buffers() after the callback
was called. But this is sufficient as it is now because the only way
for the element to be seen as not in a list is after the element was
last touched by __offer_buffers(), so the situation for this connection
will not change in a different way later.
- when doing a server delete, we're running under thread isolation.
The connection might get taken over to be killed. The only trick is
that private connections not belonging to any idle list may also
experience this, and in this case even the idle_conns lock will not
offer any protection against anything. But since we're run under
thread isolation, we're certain not to compete with the other thread,
so it's safe to directly unregister the connection from its owner
thread. Normally this is already handled by conn_release() in
cli_parse_delete_server(), which calls mux->destroy(), but this would
actually update the current thread's queue instead of the origin
thread's, thus we do need to perform an explicit dequeue before
completing the takeover.
With this, the problem now looks solved for HTTP/1, HTTP/2 and FCGI,
though extensive tests were essentially run on HTTP/1 and HTTP/2.
While the problem has been there for a very long time, there should be
no reason to backport it since buffer_wait didn't practically work
before 3.0-dev and the process used to freeze hard very quickly before
we'd even have a chance to meet that race.
Willy Tarreau [Tue, 14 May 2024 17:19:23 +0000 (19:19 +0200)]
MINOR: dynbuf: provide a b_dequeue() variant for multi-thread
In order to forcefully unregister a buffer waiter during an inter-thread
takeover under isolation, we'll need the function to work not on th_ctx
but on the target thread's ctx instead. Let's implement this by
passing the target thread as an argument. Now b_dequeue() simply calls
this one with tid. That's OK since it's not on that critical a path,
especially since the list has been checked for existence before performing
the call.
Willy Tarreau [Tue, 14 May 2024 16:54:20 +0000 (18:54 +0200)]
BUG/MINOR: ssl_sock: fix xprt_set_used() to properly clear the TASK_F_USR1 bit
In 2.4-dev8 with commit 5c7086f6b0 ("MEDIUM: connection: protect idle
conn lists with locks"), the idle conns list started to be protected
using the lock for takeover, and the SSL layer used to always take
that lock. Later in 2.4-dev11, with commit 4149168255 ("MEDIUM: ssl:
implement xprt_set_used and xprt_set_idle to relax context checks"), we
decided to relax this lock using TASK_F_USR1 just as is done in muxes.
However the xprt_set_used() call, which is supposed to clear the flag,
visibly suffered from a copy-paste error and kept the OR operation instead
of
the AND, resulting in the flag never being released, so that SSL on the
backend continues to take the lock on each and every I/O access even when
the connection is not idle.
The effect is only a reduced performance. This could be backported, but
given the non-zero risk of triggering another bug somewhere, it would
be prudent to wait for this fix to be sufficiently tested in new
versions first.
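The bug in miniature (illustrative flag value):

  #define TASK_F_USR1 0x0001 /* illustrative value */

  static unsigned int xprt_set_used_model(unsigned int state)
  {
      /* the copy-pasted bug kept "state |= TASK_F_USR1;" here, so the
       * flag set by xprt_set_idle() was never cleared */
      state &= ~TASK_F_USR1; /* the fix: AND with the complement */
      return state;
  }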
Willy Tarreau [Wed, 15 May 2024 17:33:45 +0000 (19:33 +0200)]
SCRIPTS: run-regtests: fix a few occurrences of extended regexes
Running run-regtests on OpenBSD failed to identify haproxy version and
the various build options because the backslash is not recognized in
grep expressions. One must only use -E for the extended regexes and
not use the backslash.
Willy Tarreau [Wed, 15 May 2024 14:22:23 +0000 (16:22 +0200)]
BUG/MEDIUM: stick-tables: properly mark stktable_data as packed
The stktable_data union is made of types of varying sizes, and depending
on which types are stored in a table, some offsets might not necessarily
be aligned. This results in a bus error for certain regtests (e.g.
lb-services) on MIPS64. This bug may impact MIPS64, SPARC64 and armv7 when
accessing a 64-bit counter (e.g. bytes), depending on how the compiler
emitted the operation, and cause a trap that's emulated by the OS on RISCV
(at a heavy cost). x86_64 and armv8 are not affected at all.
Let's properly mark the struct with __attribute__((packed)) so that the
compiler emits the suitable unaligned-compatible instructions when
accessing the fields.
This should be backported to all versions where it applies.
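For illustration, the fix boils down to this kind of annotation (a model,
not the real union):

  /* packing drops the alignment requirement, so the compiler emits
   * unaligned-safe accesses on MIPS64/SPARC64/armv7 */
  union stktable_data_model {
      unsigned int       gpc0;  /* 32-bit counter */
      unsigned long long bytes; /* 64-bit counter */
  } __attribute__((packed));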
Willy Tarreau [Wed, 15 May 2024 16:23:18 +0000 (18:23 +0200)]
BUG/MEDIUM: htx: mark htx_sl as packed since it may be realigned
A test on MIPS64 revealed that the following reg tests would all
fail at the same place in htx_replace_stline() when updating
parts of the request line:
reg-tests/cache/if-modified-since.vtc
reg-tests/http-rules/h1or2_to_h1c.vtc
reg-tests/http-rules/http_after_response.vtc
reg-tests/http-rules/normalize_uri.vtc
reg-tests/http-rules/path_and_pathq.vtc
While the status line is normally aligned since it's the first block
of the HTX, it may become unaligned once replaced. The problem is, it
is a structure which contains some u16 and u32, and dereferencing them
on machines not natively supporting unaligned accesses makes them crash
or handle crap. Typically, MIPS/MIPS64/SPARC will crash, ARMv5 will
either crash or (more likely) return swapped values and do crap, and
RISCV will trap and turn to slow emulation.
We can assign the htx_sl struct the packed attribute, but then this
also causes the ints to fill the 2-byte gap before them, always causing
unaligned accesses for this part on such machines. The patch does a bit
better, by explicitly filling this two-byte hole, and packing the
struct.
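The resulting layout can be pictured as follows (a simplified sketch,
not the real htx_sl definition):
struct htx_sl_example {
	unsigned int   flags;
	unsigned short type;
	unsigned short fill; /* explicit filler for the former 2-byte hole */
	unsigned int   len;  /* would become unaligned without the filler */
} __attribute__((packed));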
Amaury Denoyelle [Mon, 13 May 2024 14:01:08 +0000 (16:01 +0200)]
BUG/MINOR: qpack: fix error code reported on QPACK decoding failure
qpack_decode_fs() is used to decode a QPACK field section during HTTP/3
headers parsing. Its return value is incoherent as it returns either
QPACK_DECOMPRESSION_FAILED, defined in RFC 9204, or other internal
values defined in qpack-dec.h. On failure, this return code is reused by
the HTTP/3 layer to be reported via a CONNECTION_CLOSE frame. This is
incorrect if an internal error value was reported, as it is not defined
by any specification.
Fix the return values of qpack_decode_fs() in two ways. First, replace
invalid usages of QPACK_DECOMPRESSION_FAILED, when the decoded content
is too large, with the correct internal error QPACK_ERR_TOO_LARGE.
Second, adjust the qpack_decode_fs() API to only return internal code
values. A new internal enum QPACK_ERR_DECOMP is defined to replace
QPACK_DECOMPRESSION_FAILED. The caller is responsible for converting it
to a suitable error value. For other internal values, H3_INTERNAL_ERROR
is used. This is done through a set of convert functions.
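Such a convert function may look like this (a hedged sketch; the
function name is made up, only the constants come from the description
above):
/* map an internal qpack-dec.h return code to an H3 error code */
static int h3_err_from_qpack(int err)
{
	if (err == QPACK_ERR_DECOMP)
		return QPACK_DECOMPRESSION_FAILED; /* RFC 9204 code */
	return H3_INTERNAL_ERROR; /* any other internal value */
}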
This should be backported up to 2.6. Note that trailers are not
supported in 2.6, so the chunk related to h3_trailers_to_htx() can be
safely skipped.
Amaury Denoyelle [Mon, 13 May 2024 07:02:47 +0000 (09:02 +0200)]
BUG/MINOR: mux-quic: fix error code on shutdown for non HTTP/3
qcc_shutdown() is called whenever the connection must be closed. If the
application protocol defined its own shutdown callback, it is invoked to
use the correct error code. Otherwise, the transport error code NO_ERROR
is used.
A bug occurs in the latter case as NO_ERROR is used with quic_err_app(),
which is reserved for application error codes. This triggers the
emission of a CONNECTION_CLOSE of type 0x1d (Application) instead of
0x1c (Transport).
This bug is considered minor as it does not impact QUIC with HTTP/3. It
may only be visible when using experimental HTTP/0.9 protocol.
This should be backported up to 2.6. For 2.6, the patch must be
completely rewritten due to code differences. Here is the change to
apply:
diff --git a/src/mux_quic.c b/src/mux_quic.c
index 26fb70ddf..c48f82e27 100644
--- a/src/mux_quic.c
+++ b/src/mux_quic.c
@@ -1918,7 +1918,9 @@ static void qc_release(struct qcc *qcc)
 		qc_send(qcc);
 	}
 	else {
-		qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+		/* Duplicate from qcc_emit_cc_app() for Transport error code. */
+		if (!(qcc->conn->handle.qc->flags & QUIC_FL_CONN_IMMEDIATE_CLOSE))
+			qcc->conn->handle.qc->err = quic_err_transport(QC_ERR_NO_ERROR);
 	}
 }
Amaury Denoyelle [Wed, 15 May 2024 12:28:21 +0000 (14:28 +0200)]
BUG/MEDIUM: server: clear purgeable conns before server deletion
Since the following commit, idle connections are cleared before a server
is deleted. This is better than blocking server deletion due to inactive
connections:
A BUG_ON() has been added to ensure that the server idle conn counter
is null after these connections are removed. However, Willy managed to
trigger it easily by repeatedly and randomly deleting servers across a
single-thread haproxy using a server-template with 1000 instances. In
parallel, an h1load client is executed to generate traffic.
This BUG_ON() reflected that some connections referencing the server
targeted for deletion remained, even though the server idle list was
empty.
In fact, this is caused by connections scheduled for purging. These
connections are moved from idle server list to a global toremove_list
while still being accounted by the server.
A first approach could be to decrement the server idle counter while
moving a connection to the purge list. However, this is functionally
incorrect as these purgeable connections still reference the server and
it could cause a crash if they were cleared after it.
The correct fix for this issue is simply to remove every purgeable
connection before a server is deleted. This is implemented by this patch
by extending cli_parse_delete_server(). It could be enough to only
remove connections targeting the deleted server, but as these
connections will be purged anyway it is justified to clear the whole
list.
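In spirit, the added step resembles this hypothetical sketch (the
helper name is made up):
/* before freeing the server, flush every thread's purge list so
 * that no purgeable connection still references it
 */
for (i = 0; i < global.nbthread; i++)
	conn_purge_toremove_list(i); /* hypothetical helper */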
This must not be backported, unless the above-mentioned patch is.
MEDIUM: hlua: take nbthread into account in hlua_get_nb_instruction()
Based on Willy's idea (from 3.0-dev6 announcement message): in this patch
we try to reduce the max latency that can be caused by running lua scripts
with default settings.
Indeed, by default, the hlua engine is allowed to process up to 10k
instructions per batch. While this value was found to be the optimal one
for a single thread, it turns out that keeping a thread busy for 10k lua
instructions could increase thread contention. This is especially true
when the script is loaded with 'lua-load', because in that case the
current thread owns the main lua lock and prevents other threads from
making any progress if they're also waiting on the main lock.
Thanks to Thierry Fournier's work, we know that optimal performance is
reached by sticking between 500 and 10k instructions per batch. Given
that, when the script is loaded using 'lua-load', if no
"tune.lua.forced-yield" was set by the user, we automatically divide the
default value (10K) by the number of threads haproxy can use, to reduce
thread contention (given that all threads could compete for the main lua
lock). However, we make sure not to return a value below 500, because
Thierry's work showed that this would come with a significant
performance loss.
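The resulting heuristic can be sketched as follows (a minimal sketch:
tune_lua_forced_yield is an assumed variable name, the 'lua-load' check
is omitted, and only the 500/10K bounds and the per-thread division come
from the text):
static int hlua_get_nb_instruction(void)
{
	/* a user-enforced "tune.lua.forced-yield" always wins */
	if (tune_lua_forced_yield)
		return tune_lua_forced_yield;

	/* divide the 10K default by the thread count, floored at 500 */
	return MAX(500, 10000 / global.nbthread);
}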
The historical behavior may still be enforced by setting
"tune.lua.forced-yield" to 10000 in the global config section.
William Manley [Wed, 8 May 2024 10:43:11 +0000 (11:43 +0100)]
MINOR: rhttp: Don't require SSL when attach-srv name parsing
An attach-srv config line usually looks like this:
tcp-request session attach-srv be/srv name ssl_c_s_dn(CN)
while a rhttp server line usually looks like this:
server srv rhttp@ sni req.hdr(host)
The server sni argument is used as a key for looking up connections in
the connection pool. The attach-srv name argument is used as a key for
inserting connections into the pool. For this to work correctly they
must match. There was a check that either both the attach-srv and server
provide that key or neither does.
It also checked that SSL and SNI were activated on the server. However,
thanks to the current connect_server() implementation, it appears that
SNI is usable even without SSL to identify a connection in the pool.
Thus, it can be diverted from its original intent in the reverse HTTP
case to serve even without SSL activated. For example, this could be
useful to use `fc_pp_unique_id` as a name expression (DISCLAIMER: note
that for now PROXY protocol is not compatible with rhttp).
An error is still reported if either SNI or name is used without the
other. This patch adjusts the message to a more helpful one.
Arguably it would be easier to understand if instead of using `name` and
`sni` for `attach-srv` and `server` rules it used the same term in both
places - like "conn-pool-key" or something. That would make it clear
that the two must match.