MEDIUM: ssl: {ca,crt}-ignore-err can now use error constant name
The ca-ignore-err and crt-ignore-err directives are now able to use the
openssl X509_V_ERR constant names instead of the numerical values.
This allows a configuration to survive an OpenSSL upgrade, because the
numerical ID can change between versions. For example
X509_V_ERR_INVALID_CA was 24 in OpenSSL 1 and is 79 in OpenSSL 3.
The list of errors must be updated when a new major OpenSSL version is
released.
The CRT and CA verify error codes were stored in 6 bits each in the
xprt_st field of the ssl_sock_ctx, meaning that only error codes up to 63
could be stored. Likewise, the ca-ignore-err and crt-ignore-err options
relied on two unsigned long longs that were used as bitfields for all
the ignored error codes. On the latest OpenSSL 1.1.1, and with OpenSSL 3
and newer, verify errors have exceeded this value, so these two storages
had to be increased. The error codes are now stored on 7 bits each and
the ignore-err bitfields are replaced by a big enough array and
dedicated bit get and set functions.
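As an illustration only, here is a minimal C sketch of such bit get/set
helpers; the names and the 128-entry bound are assumptions, not the actual
haproxy code:

    #define IGN_ERR_MAX 128   /* assumed upper bound on X509_V_ERR_* codes */
    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    struct ign_errs {
        unsigned long bits[(IGN_ERR_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD];
    };

    static inline void ign_err_set(struct ign_errs *e, unsigned int err)
    {
        if (err < IGN_ERR_MAX)
            e->bits[err / BITS_PER_WORD] |= 1UL << (err % BITS_PER_WORD);
    }

    static inline int ign_err_get(const struct ign_errs *e, unsigned int err)
    {
        if (err >= IGN_ERR_MAX)
            return 0;
        return !!(e->bits[err / BITS_PER_WORD] & (1UL << (err % BITS_PER_WORD)));
    }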
It can be backported on all stable branches.
[wla: let it be tested a little while before backport]
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
When running HAProxy with OpenSSL 3, the two BIGNUMs used to build our
own DH parameters are not freed. It was not necessary previously because
ownership of those parameters was transferred to OpenSSL through the
DH_set0_pqg call.
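For illustration, a hedged sketch of the ownership rule involved (simplified,
not the actual haproxy code): with DH_set0_pqg() the BIGNUMs become owned by
the DH object and are freed with it, while on the OpenSSL 3 path the caller
keeps ownership and must free them itself:

    #include <openssl/bn.h>
    #include <openssl/dh.h>
    #include <openssl/opensslv.h>

    static void dh_params_ownership_example(void)
    {
        BIGNUM *p = BN_new(), *g = BN_new();

        BN_set_word(g, 2);
        /* ... fill <p> with the DH prime ... */
    #if OPENSSL_VERSION_NUMBER < 0x30000000L
        DH *dh = DH_new();

        DH_set0_pqg(dh, p, NULL, g); /* ownership of <p> and <g> moves to <dh> */
        DH_free(dh);                 /* frees <p> and <g> as well */
    #else
        /* OpenSSL 3: the parameters are passed by other means and ownership
         * is not transferred, so they must be freed explicitly. */
        BN_free(p);
        BN_free(g);
    #endif
    }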
BUG/MINOR: httpclient: fixed memory allocation for the SSL ca_file
The memory for the SSL ca_file was allocated only once (in the function
httpclient_create_proxy()) and that pointer was assigned to each created
proxy that the HTTP client uses. This would not be a problem if this
memory was not freed in each individual proxy when it was deinitialized
in the function ssl_sock_free_srv_ctx().
Memory allocation:
src/http_client.c, function httpclient_create_proxy():
1277: if (!httpclient_ssl_ca_file)
1278: httpclient_ssl_ca_file = strdup("@system-ca");
1280: srv_ssl->ssl_ctx.ca_file = httpclient_ssl_ca_file;
Memory deallocation:
src/ssl_sock.c, function ssl_sock_free_srv_ctx():
5613: ha_free(&srv->ssl_ctx.ca_file);
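One possible shape of the fix, as a sketch only (the variable names come from
the excerpt above, the exact patch may differ): give each created proxy its own
copy of the string so that the per-proxy ha_free() stays valid:

    /* each created proxy gets its own copy, freed in ssl_sock_free_srv_ctx() */
    if (httpclient_ssl_ca_file)
        srv_ssl->ssl_ctx.ca_file = strdup(httpclient_ssl_ca_file);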
Amaury Denoyelle [Tue, 25 Oct 2022 09:38:21 +0000 (11:38 +0200)]
BUG/MINOR: quic: fix race condition on datagram purging
Each datagram is received by a random thread and dispatched to its
destination thread linked to the connection. The datagram is then
handled by the connection thread. Once this is done, the datagram buffer
pointer is atomically set to NULL to mark it as consumed.
Consumed datagrams are purged before the recvfrom() invocation on random
receiver threads. The check for a NULL buffer must thus be done
atomically. This was not the case before this patch, which may have
triggered race conditions.
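A hedged sketch of the pattern described above, written with C11 atomics for
illustration (haproxy uses its own HA_ATOMIC_* macros and its own datagram
structure):

    #include <stdatomic.h>
    #include <stddef.h>

    struct dgram_sketch {
        char *_Atomic buf;   /* set to NULL once the datagram is consumed */
    };

    /* connection thread: mark the datagram as consumed once handled */
    static void dgram_mark_consumed(struct dgram_sketch *d)
    {
        atomic_store_explicit(&d->buf, NULL, memory_order_release);
    }

    /* receiver thread: the purge check must read the pointer atomically */
    static int dgram_is_consumed(struct dgram_sketch *d)
    {
        return atomic_load_explicit(&d->buf, memory_order_acquire) == NULL;
    }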
Amaury Denoyelle [Thu, 27 Oct 2022 15:56:27 +0000 (17:56 +0200)]
MINOR: quic: add counter for interrupted reception
Add a new counter "quic_rxbuf_full". It is incremented each time
quic_sock_fd_iocb() is interrupted on a full buffer.
This should help to debug github issue #1903. It is suspected that
QUIC receiver buffers are full, which in turn causes quic_sock_fd_iocb()
to be called repeatedly, resulting in high CPU consumption.
BUG/MINOR: log: fixing bug in tcp syslog_io_handler Octet-Counting
syslog_io_handler performs specific processing to handle syslog TCP octet
counting:
The logic was good, but a sneaky mistake prevented RFC 6587 octet counting
from working properly: trash.area was used as an input buffer, which does
not make sense here since it is uninitialized. Compilation was unaffected
because trash is a thread-local "global" variable.
Subscribing was not properly designed between the quic-conn and quic MUX
layers. Align this with other haproxy components: the <subs> field is
moved from the MUX to the quic-conn structure. All mentions of the qcc MUX
are cleaned up in quic_conn_subscribe()/quic_conn_unsubscribe().
Thanks to this change, ACK reception notification has been simplified.
It's now unnecessary to check for the MUX existence before waking it.
Instead, if the quic-conn <subs> field is set, just wake up the upper layer
tasklet without mentioning the MUX. This should probably be extended to
other parts of the quic-conn code.
A specialized listener accept was previously used for QUIC. This is now
unneeded and we can revert to the default one, session_accept_fd().
One change of importance is that the call order between
conn_xprt_start() and conn_complete_session() is now reverted to the
default one. This means that the MUX instance is now NULL during
qc_xprt_start() and its app-ops layer cannot be set here. This operation
has been delayed to qc_init() to prevent a segfault.
BUG/MAJOR: stick-table: don't process store-response rules for applets
The commit bc7c207f74 ("BUG/MAJOR: stick-tables: do not try to index a
server name for applets") tried to catch applets case when we tried to index
the server name. However, there is still an issue. The applets are
unconditionally casted to servers and this bug exists since a while. it's
just luck if it doesn't crash.
Now, when store rules are processed, we skip the rule if the stream's target
is not a server or, of course, if it is a server but the "non-stick" option
is set. However, we still take care to release the sticky session.
This patch must be backported to all stable versions.
The error check on the certificate chain was ignoring all decoding errors,
silently hiding some of them.
This patch fixes the issue by being stricter on errors when reading the
chain. This is a change of behavior: it could break existing setups that
have a wrong chain.
MINOR: ssl: add the SSL error string before the chain
Add the SSL error string when failing to load a certificate in
ssl_sock_load_pem_into_ckch(). It's difficult to know what happens when no
descriptive error is emitted. This one is for the certificate before
trying to load the complete chain.
MINOR: ssl: add the SSL error string when failing to load a certificate
Add the SSL error string when failing to load a certificate in
ssl_sock_load_pem_into_ckch(). It's difficult to know what happens when no
descriptive error is emitted.
Example:
[ALERT] (1264006) : config : parsing [ssl_default_server.cfg:51] : 'bind /tmp/ssl.sock' in section 'listen' : unable to load certificate chain from file 'reg-tests/ssl//common.pem': ASN no PEM Header Error
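As a sketch of what is meant by "SSL error string" (illustrative only; the
real code lives in ssl_sock_load_pem_into_ckch() and uses haproxy's error
formatting):

    #include <openssl/err.h>
    #include <stdio.h>

    static void report_cert_load_error(const char *path)
    {
        char errbuf[256];

        /* fetch the last OpenSSL error and turn it into a readable string */
        ERR_error_string_n(ERR_get_error(), errbuf, sizeof(errbuf));
        fprintf(stderr, "unable to load certificate from file '%s': %s\n",
                path, errbuf);
    }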
BUG/MINOR: sink: Set default connect/server timeout for implicit ring buffers
Ring buffers may be implicitly created from log declarations when "tcp@",
"tcp6@", "tcp4@" or "uxst@" prefixes are used. These ring buffers rely on
unconfigurable proxies. While connect and server timeouts should be defined for
explicit ring buffers, it is not possible for implicit ones. However, a default
value must be set, and TICK_ETERNITY is not an acceptable one.
Thus, "1s" is now set for the connect timeout and "5s" for the server one.
BUG/MINOR: sink: Only use backend capability for the sink proxies
When a ring section is parsed, a proxy is created. For now, it has the
frontend (PR_CAP_FE) and the internal (PR_CAP_INT) capabilities, in addition
to the expected backend capability (PR_CAP_BE).
The PR_CAP_INT capability was added to silence warnings triggered because of
the PR_CAP_FE capability. Indeed, because the proxy is declared as a frontend,
warnings about missing bind lines and missing client timeout would be
triggered during the configuration parsing. These warnings are inhibited
because the PR_CAP_INT capability is set. It is an issue on the 2.4 because
the PR_CAP_INT capability does not exist there, so warnings are always emitted.
But the true bug is that these proxies should not have the PR_CAP_FE and
PR_CAP_INT capabilities. Removing these capabilities is enough to remove any
warnings on the 2.4, with no regression on higher versions. However, it may
be a good idea to evaluate whether a dedicated frontend for sinks should be
added or not. This way, a true frontend would be used to start the sink
applets. In addition, proxy capabilities/modes have to be reviewed to get a
less ambiguous API. For instance a dedicated mode for sinks (PR_MODE_SINK ?)
may be added. Finally, it could be very nice to have all proxies in the same
list, including internal ones.
This patch should fix the issue #1900. It must be backported as far as 2.4.
Emeric Brun [Mon, 24 Oct 2022 08:04:59 +0000 (10:04 +0200)]
MINOR: peers: handle multiple resync requests using shards
We considered the resync process finished once a full resync request
ended with the reception of the "resync-finish" message. But in the case
of "shards", each node declared with a "shard" has only a partial view
of the table, and the resync process ended whereas the original
peer tables content contains only a "shard" of the full content.
This patch allows retrieving the entire tables by requesting a resync
from all the different "shards".
To do so, we don't commit the end of a resync process when receiving a
"resync-finish" if the node is part of a "shard"; we only flag this
peer and all peers using the same shard as "notup2date", as if we had
received a "resync-partial" message, and we re-schedule a resync
request as is done when receiving a "resync-partial" message.
Doing this, the peers flagged "notup2date" won't be addressed during the
next resync request round, and the next resync request will be sent to
a shard not yet requested.
When receiving a "resync-finish" message, we also check if all peers using
"shards" are flagged "notup2date". It means that all peers have been
addressed and we can consider the resync process finished.
Note also that the "resync request" scheduler already handles a timeout:
if we are not able to retrieve a full resync after a delay, the
resync process is ended.
This patch should be backported to all versions handling "shard"
on peer lines.
Add "shards" new keyword for "peers" section to configure the number
of peer shards attached to such secions. This impact all the stick-tables
attached to the section.
Add "shard" new "server" parameter to configure the peers which participate to
all the stick-tables contents distribution. Each peer receive the stick-tables updates
only for keys with this shard value as distribution hash. The "shard" value
is stored in ->shard new server struct member.
cfg_parse_peers() which is the function which is called to parse all
the lines of a "peers" section is modified to parse the "shards" parameter
stored in ->nb_shards new peers struct member.
Add srv_parse_shard() new callback into server.c to pare the "shard"
parameter.
Implement stksess_getkey_hash() to compute the distribution hash for a
stick-table key as the 64-bits xxhash of the key concatenated to the stick-table
name. This function is called by stksess_setkey_shard(), itself
called by the already implemented function which create a new stick-table
key (stksess_new()).
Add ->idlen new stktable struct member to store the stick-table name length
to not have to compute it each time a stick-table key hash is computed.
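A hedged sketch of the distribution principle described above (illustrative
names, not the actual stksess_getkey_hash() code): the 64-bit xxhash over the
stick-table name and the key determines the shard:

    #include <stddef.h>
    #include <stdint.h>
    #include "xxhash.h"

    static uint64_t sktable_key_hash(const char *tbl_name, size_t tbl_len,
                                     const void *key, size_t key_len)
    {
        XXH64_state_t st;

        XXH64_reset(&st, 0);
        XXH64_update(&st, tbl_name, tbl_len);  /* table name ... */
        XXH64_update(&st, key, key_len);       /* ... and the key itself */
        return XXH64_digest(&st);
    }

    /* map the hash to one of <nb_shards> shards (illustrative mapping) */
    static int key_shard(uint64_t hash, int nb_shards)
    {
        return nb_shards ? (int)(hash % (uint64_t)nb_shards) + 1 : 0;
    }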
BUG/MEDIUM: compression: handle rewrite errors when updating response headers
When an HTTP response is compressed by HAProxy, the headers are updated.
However it is possible to encounter a rewrite error because the buffer is
full. In this case, the compression is aborted. Thus, we must be sure to
leave the response in a valid state.
For now, it is an issue because the "Content-Encoding" header is added
before all other header manipulations. So if the compression is aborted on
error, the "Content-Encoding" header may remain while the payload is not
compressed.
So now, we take care to leave a valid response on error by reordering
the header manipulations. It is too painful to really roll back all changes,
especially for an edge case.
This patch should be backported as far as 2.0. Note that on the 2.0, the
legacy HTTP part is also concerned.
Amaury Denoyelle [Fri, 21 Oct 2022 15:02:18 +0000 (17:02 +0200)]
BUG/MINOR: mux-quic: complete flow-control for uni streams
Max stream data was not enforced nor respected for local/remote uni
streams. Previously, qcs instances incorrectly reused the limit defined
for bidirectional ones.
This is now fixed. Two fields are added to the qcc connection structure:
* the value of the local flow control to enforce on remote uni streams
* the value of the remote flow control to respect on local uni streams
These two values can be reused to properly initialize the msd field of a
qcs instance in qcs_new(). The rest of the code is similar.
This macro may be used to append an item to an existing list, like
MT_LIST_APPEND. But here the item will be forced into a locked/busy state
prior to appending, so that it is already referenced in the list while
still preventing concurrent accesses until we decide to unlock it.
The macro returns a struct mt_list "np", which is needed at unlock time
by the regular MT_LIST_UNLOCK_ELT macro.
The MT_LIST_LOCK_ELT macro was documented with an ambiguous usage
restriction, implying that concurrent list deletion was not supported.
But it seems that either the code has evolved, or the comment is wrong,
because the locking behavior implemented here is exactly the same one
used in MT_LIST_DELETE, and no such restriction is described for
MT_LIST_DELETE.
I made some tests to make sure concurrent MT_LIST_DELETE (or deletion
from mt_list_for_each_entry_safe) don't cause unexpected results.
At the present time, this macro is not used; this fix only targets
upcoming developments that might rely on it.
MINOR: mworker/cli: do not try to dump the startup-logs w/o USE_SHM_OPEN
When haproxy is compiled without USE_SHM_OPEN, do not try to dump the
startup-logs in the "reload" output, because it won't show anything
interesting.
CI: github: dump the backtrace of coredumps in the alpine container
This patch allows to show the backtrace of a coredump produced in the
alpine/musl jobs.
It activates some options required by the containers to allow the
production of coredumps and sets a shared directory so the kernel can dump
the core within the container. Some debug packages were also added.
REGTESTS: httpclient/lua: test the lua task timeout with the httpclient
Test the httpclient when the lua action times out. The lua timeout is
reached before the httpclient has ended. This tests that the httpclients
are correctly cleaned up when destroying the hlua context.
BUG/MEDIUM: httpclient: check if the httpclient was released in the IO handler
Upon an applet_release(), the applet can be scheduled again and a call to
the IO handler is still possible. When the struct httpclient is already
freed, the IO handler could try to access it.
This patch fixes the issue by setting svcctx to NULL in the
applet_release, and checking its value in the IO handler.
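A simplified sketch of the pattern described above (hypothetical structure and
function names, not the exact haproxy code):

    struct appctx_sketch {
        void *svcctx;   /* points to the struct httpclient while it is alive */
    };

    static void httpclient_release_sketch(struct appctx_sketch *appctx)
    {
        /* ... free the struct httpclient here ... */
        appctx->svcctx = NULL;   /* mark the context as released */
    }

    static void httpclient_io_handler_sketch(struct appctx_sketch *appctx)
    {
        struct httpclient *hc = appctx->svcctx;

        if (!hc)
            return;              /* applet already released: nothing to do */
        /* ... normal I/O handling on <hc> ... */
    }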
BUG/MEDIUM: httpclient/lua: crash when the lua task timeout before the httpclient
When the lua task finishes before the httpclients that are associated with
it, there is a risk that an httpclient tries to task_wakeup() the lua
task, which does not exist anymore.
To fix this issue, the httpclients used in a lua task are stored in a
list, and they are destroyed at the end of the lua task.
The connect timeout in a ring section was not properly parsed. Thus, it was
never set and the server timeout could be overwritten, depending on the
order of the directives. The first char of the keyword must be tested, not
the third one.
This patch is related to the issue #1900. But it does not fix the issue. It
must be backported as far as 2.4.
BUG/MINOR: log: Preserve message facility when the log target is a ring buffer
When a ring is used as a log target, the original facility, if any, must be
preserved. The default facility must only be used if no facility was
found in the incoming log message.
This patch should fix the issue #1901. It must be backported as far as 2.4.
Amaury Denoyelle [Mon, 17 Oct 2022 09:13:07 +0000 (11:13 +0200)]
MINOR: quic: extend Retry token check function
On Initial packet reception, the token is checked for validity through the
quic_retry_token_check() function. However, some related parts were left
in the parent function quic_rx_pkt_retrieve_conn(). Move this code
directly into quic_retry_token_check() to facilitate its call in various
contexts.
The API of quic_retry_token_check() has also been refactored. Instead of
working on a plain char* buffer, it now uses a quic_rx_packet instance.
This helps to reduce the number of parameters.
This change will allow checking the Retry token even if data were received
with an FD-owned quic-conn socket. Indeed, in this case, the
quic_rx_pkt_retrieve_conn() call will probably be skipped.
Amaury Denoyelle [Mon, 17 Oct 2022 10:04:49 +0000 (12:04 +0200)]
MINOR: quic: refactor packet drop on reception
Sometimes, a packet is dropped on reception. Several goto statements are
used, mostly to increment a proxy drop counter or to drop the packet
silently. However, these labels are interleaved. Rearrange the goto labels
to simplify this process:
* the drop label is used to drop a packet with counter incrementation. This
is the default method.
* drop_silent is the next label, which does the same thing but skips the
counter incrementation. This is useful when we do not need to report
the packet dropping operation.
Amaury Denoyelle [Wed, 19 Oct 2022 13:37:44 +0000 (15:37 +0200)]
MINOR: quic: split and rename qc_lstnr_pkt_rcv()
This change is the continuation of the qc_lstnr_pkt_rcv() refactoring. This
function has finally been split into several ones.
The first half is renamed quic_rx_pkt_parse(). This function is
responsible for parsing a QUIC packet header and calculating the packet
length.
QUIC connection retrieval has been extracted and is now called directly
by quic_lstnr_dghdlr().
The second half of qc_lstnr_pkt_rcv() is renamed to qc_rx_pkt_handle().
This function is responsible for copying a QUIC packet's content to a
quic-conn receive buffer.
A third function named qc_rx_check_closing() is responsible for detecting if
the connection is already in the closing state. As this requires dropping the
whole datagram, it seems justified to keep it in a separate function.
This change has no functional impact. It is part of a refactoring series
on qc_lstnr_pkt_rcv(). The objective is to facilitate the integration of
FD-owned quic-conn socket patches.
Amaury Denoyelle [Wed, 19 Oct 2022 13:28:44 +0000 (15:28 +0200)]
MINOR: quic: extract connection retrieval
Simplify qc_lstnr_pkt_rcv() by extracting the code responsible for retrieving
the quic-conn instance. This code is put in a dedicated function named
quic_rx_pkt_retrieve_conn(). This new function could be skipped if an
FD-owned quic-conn socket is used.
The first traces of qc_lstnr_pkt_rcv() have been cleaned up as the qc instance
is always NULL there: thus the qc parameter can be removed without any
change.
This change has no functional impact. It is a part of a refactoring
series on qc_lstnr_pkt_rcv(). The objective is to facilitate the integration
of FD-owned socket patches.
Amaury Denoyelle [Wed, 19 Oct 2022 15:14:28 +0000 (17:14 +0200)]
MINOR: quic: define first packet flag
Received packet treatment differs a bit depending on whether the packet is
the first one of the encapsulating datagram or not. Previously, this was
indicated via a function argument. Simplify this by defining a new Rx packet
flag named QUIC_FL_RX_PACKET_DGRAM_FIRST.
This change has no functional impact. It will simplify the API when
qc_lstnr_pkt_rcv() is broken into several functions: their number of
arguments will be reduced thanks to this patch.
Amaury Denoyelle [Mon, 17 Oct 2022 16:05:26 +0000 (18:05 +0200)]
MINOR: quic: extend pn_offset field from quic_rx_packet
The pn_offset field was only set if header protection could not be removed.
Extend the usage of this field: it is now set every time on packet
parsing in qc_lstnr_pkt_rcv().
This change helps to clean up the API of the Rx functions by removing
unnecessary variables and function arguments.
This change has no functional impact. It is a part of a refactoring
series on qc_lstnr_pkt_rcv(). The objective is to facilitate the integration
of FD-owned socket patches.
Amaury Denoyelle [Mon, 17 Oct 2022 16:05:18 +0000 (18:05 +0200)]
MINOR: quic: add version field on quic_rx_packet
Add a new version field to the quic_rx_packet structure. It is set on
header parsing in the qc_lstnr_pkt_rcv() function.
This change has no functional impact. It is a part of a refactoring
series on qc_lstnr_pkt_rcv(). The objective is to facilitate the integration
of FD-owned socket patches.
Amaury Denoyelle [Tue, 18 Oct 2022 09:05:02 +0000 (11:05 +0200)]
BUG/MINOR: quic: fix buffer overflow on retry token generation
When generating a Retry token, the client CID is used as encryption input.
The client must reuse the same CID when emitting the token in a new
Initial packet.
A memory overflow can occur in quic_generate_retry_token() depending on
the size of the client CID. This is because the space reserved for <aad> only
accounted for QUIC_HAP_CID_LEN (the size of haproxy's own generated CIDs).
However, the client CID size depends only on the client and is instead
limited to QUIC_CID_MAXLEN as specified in RFC 9000.
This was reproduced with ngtcp2 and haproxy built with ASAN. Here is the error
log :
==14964==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fffee228cee at pc 0x7ffff785f427 bp 0x7fffee2289e0 sp 0x7fffee228188
WRITE of size 17 at 0x7fffee228cee thread T5
#0 0x7ffff785f426 in __interceptor_memcpy /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827
#1 0x555555906ea7 in quic_generate_retry_token_aad src/quic_conn.c:5452
#2 0x555555907e72 in quic_retry_token_check src/quic_conn.c:5577
#3 0x55555590d01e in qc_lstnr_pkt_rcv src/quic_conn.c:6103
#4 0x5555559190fa in quic_lstnr_dghdlr src/quic_conn.c:7179
#5 0x555555eb0abf in run_tasks_from_lists src/task.c:590
#6 0x555555eb285f in process_runnable_tasks src/task.c:855
#7 0x555555d9118f in run_poll_loop src/haproxy.c:2853
#8 0x555555d91f88 in run_thread_poll_loop src/haproxy.c:3042
#9 0x7ffff709f8fc (/usr/lib/libc.so.6+0x868fc)
#10 0x7ffff7121a5f (/usr/lib/libc.so.6+0x108a5f)
src/qmux_trace.c:80:45: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘uint64_t’ {aka ‘const long long unsigned int’} [-Werror=format=]
80 | chunk_appendf(&trace_buf, " qcs=%p .id=%lu .st=%s",
| ~~^
| |
| long unsigned int
| %llu
81 | qcs, qcs->id,
| ~~~~~~~
| |
| uint64_t {aka const long long unsigned int}
compilation terminated due to -Wfatal-errors.
Cast the remaining uint64_t variables to ullong, with %llu as the printf
format, and the size_t ones to ulong, with %lu.
Thank you to Ilya for having reported this issue in GH #1899.
Amaury Denoyelle [Mon, 17 Oct 2022 16:46:49 +0000 (18:46 +0200)]
BUILD: ssl_sock: fix null dereference for QUIC build
A previous commit tried to fix an uninitialized GCC warning in the ssl code
for the QUIC build. See the fix here: 48e46f98ccf97427995eb41c6f28cc38705bdd7e
BUILD: ssl_sock: bind_conf uninitialized in ssl_sock_bind_verifycbk()
However, this is incomplete, as it still reports a possible NULL
dereference on the ctx variable (GCC v12.2.0). Here is the compilation
result:
src/ssl_sock.c: In function ‘ssl_sock_bind_verifycbk’:
src/ssl_sock.c:1739:12: error: potential null pointer dereference [-Werror=null-dereference]
1739 | ctx->xprt_st |= SSL_SOCK_ST_FL_VERIFY_DONE;
|
To fix this, remove the check on qc, which can also never happen, and
replace it with a BUG_ON(). This seems to satisfy GCC on my machine.
Willy Tarreau [Fri, 14 Oct 2022 18:45:23 +0000 (20:45 +0200)]
[RELEASE] Released version 2.7-dev8
Released version 2.7-dev8 with the following main changes :
- BUG/MINOR: checks: update pgsql regex on auth packet
- DOC: config: Fix pgsql-check documentation to make user param mandatory
- CLEANUP: mux-quic: remove usage of non-standard ull type
- CLEANUP: quic: remove global var definition in quic_tls header
- BUG/MINOR: quic: adjust quic_tls prototypes
- CLEANUP: quic: fix headers
- CLEANUP: quic: remove unused function prototype
- CLEANUP: quic: remove duplicated varint code from xprt_quic.h
- CLEANUP: quic: create a dedicated quic_conn module
- BUG/MINOR: mux-quic: ignore STOP_SENDING for locally closed stream
- BUG/MEDIUM: lua: Don't crash in hlua_lua2arg_check on failure
- BUG/MEDIUM: lua: handle stick table implicit arguments right.
- BUILD: h1: silence an initiialized warning with gcc-4.7 and -Os
- MINOR: fd: add a new function to only raise RLIMIT_NOFILE
- MINOR: init: do not try to shrink existing RLIMIT_NOFIlE
- BUG/MINOR: http-fetch: Update method after a prefetch in smp_fetch_meth()
- BUILD: http_fetch: silence an uninitiialized warning with gcc-4/5/6 at -Os
- BUG/MINOR: hlua: hlua_channel_insert_data() behavior conflicts with documentation
- MINOR: quic: limit usage of ssl_sock_ctx in favor of quic_conn
- MINOR: mux-quic: check quic-conn return code on Tx
- CLEANUP: quic: fix indentation
- MEDIUM: quic: retrieve frontend destination address
- CLEANUP: Reapply ist.cocci (2)
- CLEANUP: Reapply strcmp.cocci
- CLEANUP: quic/receiver: remove the now unused tx_qring list
- BUG/MINOR: quic: set IP_PKTINFO socket option for QUIC receivers only
- MINOR: hlua: some luaL_checktype() calls were not guarded with MAY_LJMP
- DOC: configuration: missing 'if' in tcp-request content example
- MINOR: hlua: removing ambiguous lua_pushvalue with 0 index
- BUG/MAJOR: stick-tables: do not try to index a server name for applets
- MINOR: plock: support disabling exponential back-off
- MINOR: freq_ctr: use the thread's local time whenever possible
- MEDIUM: stick-table: switch the table lock to rwlock
- MINOR: stick-table: do not take an exclusive lock when downing ref_cnt
- MINOR: stick-table: move the write lock inside stktable_touch_with_exp()
- MEDIUM: stick-table: only take the lock when needed in stktable_touch_with_exp()
- MEDIUM: stick-table: make stksess_kill_if_expired() avoid the exclusive lock
- MEDIUM: stick-table: return inserted entry in __stktable_store()
- MEDIUM: stick-table: free newly allocated stkess if it couldn't be inserted
- MEDIUM: stick-table: switch to rdlock in stktable_lookup() and lookup_key()
- MEDIUM: stick-table: make stktable_get_entry() look up under a read lock
- MEDIUM: stick-table: do not take a lock to update t->current anymore.
- MEDIUM: stick-table: make stktable_set_entry() look up under a read lock
- MEDIUM: stick-table: requeue the expiration task out of the exclusive lock
- MINOR: stick-table: split stktable_store() between key and requeue
- MEDIUM: stick-table: always use atomic ops to requeue the table's task
- MEDIUM: stick-table: requeue the wakeup task out of the write lock
- BUG/MINOR: stick-table: fix build with DEBUG_THREAD
- REORG: mux-fcgi: Extract flags and enums into mux_fcgi-t.h
- MINOR: flags/mux-fcgi: Decode FCGI connection and stream flags
- BUG/MEDIUM: mux-h1: Add connection error handling when reading/sending on a pipe
- BUG/MEDIUM: mux-h1: Handle abort with an incomplete message during parsing
- BUG/MINOR: server: make sure "show servers state" hides private bits
- MINOR: checks: use the lighter PRNG for spread checks
- MEDIUM: checks: spread the checks load over random threads
- CI: SSL: use proper version generating when "latest" semantic is used
- CI: SSL: temporarily stick to LibreSSL=3.5.3
- MINOR: quic: New quic_cstream object implementation
- MINOR: quic: Extract CRYPTO frame parsing from qc_parse_pkt_frms()
- MINOR: quic: Use a non-contiguous buffer for RX CRYPTO data
- BUG/MINOR: quic: Stalled 0RTT connections with big ClientHello TLS message
- MINOR: quic: Split the secrets key allocation in two parts
- CLEANUP: quic: remove unused rxbufs member in receiver
- CLEANUP: quic: improve naming for rxbuf/datagrams handling
- MINOR: quic: implement datagram cleanup for quic_receiver_buf
- MINOR: ring: ring_cast_from_area() cast from an allocated area
- MINOR: buffers: split b_force_xfer() into b_cpy() and b_force_xfer()
- MINOR: logs: startup-logs can use a shm for logging the reload
- MINOR: mworker/cli: reload command displays the startup-logs
- MEDIUM: quic: respect the threads assigned to a bind line
- DOC: management: update the "reload" command of the master CLI
- BUILD: ssl_sock: bind_conf uninitialized in ssl_sock_bind_verifycbk()
- BUG/MEDIUM: httpclient: Don't set EOM flag on an empty HTX message
- MINOR: httpclient/lua: Don't set req_payload callback if body is empty
- DOC/CLEANUP: lua-api: some minor corrections
- DOC: lua-api: updating toolbox link
- DOC/CLEANUP: lua-api: removing duplicate core.proxies attribute
- DOC: management: add forgotten "show startup-logs"
- DOC: management: "show startup-logs" for master CLI
- CI: Replace the deprecated `::set-output` command by writing to $GITHUB_OUTPUT in matrix.py
- CI: Replace the deprecated `::set-output` command by writing to $GITHUB_OUTPUT in workflow definition
The `::set-output` command is deprecated because processes during the workflow
execution might output untrusted information that might include the
`::set-output` command, thus allowing this untrusted information to hijack the
build.
The replacement is writing to the file indicated by the `$GITHUB_OUTPUT`
environment variable.
The link to the lua toolbox was dead (the project has been deprecated).
Adding a legacy link to get the old toolbox source code as well as
a link to luarocks, which seems to have superseded it.
MINOR: httpclient/lua: Don't set req_payload callback if body is empty
The HTTP client req_payload callback is set when a request payload
must be streamed. In lua, this callback is set when a body is passed as an
argument to one of the httpclient functions (head/get/post/put/delete).
However, there is no reason to set it if the body string is empty.
This patch is related to the issue #1898. It may be backported as far as
2.5.
BUG/MEDIUM: httpclient: Don't set EOM flag on an empty HTX message
In the HTTP client, when the request body is streamed, at the end of the
payload, we must be sure to not set the EOM flag on an empty message.
Otherwise, because there is no data, the buffer is reset to be released and
the flag is lost. Thus, the HTTP client is never notified of the end of
payload for the request and the applet is blocked. If the HTTP client is
instantiated from a Lua script, it is even worse because we fall into a
wakeup loop between the lua script and the HTTP client applet. In the end,
HAProxy is killed by the watchdog.
This patch should fix the issue #1898. It must be backported to 2.6.
BUILD: ssl_sock: bind_conf uninitialized in ssl_sock_bind_verifycbk()
Even if this cannot happen, ensure <bind_conf> is initialized in this
function to please some compilers.
Take the opportunity of this patch to replace an ABORT_NOW() with
a BUG_ON(), because if the variables they test are not initialized,
it is really because there is a bug.
Willy Tarreau [Thu, 13 Oct 2022 14:14:11 +0000 (16:14 +0200)]
MEDIUM: quic: respect the threads assigned to a bind line
Right now the QUIC thread mapping derives the thread ID from the CID
by dividing by global.nbthread. This is a problem because this makes
QUIC work on all threads and ignores the "thread" directive on the
bind lines. In addition, only 8 bits are used, which is no longer
compatible with the up to 4096 threads we may have in a configuration.
Let's modify it this way:
- the CID now dedicates 12 bits to the thread ID
- on output we continue to place the TID directly there.
- on input, the value is extracted. If it corresponds to a valid
thread number of the bind_conf, it's used as-is.
- otherwise it's used as a rank within the current bind_conf's
thread mask so that in the end we still get a valid thread ID
for this bind_conf.
The extraction function now requires a bind_conf in order to get the
group and thread mask. It was better to use bind_confs now as the goal
is to make them support multiple listeners sooner or later.
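A hedged sketch of that extraction logic (illustrative helper, not the actual
function; it assumes a single-group, non-empty 64-bit thread mask):

    #include <stdint.h>

    static unsigned int cid_bits_to_tid(unsigned int cid_bits,
                                        uint64_t thread_mask,
                                        unsigned int nbthread)
    {
        unsigned int tid = cid_bits & 0xFFF;  /* 12 bits carry the TID */
        unsigned int rank, i;

        /* if the value designates an enabled thread of this bind_conf, use it */
        if (tid < nbthread && (thread_mask & (1ULL << tid)))
            return tid;

        /* otherwise use it as a rank within the bind_conf's thread mask */
        rank = tid % (unsigned int)__builtin_popcountll(thread_mask);
        for (i = 0; i < 64; i++) {
            if ((thread_mask & (1ULL << i)) && !rank--)
                return i;
        }
        return 0; /* not reached when thread_mask is non-zero */
    }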
MINOR: logs: startup-logs can use a shm for logging the reload
When compiled with USE_SHM_OPEN=1, the startup-logs are now able to use
a shm which is used to keep the logs when switching to mworker wait
mode. This allows keeping the failed reload logs.
When allocating the startup-logs at the first start of the process, haproxy
will do a shm_open() with a unique path using the PID of the process; the
file is unlinked immediately so we don't leave unwanted files behind. The fd
resulting from this shm is stored in the HAPROXY_STARTUPLOGS_FD
environment variable so it can be mmapped again when switching to wait
mode.
When forking children, the process copies the mmap to a mallocated
ring so we never share the same memory section between the master and
the workers. When switching to wait mode, the shm is not used anymore as
it is also copied to a mallocated structure.
This allows using the "show startup-logs" command over the master CLI
to get the logs of the latest startup or reload. This way the logs of
the latest failed reload are also kept.
This is only activated on the linux-glibc target for now.
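A minimal POSIX sketch of the shm trick described above (illustrative only;
the actual paths, names and error handling differ):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *startup_logs_shm_alloc(size_t size, int *out_fd)
    {
        char path[64];
        void *area;
        int fd;

        snprintf(path, sizeof(path), "/hapx_startup_logs_%d", (int)getpid());
        fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return NULL;
        shm_unlink(path);   /* no file is left visible on the system */
        if (ftruncate(fd, (off_t)size) < 0)
            goto fail;
        area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (area == MAP_FAILED)
            goto fail;
        *out_fd = fd;       /* e.g. exported via HAPROXY_STARTUPLOGS_FD */
        return area;
     fail:
        close(fd);
        return NULL;
    }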
MINOR: buffers: split b_force_xfer() into b_cpy() and b_force_xfer()
Split b_force_xfer() into b_ncat() and b_force_xfer().
The previous b_force_xfer() implementation was basically a copy with a
b_del() on the src buffer. Keep this implementation to make b_ncat(), and
make b_force_xfer() simply call b_ncat() + b_del().
MINOR: quic: implement datagram cleanup for quic_receiver_buf
Each time data is read on the QUIC receiver socket, we try to reuse the
first datagram of the currently used quic_receiver_buf instead of
allocating a new one. This algorithm is suboptimal if there are several
unused datagrams, as only the first one is tested and its buffer removed
from the quic_receiver_buf.
If QUIC traffic is quite substantial, this can lead to a large number
of quic_dgram occurrences allocated from pool_head_quic_dgram and
a lack of free space in allocated quic_receiver_buf buffers.
To improve this, each time we want to reuse a datagram, we pop elements
until a not-yet-released datagram is found or the list is empty. All
intermediary elements are freed and the last found datagram can be
reused. This operation has been extracted into a dedicated function named
quic_rxbuf_purge_dgrams().
This should improve the memory consumption incurred by quic_dgram instances
under heavy QUIC traffic. Note that there is still room for improvement: if
the first datagram is still in use, it may block several unused datagrams
after it. However, this would require supporting removal of datagrams out of
order, which is currently not possible.
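A hedged sketch of the purge principle (illustrative singly-linked list, not
the real quic_rxbuf_purge_dgrams() code):

    #include <stdlib.h>

    struct dgram_node {
        struct dgram_node *next;
        char *buf;               /* NULL once the datagram has been consumed */
    };

    static void purge_consumed_dgrams(struct dgram_node **head)
    {
        struct dgram_node *d;

        /* pop and free consumed datagrams from the head of the list,
         * stop at the first one still in use (no out-of-order removal) */
        while ((d = *head) && d->buf == NULL) {
            *head = d->next;
            free(d);
        }
    }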
CLEANUP: quic: improve naming for rxbuf/datagrams handling
QUIC datagrams are read from a random thread. They are then redispatched
to the connection thread according to the first packet DCID. These
operations are implemented through a special buffer designed to avoid
locking.
Refactor this code with the following changes:
* <rxbuf> type is renamed <quic_receiver_buf>. Its list element is also
renamed to highlight its attach point to a receiver.
* <quic_dgram> and <quic_receiver_buf> definitions are moved to
quic_sock-t.h. This helps to reduce the size of quic_conn-t.h.
* <quic_dgram> list elements are renamed to highlight their attach points
into a <quic_receiver_buf> and a <quic_dghdlr>.
Amaury Denoyelle [Thu, 13 Oct 2022 08:11:36 +0000 (10:11 +0200)]
CLEANUP: quic: remove unused rxbufs member in receiver
rxbuf is the structure used to store QUIC datagrams and redispatch them
to the connection thread.
Each receiver manages a list of rxbuf. This was stored both as an array
and an mt_list. Currently, only the mt_list is needed, so the <rxbufs>
member was removed from the receiver structure.
MINOR: quic: Split the secrets key allocation in two parts
Implement quic_tls_secrets_keys_alloc()/quic_tls_secrets_keys_free() to allocate
the memory for only one direction (RX or TX).
Modify ha_quic_set_encryption_secrets() to call these functions for one of
these directions (or both). So, from now on, we can rely on the value of the
secret keys to know if they were derived.
Remove the QUIC_FL_TLS_SECRETS_SET flag which is no longer useful.
Consequently, the secrets are dumped by the traces only if derived.
BUG/MINOR: quic: Stalled 0RTT connections with big ClientHello TLS message
This issue was reproduced with the -Q picoquic client option to split a big
ClientHello message into two Initial packets, and haproxy as server without
any knowledge of any previous 0RTT session (restarted after a first 0RTT
session). The received 0RTT packets were removed from their queue when the
second Initial packet was parsed, and the QUIC handshake state never
progressed and remained at the Initial state.
To avoid such situations, after having treated some Initial packets we always
check if there are 0RTT packets to parse and we never remove them from their
queue. This will be done after the handshake is completed or upon idle
timeout expiration.
Also add more traces to be able to analyze the handshake progression.
MINOR: quic: Use a non-contiguous buffer for RX CRYPTO data
Implement quic_get_ncbuf() to dynamically allocate a new ncbuf to be attached to
any quic_cstream struct which needs such a buffer. Note that there is no quic_cstream
for 0RTT encryption level. quic_free_ncbuf() is added to release the memory
allocated for a non-contiguous buffer.
Modify qc_handle_crypto_frm() to call this function and allocate an ncbuf for
crypto data which are not received in order. The crypto data which are received
in order are not buffered but provided to the TLS stack (calling
qc_provide_cdata()).
Modify qc_treat_rx_crypto_frms(), which is called after having provided the
in-order received crypto data to the TLS stack, to provide again the remaining
crypto data which have been buffered, if possible (if they are in order). Each
time buffered CRYPTO data are consumed, we try to release the memory allocated
for the non-contiguous buffer (ncbuf).
Also move rx.crypto.offset quic_enc_level struct member to rx.offset quic_cstream
struct member.
MINOR: quic: New quic_cstream object implementation
Add new quic_cstream struct definition to implement the CRYPTO data stream.
This is a simplification of the qcs object (QUIC streams) for the CRYPTO data
without any information about the flow control. They are not attached to any
tree, but to a QUIC encryption level, one per encryption level except for
the early data encryption level (for 0RTT). A stream descriptor is also
allocated for each CRYPTO data stream.
Ilya Shipitsin [Tue, 11 Oct 2022 07:10:57 +0000 (12:10 +0500)]
CI: SSL: use proper version generating when "latest" semantic is used
both "OPENSSL_VERSION=latest" and "LIBRESSL_VERSION=latest" processing
introduced errors when build-ssl.sh script was invoked. that error
in turn led to skipping custom openssl build and haproxy was linked against
stock openssl, i.e. openssl-1.1.1
Willy Tarreau [Wed, 12 Oct 2022 18:58:18 +0000 (20:58 +0200)]
MEDIUM: checks: spread the checks load over random threads
The CPU usage pattern was found to be high (5%) on a machine with
48 threads and only 100 servers checked every second. That was
supposed to be only 100 connections per second, which should be very
cheap. It was figured that due to the check tasks unbinding from any
thread when going back to sleep, they're queued into the shared queue.
Not only this requires to manipulate the global queue lock, but in
addition it means that all threads have to check the global queue
before going to sleep (hence take a lock again) to figure how long
to sleep, and that they would all sleep only for the shortest amount
of time to the next check, one would pick it and all other ones would
go down to sleep waiting for the next check.
That's perfectly visible in time-to-first-byte measurements. A quick
test consisting in retrieving the stats page in CSV over a 48-thread
process checking 200 servers every 2 seconds shows the following tail:
One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, will cause trouble for dynamic servers
and may cause some skew in the load depending on some server patterns.
Instead here we take a different approach. A check remains bound to its
thread for as long as possible, but upon every wakeup, the thread's load
is compared with another random thread's load. If it's found that that
other thread's load is less than half of the current one's, the task is
bounced to that thread. In order to prevent that new thread from doing
the same, we set a flag "CHK_ST_SLEEPING" that indicates that it just
woke up and we're bouncing the task only on this condition.
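For illustration, a sketch of that rebalancing rule with assumed helpers
(thread_load() and nbthread are placeholders, not the actual haproxy API):

    #include <stdlib.h>

    extern unsigned int nbthread;                        /* assumed */
    extern unsigned int thread_load(unsigned int tid);   /* assumed helper */

    /* returns the thread the check should run on for its next wakeup */
    static unsigned int check_pick_thread(unsigned int cur_tid)
    {
        unsigned int other = (unsigned int)rand() % nbthread;

        if (other != cur_tid && thread_load(other) < thread_load(cur_tid) / 2)
            return other;    /* bounce the check to the less loaded thread */
        return cur_tid;      /* otherwise stay where we are */
    }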
Tests have shown that the initial load was very unfair before, with a few
check threads having a load of 15-20 and the vast majority having zero.
With this modification, after two "inter" delays, the load is either zero
or one everywhere when checks start. The same test shows a CPU usage that
significantly drops, between 0.5 and 1%. The same latency tail measurement
is much better, roughly 10 times smaller:
In fact one difference here is that many threads work, while in the past
they were waking up and going down to sleep after having perturbed the
shared lock. Thus it is anticipated that this will scale much more smoothly
than before. Under strace it's clearly visible that all threads are
sleeping for the time it takes to relaunch a check; there are no more
thundering herd wakeups.
However it is also possible that in some rare cases such as very short
check intervals smaller than a scheduler's timeslice (such as 4ms),
some users might have benefited from the work being concentrated on
fewer threads and would instead observe a small increase of apparent
CPU usage due to more total threads waking up, even if that's for less
work each and less total work. That's visible with 200 servers at 4ms,
where show activity shows that a few threads were overloaded and others
were doing nothing. It's not a problem though, as in practice checks are not
supposed to eat much CPU and wake up fast enough to represent a
significant load anyway, and the main issue they could have been
causing (aside from the global lock) is an increase in last-percentile
latency.
Willy Tarreau [Wed, 12 Oct 2022 19:48:17 +0000 (21:48 +0200)]
MINOR: checks: use the lighter PRNG for spread checks
There's no point using ha_random32() which is heavy and uses shared
variables to calculate a random timer when we have statistical_prng()
which does the same and was made exactly for this.
Willy Tarreau [Wed, 12 Oct 2022 19:40:31 +0000 (21:40 +0200)]
BUG/MINOR: server: make sure "show servers state" hides private bits
In the past we've seen "show servers state" dump some internal bits for
the check states, that were causing regtests to fail. The relevant bits
have been added to the doc to fix the public API and make sure they do
not change by accident, but the output doesn't take care of masking the
undesired ones, causing regtests (and possibly user programs) to fail
when new bits are added. Let's add the mask for the only documented ones
(0x0F for check and 0x1F for agent respectively).
This could be backported wherever the server state is present, though
there's a tiny risk that some undocumented bits might have already
leaked to some user scripts, so it might be wise to wait a bit before
doing that or even not to backport too far.
BUG/MEDIUM: mux-h1: Handle abort with an incomplete message during parsing
In h1_process_demux(), aborts for incomplete messages were not properly
handled. It was not an issue because the abort was detected later in
h1_process(). But it will be an issue when performing the aborts refactoring.
First, when a read0 was detected, the SE_FL_EOI flag was set for messages in
the DONE or TUNNEL state or for messages without a known length (so responses
in close mode). The last statement is not accurate. The message must also be
in the DATA state. Otherwise, the SE_FL_EOI flag may be set on an incomplete
message.
Then, an error was reported, via the SE_FL_ERROR flag, only when an incomplete
message was detected during payload parsing. It must also be reported if the
headers are incomplete. Here again, the error is detected later for now, but
it could be an issue later.
BUG/MEDIUM: mux-h1: Add connection error handling when reading/sending on a pipe
There is no error handling when we read or write on a pipe. The error is
caught later, in the mux I/O handler. But there is no reason not to do so
here.
There is no reason to backport it because no issue has been reported so far
because of this "bug". In any case, it must be evaluated first.
REORG: mux-fcgi: Extract flags and enums into mux_fcgi-t.h
The same was performed for the H2 and H1 multiplexers. FCGI connection and
stream flags are moved in a dedicated header file. It will be mainly used to
be able to decode mux-fcgi flags from the flags utility.
In this patch, we move the flags and enums to mux_fcgi-t.h, as well as the
two state decoding inline functions.
Willy Tarreau [Wed, 12 Oct 2022 10:04:01 +0000 (10:04 +0000)]
MEDIUM: stick-table: requeue the wakeup task out of the write lock
We don't need to call stktable_requeue_exp() with the table's lock
held anymore, so let's move it out. It should slightly reduce the
contention on the write lock, though it is now already quite low.
Willy Tarreau [Wed, 12 Oct 2022 10:00:50 +0000 (10:00 +0000)]
MEDIUM: stick-table: always use atomic ops to requeue the table's task
We're generalizing the change performed in previous commit "MEDIUM:
stick-table: requeue the expiration task out of the exclusive lock"
to stktable_requeue_exp() so that it can also be used by callers of
__stktable_store(). At the moment there's still no visible change
since it's still called under the write lock. However, the previous
code in stktable_touch_with_exp() was updated to use this function.
Willy Tarreau [Wed, 12 Oct 2022 09:56:08 +0000 (09:56 +0000)]
MINOR: stick-table: split stktable_store() between key and requeue
__stktable_store() performs two distinct things: one is to insert a key
and the other one is to requeue the task's expiration date. Since the
latter might be done without a lock, let's first split the function in
two halves. For now this has no impact.
Willy Tarreau [Wed, 12 Oct 2022 09:45:36 +0000 (09:45 +0000)]
MEDIUM: stick-table: requeue the expiration task out of the exclusive lock
With 48 threads, a heavily loaded table with plenty of trackers and
rules and a short expiration timer of 10ms saturates the CPU at 232k
rps. By carefully using atomic ops we can make sure that t->exp_next
and t->task->expire converge to the earliest next expiration date and
that all of this can be performed under atomic ops without any lock.
That's what this patch is doing in stktable_touch_with_exp(). This is
sufficient to double the performance and reach 470k rps.
It's worth noting that __stktable_store() uses a mix of eb32_insert()
and task_queue, and that the second part of it could possibly benefit
from this, even though sometimes it's called under a lock that was
already held.
Willy Tarreau [Wed, 12 Oct 2022 09:13:14 +0000 (09:13 +0000)]
MEDIUM: stick-table: make stktable_set_entry() look up under a read lock
On a 24-core machine having some "stick-store response" rules, a lot of
time is spent in the write lock in stktable_set_entry(). Let's apply the
same mechanism as for the stktable_get_entry() consisting in looking up
the value under the read lock and upgrading it to a write lock only to
perform modifications. Here we even have the luxury of upgrading the
lock since there are no alloc/free in the path. All this increases the
performance by 40% (from 363k to 510k rps).
Willy Tarreau [Tue, 11 Oct 2022 14:19:35 +0000 (16:19 +0200)]
MEDIUM: stick-table: do not take a lock to update t->current anymore.
We don't need to be protected by the table's lock when touching t->current
if we do it using atomics, and that's great because it allows us to have
a cleaner stksess_new() that doesn't require a lock either, and to avoid
manipulating pools under a lock.
That's another 1% performance gain from 2.07 to 2.10M req/s under 48
threads.
Willy Tarreau [Tue, 11 Oct 2022 13:22:42 +0000 (15:22 +0200)]
MEDIUM: stick-table: make stktable_get_entry() look up under a read lock
On a 24-core machine doing lots of track-sc, it was found that the lock
in stktable_get_entry() was responsible for 25% of the CPU alone. It's
sad because most of its job is to protect the table during the lookup.
Here we're taking a slightly different approach: the lock is first taken
for reads during the lookup, and only in case of failure we switch it for
a write lock. We don't even perform an upgrade here since an allocation
is needed between the two, it would be wasted to do it under the lock,
and is generally not a good idea, so better release the read lock and
try again.
Here the performance under 48 threads with 3 trackers on the same table
jumped from 455k to 2.07M, or 4.55x! Note that the same approach should
be possible for stktable_set_entry().
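A hedged sketch of that lookup-then-create pattern using plain pthread rwlocks
(the lookup/insert/alloc helpers below are assumed placeholders, not haproxy
functions):

    #include <pthread.h>
    #include <stdlib.h>

    struct table_sketch {
        pthread_rwlock_t lock;
        /* ... tree of entries ... */
    };

    extern void *lookup_entry(struct table_sketch *t, const void *key); /* assumed */
    extern void *insert_entry(struct table_sketch *t, void *newe);      /* assumed:
                                          returns the entry actually present */
    extern void *alloc_entry(const void *key);                          /* assumed */

    static void *get_entry(struct table_sketch *t, const void *key)
    {
        void *e, *newe;

        pthread_rwlock_rdlock(&t->lock);      /* cheap shared lookup first */
        e = lookup_entry(t, key);
        pthread_rwlock_unlock(&t->lock);
        if (e)
            return e;

        newe = alloc_entry(key);              /* allocate outside of any lock */
        pthread_rwlock_wrlock(&t->lock);
        e = insert_entry(t, newe);            /* may find a concurrent insert */
        pthread_rwlock_unlock(&t->lock);
        if (e != newe)
            free(newe);                       /* lost the race: drop our copy */
        return e;
    }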
Willy Tarreau [Tue, 11 Oct 2022 13:42:54 +0000 (15:42 +0200)]
MEDIUM: stick-table: switch to rdlock in stktable_lookup() and lookup_key()
These functions do not modify anything in the table except the refcount
on success. Let's just lock the table for shared accesses and make use of
atomic ops to update the refcount. This brings a nice gain from 425k to
455k under 48 threads (7%), but some contention remains on the exclusive
locks in other parts.
Note that the refcount continues to be updated under the lock because it's
not yet certain whether there are races between it and some of the exclusive
lock on the table. The difference is marginal and we prefer to stay on the
safe side for now.
Willy Tarreau [Tue, 11 Oct 2022 13:13:46 +0000 (15:13 +0200)]
MEDIUM: stick-table: free newly allocated stkess if it couldn't be inserted
In __stktable_get_entry() now we're planning for the possibility that the
call to __stktable_store() doesn't add the newly allocated entry and instead
finds a previously inserted one. At the moment this doesn't exist because
the lookup + insert passes are made under the same lock. But it will soon
change.
Willy Tarreau [Tue, 11 Oct 2022 13:09:46 +0000 (15:09 +0200)]
MEDIUM: stick-table: return inserted entry in __stktable_store()
This function is used to create an entry in the table. But it doesn't
consider the possibility that the entry already exists, because right
now it's only called in situations where it was verified under a lock
that it doesn't exist. Since we'll soon need to break that assumption
we need it to verify that the requested entry was added and to return
a pointer to the one in the tree so that the caller can detect any
possible conflict. At the moment this is not used.