OpenSSL versions greater than 3.0 no longer use the same API when
manipulating EVP_PKEY structures: the EC_KEY API is deprecated and it is
no longer possible to get an EC_GROUP and simply call
EC_GROUP_get_curve_name().
Instead, one must call EVP_PKEY_get_utf8_string_param() with the
OSSL_PKEY_PARAM_GROUP_NAME parameter, but this returns a SECG curve
name, whereas previous versions returned a NIST curve name
(e.g. secp384r1 vs P-384).
This patch adds 2 functions:
- the first one looks up a curve name and converts it to an OpenSSL
NID.
- the second one converts a NID to a NIST curve name.
The list only contains P-256, P-384 and P-521 for now; it could be
extended in the future with more curves.
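For illustration, such a mapping table can be sketched as follows (the
function and table names here are illustrative, not necessarily the ones
from the patch):

    #include <openssl/obj_mac.h>
    #include <string.h>

    static const struct { const char *ossl; const char *nist; int nid; } curves[] = {
        { "prime256v1", "P-256", NID_X9_62_prime256v1 },
        { "secp384r1",  "P-384", NID_secp384r1        },
        { "secp521r1",  "P-521", NID_secp521r1        },
        { NULL,         NULL,    NID_undef            }
    };

    /* accepts either spelling of a curve name and returns the OpenSSL NID */
    static int curve_name_to_nid(const char *name)
    {
        for (int i = 0; curves[i].ossl; i++)
            if (strcmp(name, curves[i].ossl) == 0 ||
                strcmp(name, curves[i].nist) == 0)
                return curves[i].nid;
        return NID_undef;
    }

    /* converts a NID back to its NIST curve name, or NULL if unknown */
    static const char *nid_to_nist_name(int nid)
    {
        for (int i = 0; curves[i].ossl; i++)
            if (curves[i].nid == nid)
                return curves[i].nist;
        return NULL;
    }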
DEBUG: init: add a way to register functions for unit tests
Doing unit tests with haproxy was always a bit difficult, as some of the
functions you want to test depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.
This patch adds a way to register a function that can be called with the
"-U" parameter on the command line; it will be executed just after
step_init_1() and will exit the process with its return value as an exit
code.
When using the -U option, every keyword after this option is passed to
the callback and can be used as a parameter, making it possible to
handle complex arguments if required by the test.
HAProxy needs to be built with DEBUG_UNIT to activate this feature.
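For illustration, a test registered this way could look like the sketch
below (REGISTER_UNITTEST and its signature are assumptions made for the
sake of the example):

    #include <stdio.h>

    #ifdef DEBUG_UNIT
    /* argv[] holds the keywords that follow -U on the command line */
    static int unittest_example(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "missing argument\n");
            return 1;    /* becomes the process exit code */
        }
        printf("testing with '%s'\n", argv[1]);
        return 0;        /* success */
    }
    REGISTER_UNITTEST("example", unittest_example); /* haproxy -U example foo */
    #endif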
Willy Tarreau [Mon, 3 Mar 2025 02:58:46 +0000 (03:58 +0100)]
BUG/MINOR: server: check for either proxy-protocol v1 or v2 to send header
As reported in issue #2882, using "no-send-proxy-v2" on a server line does
not properly disable the use of proxy-protocol if it was enabled in a
default-server directive in combination with other PP options. The reason
for this is that the sending of a proxy header is determined by a test on
srv->pp_opts without any distinction, so disabling PPv2 while leaving other
options results in a PPv1 header being sent.
Let's fix this by explicitly testing for the presence of either send-proxy
or send-proxy-v2 when deciding to send a proxy header.
This can be backported to all versions. Thanks to Andre Sencioles (@asenci)
for reporting the issue and testing the fix.
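For illustration, the decision now boils down to something like this
sketch (using the existing SRV_PP_V1/SRV_PP_V2 option flags; surrounding
names are simplified):

    /* send a PROXY header only if one of the two versions is enabled,
     * instead of testing srv->pp_opts as a whole
     */
    if (srv->pp_opts & (SRV_PP_V1 | SRV_PP_V2))
        ret = make_proxy_line(trash.area, trash.size, srv, conn, strm);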
Amaury Denoyelle [Thu, 27 Feb 2025 17:07:17 +0000 (18:07 +0100)]
BUG/MINOR: hq-interop: fix leak in case of rcv_buf early return
HTTP/0.9 parser was recently updated to support truncated requests in
the rcv_buf operation. However, this caused a leak as the input buffer is
allocated early.
In fact, the leak was already present in case of fatal errors. Fix this
by first delaying the buffer allocation, so that the initial checks are
performed before it. Then, ensure that the buffer is released in case of
a later error.
This is considered minor, as HTTP/0.9 is reserved for experimentation and
QUIC interop usage.
Willy Tarreau [Fri, 28 Feb 2025 12:35:13 +0000 (13:35 +0100)]
MINOR: h1: permit to relax the websocket checks for missing mandatory headers
At least one user would like to allow a standards-violating client to set
up WebSocket connections through haproxy to a standards-violating server that
accepts them. While this should of course never be done over the internet,
it can make sense in the datacenter between application components which do
not need to mask the data, so this typically falls into the situation of
what the "accept-unsafe-violations-in-http-request" option and the
"accept-unsafe-violations-in-http-response" option are made for.
See GH #2876 for more context.
This patch relaxes the test on the "Sec-Websocket-Key" header field in
the request, and on the "Sec-Websocket-Accept" header in the response,
when these respective options are set.
The doc was updated to reference this addition. This may be backported
to 3.1 but preferably not further.
BUG/MEDIUM: mux-fcgi: Try to fully fill demux buffer on receive if not empty
Don't reserve space for the HTX overhead on receive if the demux buffer is
not empty. Otherwise, the demux buffer may be erroneously reported as full
and this may block records processing. Because of this bug, a ping-pong
loop between data reception and demux processing can be observed until the
timeout.
This bug was introduced by the commit 5f927f603 ("BUG/MEDIUM: mux-fcgi:
Properly handle read0 on partial records"). To fix the issue, if the demux
buffer is not empty when we try to receive more data, all free space in the
buffer can now be used. However, if the demux buffer is empty, we still try
to keep it aligned with the HTX.
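For illustration, the receive-size decision can be sketched like this
(buffer helpers from haproxy's buffer API; surrounding names are
simplified):

    size_t max;

    if (b_data(&fconn->dbuf))
        max = b_room(&fconn->dbuf);                /* not empty: use all free space */
    else
        max = buf_room_for_htx_data(&fconn->dbuf); /* empty: keep HTX alignment */
    ret = conn->xprt->rcv_buf(conn, conn->xprt_ctx, &fconn->dbuf, max, 0);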
Extends the HTTP/0.9 layer to be able to deal with incomplete requests.
Instead of an error, 0 is returned, thus avoiding a stream closure.
QUIC-MUX may retry rcv_buf operation later if more data is received,
similarly to HTTP/3 layer.
Note that HTTP/0.9 is only used for testing and interop purposes. As
such, this limitation is not considered a bug. It is probably not
worth backporting.
Amaury Denoyelle [Thu, 27 Feb 2025 10:28:07 +0000 (11:28 +0100)]
CLEANUP: h3: fix documentation of h3_rcv_buf()
Return value of h3_rcv_buf() is incorrectly documented. Indeed, it may
return a positive value to indicate that input bytes were converted into
HTX. This is especially important, as the caller uses this value to consume
the reported data amount in the QCS Rx buffer.
This should be backported up to 2.6. Note that on 2.8, h3_rcv_buf() was
named h3_decode_qcs().
Amaury Denoyelle [Thu, 27 Feb 2025 15:38:39 +0000 (16:38 +0100)]
BUG/MINOR: h3: do not report transfer as aborted on preemptive response
HTTP/3 specification allows a server to emit the entire response even if
only a partial request was received. In particular, this happens when
request STREAM FIN is delayed and transmitted in an empty payload frame.
In this case, qcc_abort_stream_read() was used by HTTP/3 layer to emit a
STOP_SENDING. Remaining received data were not transmitted to the stream
layer as they were simply discarded. However, this prevented FIN
transmission to the stream layer. This causes the transfer to be
considered as prematurely closed, resulting in a cL-- log line status.
This is misleading to users, who could interpret it as if the response
was not sent.
To fix this, disable STOP_SENDING emission on full preemptive response
emission. The Rx channel is kept open until the client closes it with
either a FIN or a RESET_STREAM. This ensures that the FIN signal can be
relayed to the stream layer, which allows the transfer to be reported as
completed.
The PROXY v2 TLVs were not properly initialized when defined with
"set-proxy-v2-tlv-fmt" keyword, which could have caused a crash when
validating the configuration or malfunction (e.g. when used in
combination with "server-template" and/or "default-server").
The issue was introduced with commit 6f4bfed3a ("MINOR: server: Add
parser support for set-proxy-v2-tlv-fmt").
Maxconn is a bit of a misnomer when it comes to servers, as it doesn't
control the maximum number of connections we establish to a server, but
the maximum number of simultaneous requests. So add "strict-maxconn",
which ensures that we will never establish more connections than
maxconn.
It extends the meaning of the "restricted" setting of
tune.takeover-other-tg-connections, as it will also attempt to get idle
connections from other thread groups if strict-maxconn is set.
Olivier Houchard [Thu, 30 Jan 2025 10:16:40 +0000 (11:16 +0100)]
MEDIUM: connections: Allow taking over connections from other tgroups.
Allow haproxy to take over idle connections from other thread groups
than our own. To control that, add a new tunable,
tune.takeover-other-tg-connections. It can have 3 values, "none", where
we won't attempt to get connections from the other thread group (the
default), "restricted", where we only will try to get idle connections
from other thread groups when we're using reverse HTTP, and "full",
where we always try to get connections from other thread groups.
Unless there is a special need, it is advised to use "none" (or
restricted if we're using reverse HTTP) as using connections from other
thread groups may have a performance impact.
Olivier Houchard [Tue, 25 Feb 2025 17:28:45 +0000 (18:28 +0100)]
MEDIUM: pollers: Drop fd events after a takeover to another tgid.
In pollers that support it, provide the generation number in addition to
the fd and, when an event happens, if the generation number is the
same but the tgid changed, then assume the fd was taken over by a
thread from another thread group, and just delete the event from the
current thread's poller, as we no longer want to hear about it.
Olivier Houchard [Tue, 25 Feb 2025 17:26:49 +0000 (18:26 +0100)]
MINOR: pollers: Add a fixup_tgid_takeover() method.
Add a fixup_tgid_takeover() method to pollers for which it makes sense
(epoll, kqueue and evport). That method can be called after a takeover
of a fd from a different thread group, to make sure the poller's
internal structure reflects the new state.
Olivier Houchard [Tue, 25 Feb 2025 11:01:40 +0000 (12:01 +0100)]
MEDIUM: epoll: Make sure we can add a new event
Check that the call to epoll_ctl() succeeds, and if it does not, if
we're adding a new event and it fails with EEXIST, then delete and
re-add the event. There are a few cases where we may already have events
for a fd. If epoll_ctl() fails for any reason, use BUG_ON to make sure
we immediately crash, as this should not happen.
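For illustration, the resulting logic is close to this sketch (the
epoll_fd name is simplified):

    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        if (errno == EEXIST) {
            /* a stale event already exists for this fd: drop it and re-add */
            BUG_ON(epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, &ev) == -1);
            BUG_ON(epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == -1);
        }
        else
            BUG_ON(1); /* any other failure must never happen */
    }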
Willy Tarreau [Tue, 25 Feb 2025 07:42:45 +0000 (08:42 +0100)]
OPTIM: connection: don't try to kill other threads' connection when !shared
Users may have good reasons for using "tune.idle-pool.shared off", one of
them being the cost of moving cache lines between cores, or the kernel-
side locking associated with moving FDs. For this reason, when getting
close to the file descriptors limits, we must not try to kill adjacent
threads' FDs when the sharing of pools is disabled. This is extremely
expensive and kills the performance. We must limit ourselves to our local
FDs only. In such cases, it's up to the users to configure a large enough
maxconn for their usages.
Before this patch, perf top reported 9% CPU usage in connect_server()
on the trylock used to kill connections when running at 4800 conns for
a global maxconn of 6400 on a 128-thread server. Now it doesn't spend
its time there anymore, and performance has increased by 12%. Note,
it was verified that disabling the locks in such a case has no effect
at all, so better keep them and stay safe.
Willy Tarreau [Mon, 24 Feb 2025 10:43:15 +0000 (11:43 +0100)]
BUG/MEDIUM: stream: don't use localtime in dumps from a signal handler
In issue #2861, Jarosław Rzeszótko reported another issue with
"show threads", this time in relation to the conversion of a stream's
accept date to local time. Indeed, if the libc was interrupted in this
same function, it could have been interrupted with a lock held, and then
it's no longer possible to dump the date, so we face a deadlock.
This is easy to reproduce with logging enabled.
Let's detect that we come from a signal handler and not try to resolve
the time to localtime in this case.
Willy Tarreau [Mon, 24 Feb 2025 12:37:52 +0000 (13:37 +0100)]
MINOR: tinfo: split the signal handler report flags into 3
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
Willy Tarreau [Mon, 24 Feb 2025 08:17:22 +0000 (09:17 +0100)]
BUG/MINOR: h2: always trim leading and trailing LWS in header values
Annika Wickert reported some occasional disconnections between haproxy
and varnish when communicating over HTTP/2, with varnish complaining
about protocol errors while captures looked apparently normal. Nils
Goroll managed to reproduce this on varnish by injecting the capture of
the outgoing haproxy traffic and noticed that haproxy was forwarding a
header value containing a trailing space, which is now explicitly
forbidden since RFC9113.
It turns out that the only way for such a header to pass through haproxy
is to arrive in h2 and not be edited, in which case it will arrive in
HTX with its undesired spaces. Since the code dealing with HTX headers
always trims spaces around them, these are not observable in dumps, but
only when started in debug mode (-d). Conversions to/from h1 also drop
the spaces.
With this patch we trim LWS both on input and on output. This way we
always present clean headers in the whole stack, and even if some are
manually crafted by the configuration or Lua, they will be trimmed on
the output.
This must be backported to all stable versions.
Thanks to Annika for the helpful capture and Nils for the help with
the analysis on the varnish side!
This is the introduction of "minsize-req" and "minsize-res".
These two options allow you to set the minimum payload size required for
compression to be applied.
This helps save CPU on both server and client sides when the payload does
not need to be compressed.
Willy Tarreau [Fri, 21 Feb 2025 10:22:57 +0000 (11:22 +0100)]
CLEANUP: task: move the barrier after clearing th_ctx->current
There's a barrier after releasing the current task in the scheduler.
However it's improperly placed: it's done after pool_free() while in
fact it must be done immediately after resetting the current pointer.
Indeed, the purpose is to make sure that nobody sees the task as valid
when it's in the process of being released. This is something that
could theoretically happen if interrupted by a signal in the inlined
code of pool_free() if the compiler decided to postpone the write to
->current. In practice since nothing fancy is done in the inlined part
of the function, there's currently no risk of reordering. But it could
happen if the underlying __pool_free() were to be inlined for example,
and in this case we could possibly observe th_ctx->current pointing
to something currently being destroyed.
With the barrier between the two, there's no risk anymore.
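For illustration, the corrected ordering looks like this sketch:

    t = th_ctx->current;
    th_ctx->current = NULL;
    __ha_barrier_store();         /* publish the reset before any free */
    pool_free(pool_head_task, t); /* partially inlined, must not be reordered */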
Willy Tarreau [Fri, 21 Feb 2025 14:01:13 +0000 (15:01 +0100)]
MINOR: tools: use only opportunistic symbols resolution
As seen in issue #2861, dladdr_and_size() can be quite expensive and
will often hold a mutex in the underlying library. It becomes a real
problem when issuing lots of "show threads" or wdt warnings in parallel
because threads will queue up waiting for each other to finish, adding
to their existing latency that possibly caused the warning in the first
place.
Here we're taking a different approach. If the thread is not isolated
and not panicking, it's doing unimportant stuff like showing threads
or warnings. In this case we try to grab a lock, and if we fail because
another thread is already there, we just pretend we cannot resolve the
symbol. This is not critical because then we fall back to the already
used case which consists in writing "main+<offset>". In practice this
will almost never happen except in bad situations which could have
otherwise degenerated.
Willy Tarreau [Fri, 21 Feb 2025 17:23:23 +0000 (18:23 +0100)]
BUG/MEDIUM: stream: use non-blocking freq_ctr calls from the stream dumper
The stream dump function is called from signal handlers (warning, show
threads, panic). It makes use of read_freq_ctr() which might possibly
block if it tries to access a locked freq_ctr in the process of being
updated, e.g. by the current thread.
Here we're relying on the non-blocking API instead. It may return incorrect
values (typically smaller ones after resetting the curr counter) but at
least it will not block.
This needs to be backported to stable versions along with the previous
commit below:
MINOR: freq_ctr: provide non-blocking read functions
At least 3.1 is concerned as the warnings tend to increase the risk of
this situation appearing.
Willy Tarreau [Fri, 21 Feb 2025 17:21:56 +0000 (18:21 +0100)]
MINOR: freq_ctr: provide non-blocking read functions
Some code called by the debug handlers in the context of a signal handler
accesses some freq_ctr counters and occasionally ends up on one locked by
the same thread that is dumping it. Let's introduce a non-blocking version
that at least allows the caller to return even if the value is in the
process of being updated; it's less problematic than hanging.
Willy Tarreau [Fri, 21 Feb 2025 15:45:18 +0000 (16:45 +0100)]
BUG/MEDIUM: stream: never allocate connection addresses from signal handler
In __strm_dump_to_buffer(), we call conn_get_src()/conn_get_dst() to try
to retrieve the connection's IP addresses. But this function may be called
from a signal handler to dump a currently running stream, and if the
addresses were not allocated yet, a pool_alloc() will be performed while
we might possibly already be running pools code, resulting in pool list
corruption.
Let's just make sure we don't call these sensitive functions there when
called from a signal handler.
This must be backported at least to 3.1 and ideally all other versions,
along with this previous commit:
MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Willy Tarreau [Fri, 21 Feb 2025 15:26:24 +0000 (16:26 +0100)]
MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.
Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
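For illustration, a handler brackets its work with the new flag roughly
like this (the handler name is only an example):

    static void debug_handler(int sig, siginfo_t *si, void *arg)
    {
        _HA_ATOMIC_OR(&th_ctx->flags, TH_FL_IN_SIG_HANDLER);
        /* ... dump code: sensitive callees may now detect the context ... */
        _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_IN_SIG_HANDLER);
    }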
Willy Tarreau [Fri, 21 Feb 2025 10:31:36 +0000 (11:31 +0100)]
BUG/MINOR: mux-h1: always make sure h1s->sd exists in h1_dump_h1s_info()
This function may be called from a signal handler during a warning,
a panic or a show thread. We need to be more cautious about what may
or may not be dereferenced since an h1s is not necessarily fully
initialized. Loops of "show threads" sometimes manage to crash when
dereferencing a null h1s->sd, so let's guard it and add a comment
reminding about the unusual call place.
Willy Tarreau [Fri, 21 Feb 2025 16:18:00 +0000 (17:18 +0100)]
BUG/MINOR: stream: do not call co_data() from __strm_dump_to_buffer()
co_data() was instrumented to detect cases where c->output > data and
emits a warning if that's not correct. The problem is that it happens
quite a bit during "show threads" if it interrupts traffic anywhere,
and that in some environments building with -DDEBUG_STRICT_ACTION=3,
it will kill the process.
Let's just open-code the channel functions that rely on co_data();
there are not that many and the operations remain very simple.
This can be backported to 3.1. It didn't trigger in earlier versions
because they didn't have this CHECK_IF_HOT() test.
MINOR: clock: always use atomic ops for global_now_ms
global_now_ms is shared between threads so we must give a hint to the
compiler that read/write operations should be performed atomically.
Everywhere global_now_ms was used, atomic ops were used, except in
clock_update_global_date() where a read was performed without using an
atomic op. In practice it is not an issue because on most systems
such reads are atomic already, but to prevent any confusion or
potential bug on exotic systems, let's use an explicit _HA_ATOMIC_LOAD
there.
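I.e. the remaining plain read becomes an explicit atomic load, along
these lines:

    /* before: old_now_ms = global_now_ms; */
    uint old_now_ms = _HA_ATOMIC_LOAD(&global_now_ms);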
BUG/MINOR: sink: add tempo between 2 connection attempts for sft servers
When the connection for sink_forward_{oc}_applet fails or a previous one
is destroyed, the sft->appctx is instantly released.
However process_sink_forward_task(), which may run at any time, iterates
over all known sfts and tries to create sessions for orphan ones.
It means that instantly after sft->appctx is destroyed, a new one will
be created, thus a new connection attempt will be made.
It can be an issue with tcp log-servers or sink servers, because if the
server is unavailable, process_sink_forward() will keep looping without
any temporisation until the applet survives (ie: the connection succeeds),
which results in unexpected CPU usage on the threads responsible for
that task.
Instead, we add a tempo logic so that a delay of 1 second is applied
between two retries. Of course the initial attempt is not delayed.
While reviewing the code in an attempt to fix GH #2875, I stumbled
on another case similar to aac570c ("BUG/MEDIUM: uxst: fix outgoing
abns address family in connect()") that caused abns(z) addresses to
fail when used as log targets.
The underlying cause is the same as aac570c, which is the rework of the
unix socket families in order to support custom addresses for different
addressing schemes, where a real_family() call was overlooked before passing
a haproxy-internal address struct to socket-oriented syscall.
To fix the issue, we first copy the target's addr, and then leverage
real_family() to set the proper low-level address family that is passed
to sendmsg() syscall.
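For illustration, the fix follows this pattern (real_family() and
get_addr_len() are existing helpers; the surrounding names are
simplified):

    struct sockaddr_storage addr = *target_addr; /* local copy */

    addr.ss_family  = real_family(target_addr->ss_family);
    msg.msg_name    = &addr;
    msg.msg_namelen = get_addr_len(&addr);
    sendmsg(fd, &msg, MSG_DONTWAIT | MSG_NOSIGNAL);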
It was proved in GH #2875 that the regtest was broken, at least for the
server-side abnsz, as the connect() was not performed using the proper
family, which resulted in the kernel refusing to perform the call, while
the reg-test actually succeeded.
Indeed, in the test we used vtest client to connect to haproxy, which
then routed the request to another haproxy instance listening on an
abnsz socket, and this last haproxy was the one to answer the http
request.
As we only used "rxresp" in vtest client, the test succeeded with empty
responses, which was the case due to the server connection failing on the
first haproxy process.
Willy Tarreau [Fri, 21 Feb 2025 06:53:21 +0000 (07:53 +0100)]
BUG/MEDIUM: uxst: fix outgoing abns address family in connect()
Since we reworked the unix socket families in order to support custom
addresses for different addressing schemes, we've been using extra
values for the ss_family field in sockaddr_storage. These ones have
to be adjusted before calling bind() or connect(). It turns out that
after the abns/abnsz updates in 3.1, the connect() code was not adjusted
to take care of the change, resulting in AF_CUST_ABNS or AF_CUST_ABNSZ
to be placed in the address that was passed to connect().
The right approach is to locally copy the address, get its length,
fixup the family and use the fixed value and length for connect().
This must be backported to 3.1. Many thanks to @Mewp for reporting
this issue in github issue #2875.
BUG/MINOR: cfgparse: fix NULL ptr dereference in cfg_parse_peers
When "peers" keyword is followed by more than one argument and it's the first
"peers" section in the config, cfg_parse_peers() detects it and exits with
"ERR_ALERT|ERR_FATAL" err_code.
So, the upper layer parser, parse_cfg(), continues and parses the next
keyword, "peer", and then tries to check the global cfg_peers, which should
contain "my_cluster". The global cfg_peers is still NULL, because after
alerting the user in alertif_too_many_args(), cfg_parse_peers() exited.
In order to fix this, let's add ERR_ABORT if the "peers" keyword is
followed by more than one argument. This way parse_cfg() stops immediately
and terminates haproxy with the "too many args for peers my_cluster..."
alert message. It's more reliable than adding "if (cfg_peers != NULL)"
checks in the "peer" subparser, as we may have many "peers" sections.
Without this, for the example above, parse_cfg() would parse the whole
configuration until the end and only then terminate haproxy with the
"too many args..." alert, and peer haproxy1 would be wrongly associated
with my_another_cluster.
This fixes the issue #2872.
This should be backported in all stable versions.
BUG/MEDIUM: spoe/mux-spop: Introduce an NOOP action to deal with empty ACK
In the SPOP protocol, ACK frame with empty payload are allowed. However, in
that case, because only the payload is transferred, there is no data to
return to the SPOE applet. Only the end of input is reported. Thus the
applet is never woken up. It means that the SPOE filter will be blocked
during the processing timeout and will finally return an error.
To work around this issue, a NOOP action is introduced with the value 0. It
is only an internal action for now; it does not exist in the SPOP
protocol. When an ACK frame with an empty payload is received, this noop
action is transferred to the SPOE applet, instead of nothing. Thanks to this
trick, the applet is properly notified. This works because unknown actions
are ignored by the SPOE filter.
BUG/MEDIUM: applet: Don't handle EOI/EOS/ERROR if applet is waiting for room
The commit 7214dcd52 ("BUG/MEDIUM: applet: Don't pretend to have more data
to handle EOI/EOS/ERROR") introduced a regression. Because of this patch, it
was possible to handle EOI/EOS/ERROR applet flags too early while the applet
was waiting for more room to transfer the last output data.
This bug can be encountered with any applet using its own buffers (cache and
stats for instance). And depending on the configuration and the timing, the
data may be truncated or the stream may be blocked, infinitely or not.
Streams blocked infinitely were observed with the cache applet and the HTTP
compression enabled.
For the record, it is important to detect EOI/EOS/ERROR applet flags to be
able to report the corresponding event on the SE and by transitivity on the
SC. Most of the time, this happens when some data should be transferred to the
stream. The .rcv_buf callback function is called and these flags are
properly handled. However, some applets may also report them spontaneously,
outside of any data transfer. In that case, the .rcv_buf callback is not
called.
That is the purpose of this patch (and the one above): being able to detect
pending EOI/EOS/ERROR applet flags. However, we must be sure not to handle
them too early at this place. When these flags are set, it means no more
data will be produced by the applet. So we must only wait to have
transferred everything to the stream. And this happens when the applet is no
longer waiting for more room.
This patch must be backported to 3.1 with the one above.
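For illustration, the condition can be sketched with the applet flags
(names as used by applets relying on their own buffers):

    if ((appctx->flags & (APPCTX_FL_EOI|APPCTX_FL_EOS|APPCTX_FL_ERROR)) &&
        !(appctx->flags & APPCTX_FL_OUTBLK_FULL)) {
        /* everything was pushed to the stream: the corresponding
         * SE_FL_* events may now be reported on the stream-endpoint
         */
    }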
Willy Tarreau [Wed, 19 Feb 2025 17:39:51 +0000 (18:39 +0100)]
[RELEASE] Released version 3.2-dev6
Released version 3.2-dev6 with the following main changes :
- BUG/MEDIUM: debug: close a possible race between thread dump and panic()
- DEBUG: thread: report the spin lock counters as seek locks
- DEBUG: thread: make lock time computation more consistent
- DEBUG: thread: report the wait time buckets for lock classes
- DEBUG: thread: don't keep the redundant _locked counter
- DEBUG: thread: make lock_stat per operation instead of for all operations
- DEBUG: thread: reduce the struct lock_stat to store only 30 buckets
- MINOR: lbprm: add a new callback ->server_requeue to the lbprm
- MEDIUM: server: allocate a tasklet for asynchronous requeuing
- MAJOR: leastconn: postpone the server's repositioning under contention
- BUG/MINOR: quic: reserve length field for long header encoding
- BUG/MINOR: quic: fix CRYPTO payload size calculation for encoding
- MINOR: quic: simplify length calculation for STREAM/CRYPTO frames
- BUG/MINOR: mworker: section ignored in discovery after a post_section_parser
- BUG/MINOR: mworker: post_section_parser for the last section in discovery
- CLEANUP: mworker: "program" section does not have a post_section_parser anymore
- MEDIUM: initcall: allow to register multiple post_section_parser per section
- CI: cirrus-ci: bump FreeBSD image to 14-2
- DOC: initcall: name correctly REGISTER_CONFIG_POST_SECTION()
- REGTESTS: stop using truncated.vtc on freebsd
- MINOR: quic: refactor STREAM encoding and splitting
- MINOR: quic: refactor CRYPTO encoding and splitting
- BUG/MEDIUM: fd: mark FD transferred to another process as FD_CLONED
- BUG/MINOR: ssl/cli: "show ssl crt-list" lacks client-sigalgs
- BUG/MINOR: ssl/cli: "show ssl crt-list" lacks sigalgs
- MINOR: ssl/cli: display more filenames in 'show ssl cert'
- DOC: watchdog: document the sequence of the watchdog and panic
- MINOR: ssl: store the filenames resulting from a lookup in ckch_conf
- MINOR: startup: allow hap_register_feature() to enable a feature in the list
- MINOR: quic: support frame type as a varint
- BUG/MINOR: startup: leave at first post_section_parser which fails
- BUG/MINOR: startup: hap_register_feature() fix for partial feature name
- BUG/MEDIUM: cli: Be sure to drop all input data in END state
- BUG/MINOR: cli: Wait for the last ACK when FDs are xferred from the old worker
- BUG/MEDIUM: filters: Handle filters registered on data with no payload callback
- BUG/MINOR: fcgi: Don't set the status to 302 if it is already set
- MINOR: ssl/crtlist: split the ckch_conf loading from the crtlist line parsing
- MINOR: ssl/crtlist: handle crt_path == cc->crt in crtlist_load_crt()
- MINOR: ssl/ckch: return from ckch_conf_clean() when conf is NULL
- MEDIUM: ssl/crtlist: "crt" keyword in frontend
- DOC: configuration: document the "crt" frontend keyword
- DEV: h2: add a Lua-based HTTP/2 connection tracer
- BUG/MINOR: quic: prevent crash on conn access after MUX init failure
- BUG/MINOR: mux-quic: prevent crash after MUX init failure
- DEV: h2: fix flags for the continuation frame
- REGTESTS: Fix truncated.vtc to send 0-CRLF
- BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut
- Revert "REGTESTS: stop using truncated.vtc on freebsd"
- MINOR: mux-quic: define a QCC application state member
- MINOR: mux-quic/h3: emit SETTINGS via MUX tasklet handler
- MINOR: mux-quic/h3: support temporary blocking on control stream sending
Amaury Denoyelle [Tue, 18 Feb 2025 15:14:19 +0000 (16:14 +0100)]
MINOR: mux-quic/h3: support temporary blocking on control stream sending
When the HTTP/3 layer is initialized via the QUIC MUX, it first emits a
SETTINGS frame on a unidirectional control stream. However, this could be
prevented if the client did not provide initial flow control.
Previously, QUIC MUX was unable to deal with such a situation. Thus, the
connection was closed immediately and no transfer could occur. Improve
this by extending the QUIC MUX application layer API: initialization may
now return a transient error. This allows the MUX to continue to use the
connection normally. Initialization will be retried periodically later
until it can succeed.
This new API makes it possible to deal with the flow control issue
described above. Note that this patch is not considered a bug fix.
Indeed, clients are strongly advised to provide enough flow control for
a SETTINGS frame exchange.
Amaury Denoyelle [Mon, 17 Feb 2025 15:00:03 +0000 (16:00 +0100)]
MINOR: mux-quic/h3: emit SETTINGS via MUX tasklet handler
Previously, the QUIC MUX application layer was installed and initialized
via MUX init. However, the latter stage involves I/O operations, for
example when using HTTP/3 with the emission of a SETTINGS frame.
Change this to prevent any I/O operations during MUX init. As such, the
finalize app_ops callback is now called during the first invocation of
qcc_io_send(), in the context of the MUX tasklet. To implement this, a new
application state value is added, to detect the transition from NULL to
INIT stage.
Amaury Denoyelle [Tue, 18 Feb 2025 10:42:59 +0000 (11:42 +0100)]
MINOR: mux-quic: define a QCC application state member
Introduce a new QCC field to track the current application layer state.
For the moment, only INIT and SHUT states are defined. This allows to
replace the older flag QC_CF_APP_SHUT.
This commit does not bring major changes. It is only necessary to permit
future evolutions on QUIC MUX. The only noticeable change is that QMUX
traces can now display this new field.
Thanks to the previous fixes ("REGTESTS: Fix truncated.vtc to send 0-CRLF" and
"BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut"),
this script can be reenabled for FreeBSD.
BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut
On shut, truncated HTX messages were not properly handled by the H2
multiplexer. Depending on how data were emitted, a chunked HTX message
without the 0-CRLF could be considered as full and an empty data with ES
flag set could be emitted instead of a RST_STREAM(CANCEL) frame.
In the H2 multiplexer, when a shut is performed, an HTX message is
considered as truncated if more HTX data are still expected. It is based on
the presence or not of the H2_SF_MORE_HTX_DATA flag on the H2 stream.
However, this flag is set or unset depending on the HTX extra field
value. This field is used to state how much data that must still be
transferred, based on the announced data length. For a message with a
content-length, this assumption is valid. But for a chunked message, it is
not true. Only the length of the current chunk is announced. So we cannot
rely on this field in that case to know if a message is full or not.
Instead, we must rely on the HTX start-line flags to know if more HTX data
are expected or not. If the xfer length is known (the HTX_SL_F_XFER_LEN flag
is set on the HTX start-line), it means that more data are always expected,
until the end of message is reached (the HTX_FL_EOM flag is set on the HTX
message). This is true for bodyless messages because the end of message is
reported with the end of headers. This is also true for tunneled messages
because the end of message is received before switching the H2 stream in
tunnel mode.
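For illustration, the truncation test at shut time now boils down to
this sketch:

    /* more HTX data are expected as long as the transfer length is known
     * and the end of message was not reached yet
     */
    if ((sl->flags & HTX_SL_F_XFER_LEN) && !(htx->flags & HTX_FL_EOM))
        h2s->flags |= H2_SF_MORE_HTX_DATA; /* shut => truncated => RST_STREAM(CANCEL) */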
Amaury Denoyelle [Mon, 17 Feb 2025 09:54:41 +0000 (10:54 +0100)]
BUG/MINOR: mux-quic: prevent crash after MUX init failure
qmux_init() may fail for several reasons. In this case, connection
resources are freed and a CONNECTION_CLOSE will be emitted via the
underlying quic_conn instance.
In case of qmux_init() failure, qcc_release() is used to clean up
resources, but the QCC <conn> member is first reset to NULL, as the
connection release must be delayed. Some cleanup operations are thus
skipped, one of them being the resetting of the <ctx> connection member to
NULL. This may cause a crash as <ctx> is a dangling pointer after QCC
release. One possible reproducer is to activate QMUX traces, which
will cause a segfault on the qmux_init() error leave trace.
To fix this, simply reset <ctx> to NULL manually on qmux_init() failure.
Amaury Denoyelle [Mon, 17 Feb 2025 16:15:49 +0000 (17:15 +0100)]
BUG/MINOR: quic: prevent crash on conn access after MUX init failure
Initially, QUIC-MUX was responsible for resetting the quic_conn <conn>
member to NULL when the MUX was released. This was performed via
qcc_release().
However, qcc_release() is also used on qmux_init() failure. In this
case, the connection must be freed via its session, so the QCC <conn>
member is reset to NULL prior to qcc_release(), which prevents the
quic_conn <conn> member from also being reset. As the connection is freed
soon after, quic_conn <conn> is a dangling pointer, which may cause crashes.
This bug should be very rare as first it implies that QUIC-MUX
initialization has failed (for example due to a memory alloc error).
Also, the <conn> member is rarely used by the quic_conn instance. In fact,
the only reproducible crash occurred with QUIC traces activated, as in this
case the connection is accessed via quic_conn under the __trace_enabled()
function.
To fix this, detach connection from quic_conn via the XPRT layer instead
of the MUX. More precisely, this is performed via quic_close(). This
should ensure that it will always be conducted, either on normal
connection closure, but also after special conditions such as MUX init
failure.
The commented "hex" argument will also display full frames in hex (not
recommended). The connections are prefixed with a 3-hex digit number in
order to also support a bit of multiplexing without impacting the reading
too much. The screen is split in two, with the request on the left and
the response on the right. Here's an example of what it does between an
haproxy backend and an haproxy frontend both in H2, when submitted a
curl request for /?s=30k handled by httpterm:
It deserves some improvements. For instance, at the moment it does not
verify the preface; any 24 bytes will work. It does not perform any
protocol validation either. Detecting some issues such as out-of-sequence
frames could be helpful. But it already helps as-is.
This patch implements the "crt" keywords in frontend, declaring an
implicit crt-list named after the frontend.
The patch is split in two steps:
The first step is the crt keyword parser, which parses crt lines and
fills a "cfg_crt_node" struct containing an ssl_bind_conf and a
ckch_conf, which are put in a list to be used later.
After parsing the frontend section, as a 2nd step, a
post_section_parser is called; it will create a crt-list named after
the frontend and fill it with certificates from the list of
cfg_crt_node entries. Once created, this crt-list will be loaded in every
"ssl" bind line that didn't declare any crt or crt-list.
MINOR: ssl/crtlist: split the ckch_conf loading from the crtlist line parsing
ckch_conf loading is not that simple as it requires checking:
- if the cert already exists in the ckchs_tree
- if the ckch_conf is compatible with an existing cert in ckchs_tree
- if the cert is a bundle which needs to load multiple ckch_store
This logic could be reused elsewhere, so this commit introduces the new
crtlist_load_crt() function which does that.
BUG/MINOR: fcgi: Don't set the status to 302 if it is already set
When a "Location" header was found in a FCGI response, the status code was
forced to 302. But it should only be performed if no status code was set
first.
So now, we take care to not override an already defined status code when the
"Location" header is found.
This patch should fix the issue #2865. It must be backported to all stable
versions.
BUG/MEDIUM: filters: Handle filters registered on data with no payload callback
An HTTP filter with no http_payload callback function may be registered on
data. In that case, this filter is obviously not called when some data are
received, but it remains important to update its internal state to be sure
to keep it synchronized with the stream, especially its offset value.
Otherwise,
the wrong calculation on the global offset may be performed in
flt_http_end(), leading to an integer overflow when data are moved from
input to output. This overflow triggers a BUG_ON() in c_adv().
The same is true for TCP filters with no tcp_payload callback function.
This patch must be backported to all stable versions.
BUG/MINOR: cli: Wait for the last ACK when FDs are xferred from the old worker
On reload, the new worker requests bound FDs from the old one. The old
worker sends them in messages of at most 252 FDs. Each message is
acknowledged by the new worker. All messages sent or received by the old
worker are handled manually via sendmsg/recv syscalls. So the old worker
must be sure to consume all the ACK replies. However, the last one was never
consumed. So it was considered as a command by the CLI applet. This issue
was hidden until recently. But it was the root cause of the issue #2862.
Note this last ack is also the first one when there are less than 252 FDs to
transfer.
This patch must be backported to all stable versions.
BUG/MEDIUM: cli: Be sure to drop all input data in END state
Commit 7214dcd ("BUG/MEDIUM: applet: Don't pretend to have more data to
handle EOI/EOS/ERROR") revealed a bug with the CLI applet. Pending input
data when the applet is in CLI_ST_END state were never consumed or dropped,
leading to a wakeup loop.
The CLI applet implements its own snd_buf callback function. It is important
it consumes all pending input data. Otherwise, the applet is woken up in a
loop until it empties the request buffer. Another way to fix the issue would
be to report an error. But in that case, it seems reasonable to drop these
data.
The issue can be observed on reload, in master/worker mode, because of the
issue with the last ACK message which was never consumed by the _getsocks()
command.
This patch should fix the issue #2862. It must be backported to 3.1 with the
commit above.
BUG/MINOR: startup: hap_register_feature() fix for partial feature name
In patch 2fe4cbd8e ("MINOR: startup: allow hap_register_feature() to
enable a feature in the list"), the ability to overwrite a '-' in the
feature list was added. However the code was not tokenizing the string
correctly, and a partial feature name found in the list could result in
having the same feature name multiple times.
This patch rewrites the lookup by tokenizing the string correctly.
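For illustration, a correctly tokenized lookup over the "+FOO -BAR ..."
feature string can be sketched as follows (the helper name is only an
example):

    #include <string.h>

    static char *lookup_feature(char *list, const char *name)
    {
        size_t len = strlen(name);
        char *p = list;

        while ((p = strstr(p, name)) != NULL) {
            /* a real token is preceded by '+'/'-' and ends the word */
            if (p > list && (p[-1] == '+' || p[-1] == '-') &&
                (p[len] == ' ' || p[len] == '\0'))
                return p;
            p += len; /* partial match only: keep searching */
        }
        return NULL;
    }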
Amaury Denoyelle [Wed, 12 Feb 2025 16:49:09 +0000 (17:49 +0100)]
MINOR: quic: support frame type as a varint
QUIC frame types are encoded as variable-length integers. Thus, a 64-bit
integer should be used to represent them. Currently this was not the case,
as the type was represented as a 1-byte char inside the quic_frame
structure. This does not cause any issue with QUIC from RFC 9000, as all
frame types fit in this range. Furthermore, a QUIC implementation is
required to use the smallest varint size when encoding a frame type.
However, the current code is unable to accept QUIC extensions with bigger
frame types. This is notably the case for quic-on-streams draft. Thus,
this commit readjusts quic_frame architecture to be able to support
higher frame type values.
First, the type field of quic_frame is changed to a 64-bit variable. Both
frame encoding and decoding functions use variable-length integer
helpers to manipulate the frame type field.
Secondly, the quic_frame builders/parsers infrastructure is still
preserved. However, it would be impossible to use a new large frame
type as an index into the quic_frame_builders / quic_frame_parsers arrays.
Thus, wrapper functions are now provided to access the builders and
parsers. Both qf_builder() and qf_parser() wrappers can then be extended
to return custom builder/parser instances for larger frame type.
Finally, unknown frame type detection also uses the new wrapper
quic_frame_is_known(). As with builders/parsers, for large frame type,
this function must be manually completed to support a new type value.
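For reference, variable-length integer decoding per RFC 9000 works as in
this self-contained sketch, where the two most significant bits of the
first byte give the encoded length:

    #include <stddef.h>
    #include <stdint.h>

    static int varint_decode(const unsigned char **buf, const unsigned char *end,
                             uint64_t *val)
    {
        size_t len;

        if (*buf >= end)
            return 0;
        len = (size_t)1 << (**buf >> 6); /* 1, 2, 4 or 8 bytes */
        if ((size_t)(end - *buf) < len)
            return 0;
        *val = **buf & 0x3f;             /* strip the two length bits */
        for (size_t i = 1; i < len; i++)
            *val = (*val << 8) | (*buf)[i];
        *buf += len;
        return 1;
    }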
MINOR: startup: allow hap_register_feature() to enable a feature in the list
This patch allows hap_register_feature() to enable a feature in the list
which was already registered and marked disabled.
This way some features can be enabled automatically under certain
conditions, without needing the USE argument with make, while still
correctly reporting their activation.
Willy Tarreau [Thu, 13 Feb 2025 15:40:17 +0000 (16:40 +0100)]
DOC: watchdog: document the sequence of the watchdog and panic
Each time we go into the watchdog and panic code, it's super hard to
figure who calls what since signals are involved to bounce between
threads. Let's document the main principles and sequences to ease the
journey next time.
MINOR: ssl/cli: display more filenames in 'show ssl cert'
"show ssl cert <file>" only displays a unique filename, which is the
key used in the ckch_store tree. This patch extends it by displaying
every filenames from the ckch_conf that can be configured with the
crt-store.
In order to be more consistent, some changes are needed in the future:
- we need to store the complete path in the ckch_conf (meaning with
crt-path or key-path)
- we need to fill a ckch_conf in case the files are autodiscovered
1d3c8223 ("MINOR: ssl: allow to change the server signature algorithm")
implemented the sigalgs keyword in the crt-list but never the dump of the
keyword over the CLI.
b6ae2aafde43 ("MINOR: ssl: allow to change the signature algorithm for
client authentication") implemented the client-sigalgs keyword in the
crt-list but never the dump of the keyword over the CLI.
Willy Tarreau [Wed, 12 Feb 2025 15:25:13 +0000 (16:25 +0100)]
BUG/MEDIUM: fd: mark FD transferred to another process as FD_CLONED
The crappy epoll API struck again with reloads and transferred FDs.
Indeed, when listening sockets are retrieved by a new worker from a
previous one, and the old one finally stops listening on them, it
closes the FDs. But in this case, since the sockets themselves were
not closed, epoll will not unregister them and will continue to
report new activity for these in the old process, which can only
observe, count an fd_poll_drop event and not unregister them since
they're not reachable anymore.
The unfortunate effect is that long-lasting old processes are woken
up at the same rate as the new process when accepting new connections,
and can waste a lot of CPU. Accept rates divided by 8 were observed on
a small test involving a slow transfer on 10 connections facing a reload
every second so that 10 processes were busy dealing with them while
another process was hammering the service with new connections.
Fortunately, years ago we implemented a flag FD_CLONED exactly for
similar purposes. Let's simply mark transferred FDs with FD_CLONED
so that the process knows that these ones require special treatment
and have to be manually unregistered before being closed. This does
the job fine, now old processes correctly unregister the FD before
closing it and no longer receive accept events for the new process.
This needs to be backported to all stable versions. It only affects
epoll, as usual, and this time in combination with transferred FDs
(typically reloads in master-worker mode). Thanks to Damien Claisse
for providing all the detailed measurements and statistics allowing us
to understand and reproduce the problem.
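For illustration, tagging a transferred FD is as simple as this sketch
(FD_CLONED is the existing flag):

    /* mark an FD inherited from the previous process so that the close
     * path explicitly unregisters it from epoll before close()
     */
    HA_ATOMIC_OR(&fdtab[fd].state, FD_CLONED);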
Amaury Denoyelle [Mon, 10 Feb 2025 10:28:35 +0000 (11:28 +0100)]
MINOR: quic: refactor CRYPTO encoding and splitting
This patch is the direct follow-up of the previous one, which refactored
STREAM frame encoding. Reuse the newly defined quic_strm_frm_fillbuf()
and quic_strm_frm_split() functions for CRYPTO frame encoding.
The code for CRYPTO and STREAM frame encoding should now be clearer as
it is mostly identical.
MINOR: quic: refactor STREAM encoding and splitting
CRYPTO and STREAM frame encoding is similar. If the payload is too large,
the frame will be split and only the first payload part will be written
in the output QUIC packet. This process is complexified by the presence
of a variable-length integer Length field prior to the payload.
This commit aims at refactoring these operations. Two functions are
defined to simplify the code:
* quic_strm_frm_fillbuf() which is used to calculate the optimal frame
length of a STREAM/CRYPTO frame with its payload in a buffer
* quic_strm_frm_split() which is used to split the frame payload if
buffer is too small
With this patch, both functions are now implemented for STREAM encoding.
MEDIUM: initcall: allow to register mutiple post_section_parser per section
Before this patch, REGISTER_CONFIG_SECTION() allowed to register one and
only one callback (<post>) called after the parsing of a section.
It was limiting because you couldn't register a post callback from anywhere
else in the code.
This patch introduces the new REGISTER_CONFIG_SECTION_POST() macro, which
allows registering a new post callback for a section keyword from anywhere.
It does so by allowing `struct cfg_section` entries that do not have a
`section_parser`, and then iterating over all cfg_section entries with a
post_section_parser for a keyword.
BUG/MINOR: mworker: post_section_parser for the last section in discovery
Previous patch 2c270a05f ("BUG/MINOR: mworker: section ignored in
discovery after a post_section_parser") needs an adjustment for the last
section of the file.
Indeed the post_section_parser of the last section must not be called in
discovery mode.
BUG/MINOR: mworker: section ignored in discovery after a post_section_parser
When a new section is discovered, the post_section_parser of the
previous section is called. However, in the new master-worker mode, the
discovery mode will skip the post_section_parser. But instead of
trying to parse the current section keyword after that, it would
completely skip the current line.
This is a minor bug since there aren't many sections with a
post_section_parser, and not many sections to parse in discovery
mode.
But this could be reproduced like this:

    global
        expose-deprecated-directives

    resolvers res
        parse-resolv-conf

    program foo
        command sleep 10

    program bar
        command sleep 10

The 'resolvers' section has a post_section_parser which will be ignored
in discovery mode, with the consequence of ignoring the first program
section.
Amaury Denoyelle [Wed, 12 Feb 2025 09:55:51 +0000 (10:55 +0100)]
MINOR: quic: simplify length calculation for STREAM/CRYPTO frames
STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.
It is complex to determine the optimal Length value before copying the
payload data into the remaining buffer space. As such, helper functions
were implemented to calculate this. However, the CRYPTO and STREAM frame
encoding implementations were not completely aligned, which rendered the
code harder to follow.
The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
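For illustration, the idea behind quic_int_cap_length() can be sketched
as follows (illustrative implementation, not the actual one): find the
largest payload length whose varint-encoded Length field still fits with
it in the available room.

    #include <stddef.h>
    #include <stdint.h>

    static size_t cap_payload_len(size_t room)
    {
        /* varint encoding sizes and the largest value each one can hold */
        static const struct { size_t enc; uint64_t max; } v[] = {
            { 1, 63 }, { 2, 16383 }, { 4, 1073741823 },
            { 8, 4611686018427387903ULL },
        };
        size_t best = 0;

        for (int i = 0; i < 4; i++) {
            size_t len;

            if (room <= v[i].enc)
                break;
            len = room - v[i].enc;
            if ((uint64_t)len > v[i].max)
                len = (size_t)v[i].max;
            if (len > best)
                best = len;
        }
        return best;
    }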
Amaury Denoyelle [Tue, 11 Feb 2025 13:35:52 +0000 (14:35 +0100)]
BUG/MINOR: quic: fix CRYPTO payload size calculation for encoding
Function max_stream_data_size() is used to determine the payload length
of a CRYPTO frame. It takes into account that the CRYPTO length field is
a variable length integer.
The implemented calculation was incorrect as it reserved too much space
for the frame header. This error is mostly due to the fact that
max_stream_data_size() reuses max_available_room(), which also reserves
space for a variable-length integer. This resulted in CRYPTO frames 1 to
2 bytes shorter than the maximum achievable value, which in the end
produced datagrams shorter than the MTU.
Fix max_stream_data_size() implementation. It is now merely a wrapper on
max_available_room(). This ensures that CRYPTO frame encoding is now
properly optimized to use the MTU available.
Amaury Denoyelle [Tue, 11 Feb 2025 13:34:57 +0000 (14:34 +0100)]
BUG/MINOR: quic: reserve length field for long header encoding
Long header packets have a mandatory Length field, which contains the
size of Packet number and payload, encoded as a variable-length integer.
Its value can thus only be determined after the payload size is known,
which depends on the remaining buffer space after this variable-length
field.
Packet payload are encoded in two steps. First, a list of input frames
is processed until the packet buffer is full. CRYPTO and STREAM frames
payload can be splitted if need to fill the buffer. Real encoding is
then performed as a second stage operation, first with Length field,
then with the selected frames themselves.
Before this patch, no space was reserved in the buffer for Length field
when attaching the frames to the packet. This could result in a error as
the packet payload would be too large for the remaining space.
In practice, this issue was rarely encountered, mostly as a side-effect
of another issue linked to CRYPTO frame encoding. Indeed, a wrong
calculation is performed on CRYPTO splitting, which results in frame
payloads a few bytes shorter than expected. This however ensured there
would always be enough room for the Length field and payload during
encoding. As CRYPTO frames are the only big enough content emitted with
a Long header packet, this renders the current issue mostly
non-reproducible.
Fix the original issue by reserving some space for Length field prior to
frame payload calculation, using a maximum value based on the remaining
room space. Packet length is then reduced if needed when encoding is
performed, which ensures there is always enough room for the selected
frames.
Note that the other issue impacting CRYPTO frame encoding is not yet
fixed. This could result in datagrams with Long header packets not
completely extended to the full MTU. The issue will be addressed in
another patch.
Willy Tarreau [Tue, 11 Feb 2025 16:24:19 +0000 (17:24 +0100)]
MAJOR: leastconn: postpone the server's repositioning under contention
When leastconn is used under many threads, there can be a lot of
contention on leastconn, because the same node has to be moved around
all the time (when picking it and when releasing it). In GH issue #2861
it was noticed that 46 threads out of 64 were waiting on the same lock
in fwlc_srv_reposition().
In such a case, the accuracy of the server's key becomes quite irrelevant
because nobody cares if the same server is picked twice in a row and the
next one twice again.
While other approaches in the past considered using a floating key to
avoid moving the server each time (which was not compatible with the
round-robin rule for equal keys), here a more drastic solution is needed.
What we're doing instead is that we turn this lock into a trylock. If we
can grab it, we do the job. If we can't, then we just wake up a server's
tasklet dedicated to this. That tasklet will then try again slightly
later, knowing that during this short time frame, the server's position
in the queue is slightly inaccurate. Note that any thread touching the
same server will also reposition it and save that work for next time.
Also if multiple threads wake the tasklet up, then that's fine, their
calls will be merged and a single lock will be taken in the end.
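For illustration, this trylock-or-defer pattern resembles the sketch
below (macro and member names are simplified assumptions):

    if (HA_RWLOCK_TRYWRLOCK(LBPRM_LOCK, &p->lbprm.lock) == 0) {
        fwlc_srv_reposition_locked(srv);       /* do the work now */
        HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &p->lbprm.lock);
    }
    else
        tasklet_wakeup(srv->requeue_tasklet);  /* retry slightly later */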
Testing this on a 24-core EPYC 74F3 showed a significant performance
boost from 382krps to 610krps. The performance profile reported by
perf top dropped from 43% to 2.5%.
The cost of fwlc_get_next_server() naturally increases as the server
count increases, but now has no visible effect on updates. The load
distribution remains unchanged compared to the previous approach,
the weight still being respected.
For further improvements to the fwlc algo, please consult github
issue #881 which centralizes everything related to this algorithm.
Willy Tarreau [Tue, 11 Feb 2025 16:18:36 +0000 (17:18 +0100)]
MEDIUM: server: allocate a tasklet for asynchronous requeuing
This creates a tasklet that only expects to be called when the LB
algorithm is under contention when trying to reposition the server
in its tree. Indeed, that's one of the operations that usually
requires to take a write lock on a highly contended area, often
for very little benefits under contention; indeed, under load, if
a server keeps its previous position for a few extra microseconds,
usually there's no harm. Thus this new tasklet can be woken up by
the LB algo to ask the server to later call lbprm.server_requeue().
It does nothing else.
Willy Tarreau [Tue, 11 Feb 2025 16:16:14 +0000 (17:16 +0100)]
MINOR: lbprm: add a new callback ->server_requeue to the lbprm
This callback will be used to reposition a server to its expected
position regardless of the fact that it was taken or dropped. It
will only be used by supporting LB algos. For now, only fwlc defines
it and assigns it to fwlc_srv_reposition(). At the moment it's not
used yet.
Willy Tarreau [Mon, 10 Feb 2025 10:15:44 +0000 (11:15 +0100)]
DEBUG: thread: reduce the struct lock_stat to store only 30 buckets
Storing only 30 buckets means we only keep 256 bytes per label. This
further simplifies address calculation and reduces the memory used
without complicating the locking code. It means we won't measure wait
times larger than a second but we're not supposed to face this as it
would trigger the watchdog anyway. It may become a little tight if
measuring using rdtsc() instead of now_mono_time() though (typically
the limit would be around 350ms for a 3 GHz CPU).
Willy Tarreau [Mon, 10 Feb 2025 10:08:45 +0000 (11:08 +0100)]
DEBUG: thread: make lock_stat per operation instead of for all operations
It's more convenient (and more readable) to have the lock stats arranged
by operation type (read, seek, write). It will also allow the structure
format and the bucket address calculation to be simplified later. Now
lock_stat[] got split into lock_stats_rd[], lock_stats_sk[], lock_stats_wr[].
Willy Tarreau [Mon, 10 Feb 2025 09:56:44 +0000 (10:56 +0100)]
DEBUG: thread: don't keep the redundant _locked counter
Now that we have our sums by bucket, the _locked counter is redundant
since it's always equal to the sum of all entries. Let's just get rid
of it and replace its consumption with a loop over all buckets, this
will reduce the overhead of taking each lock at the expense of a tiny
extra effort when dumping all locks, which we don't care about.
Willy Tarreau [Mon, 10 Feb 2025 09:41:35 +0000 (10:41 +0100)]
DEBUG: thread: report the wait time buckets for lock classes
In addition to the total/average wait time, we now also store the
wait time in 2^N buckets. There are 32 buckets for each type (read,
seek, write), allowing to store wait times from 1-2ns to 2.1-4.3s,
which is quite sufficient, even if we'd want to switch from NS to
CPU cycles in the future. The counters are only reported for non-
zero buckets so as not to visually pollute the output.
This significantly inflates the lock_stat struct, which is now
aligned to 256 bytes and rounded up to 1kB. But that's not really
a problem, given that there's only one per lock label.
Willy Tarreau [Mon, 10 Feb 2025 08:26:04 +0000 (09:26 +0100)]
DEBUG: thread: make lock time computation more consistent
The lock time computation was a bit inconsistent between functions,
particularly those using a try_lock. Some of them would count the lock
as taken without counting the time, others would simply not count it.
This is essentially due to the way the time is retrieved, as it was
done inside the atomic increment.
Let's instead always use start_time to carry the elapsed time, by
presetting it to the negative time before the event and adding the
positive time after, so that it finally contains the duration. Then
depending on the try lock's success, we add the result or not. This
was generalized to all lock functions for consistency, and because
this will be handy for future changes.
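For illustration, the resulting pattern for a write try-lock looks like
this sketch (field names are simplified):

    uint64_t start_time = -now_mono_time();
    int ret = pthread_rwlock_trywrlock(&l->lock);

    start_time += now_mono_time(); /* now holds the elapsed duration */
    if (ret == 0)
        HA_ATOMIC_ADD(&lock_stats_wr[lbl].nsec_wait, start_time);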