BUG/MINOR: tcpcheck: Be able to expect an empty response
It is not possible to successfully match an empty response. However using
regex, it should be possible to reject response with any content. For
instance:
tcp-check expect !rstring ".+"
It may seem a be strange to do that, but it is possible, it is a valid
config. So it must work. Thanks to this patch, it is now really supported.
This patch may be backported as far as 2.2. But only if someone ask for it.
This this commit, this is ->idle_expire of quic_conn struct which must
be taken into an account to display the idel timer task expiration value:
MEDIUM: quic: Ack delay implementation
Furthermore, this value was always zero until now_ms has wrapped (20 s after
the start time) due to this commit:
MEDIUM: clock: force internal time to wrap early after boot
Do not rely on the value of now_ms compared to ->idle_expire to display the
difference but use ticks_remain() to compute it.
Must be backported to 2.7 where "show quic" has already been backported.
BUG/MINOR: quic: Unexpected connection closures upon idle timer task execution
This bug arrived with this commit:
MEDIUM: quic: Ack delay implementation
It is possible that the idle timer task was already in the run queue when its
->expire field was updated calling qc_idle_timer_do_rearm(). To prevent this
task from running in this condition, one must check its ->expire field value
with this condition to run the task if its timer has really expired:
!tick_is_expired(t->expire, now_ms)
Furthermore, as this task may be directly woken up with a call to task_wakeup()
all, for instance by qc_kill_conn() to kill the connection, one must check this
task has really been woken up when it was in the wait queue and not by a direct
call to task_wakeup() thanks to this test:
(state & TASK_WOKEN_ANY) == TASK_WOKEN_TIMER
Again, when this condition is not fulfilled, the task must be run.
Must be backported where the commit mentionned above was backported.
MINOR: http-act: emit a warning when a header field name contains forbidden chars
As found in issue #2089, it's easy to mistakenly paste a colon in a
header name, or other chars (e.g. spaces) when quotes are in use, and
this causes all sort of trouble in field because such chars are rejected
by the peer.
Better try to detect these upfront. That's what we're doing here during
the parsing of the add-header/set-header/early-hint actions, where a
warning is emitted if a non-token character is found in a header name.
A special case is made for the colon at the beginning so that it remains
possible to place any future pseudo-headers that may appear. E.g:
[WARNING] (14388) : config : parsing [badchar.cfg:23] : header name 'X-Content-Type-Options:' contains forbidden character ':'.
This should be backported to 2.7, and ideally it should be turned to an
error in future versions.
BUG/MINOR: quic: Remove useless BUG_ON() in newreno and cubic algo implementation
As now_ms may be zero, these BUG_ON() could be triggered when its value has wrapped.
These call to BUG_ON() may be removed because the values they was supposed to
check are safely used by the ticks API.
BUG/MINOR: ssl: Undefined reference when building with OPENSSL_NO_DEPRECATED
If OPENSSL_NO_DEPRECATED is set, we get a 'error: ‘RSA_PKCS1_PADDING’
undeclared' when building jwt.c. The symbol is not deprecated, we are
just missing an include.
This was raised in GitHub issue #2098.
It does not need to be backported.
BUG/MAJOR: quic: Congestion algorithms states shared between the connection
This very old bug is there since the first implementation of newreno congestion
algorithm implementation. This was a very bad idea to put a state variable
into quic_cc_algo struct which only defines the congestion control algorithm used
by a QUIC listener, typically its type and its callbacks.
This bug could lead to crashes since BUG_ON() calls have been added to each algorithm
implementation. This was revealed by interop test, but not very often as there was
not very often several connections run at the time during these tests.
Hopefully this was also reported by Tristan in GH #2095.
Move the congestion algorithm state to the correct structures which are private
to a connection (see cubic and nr structs).
BUG/MINOR: quic: Remaining useless statements in cubic slow start callback
When entering a recovery period, the algo state is set by quic_enter_recovery().
And that's it!. These two lines should have been removed with this commit:
BUG/MINOR: quic: Wrong use of now_ms timestamps (cubic algo)
Take the opportunity of this patch to add a missing TRACE_LEAVE() call in
quic_cc_cubic_ca_cb().
Willy Tarreau [Fri, 31 Mar 2023 14:33:53 +0000 (16:33 +0200)]
MINOR: cli: support filtering on FD types in "show fd"
Depending on what we're debugging, some FDs can represent pollution in
the "show fd" output. Here we add a set of filters allowing to pick (or
exclude) any combination of listener, frontend conn, backend conn, pipes,
etc. "show fd l" will only list listening connections for example.
In ->srtt quic_loss struct this is 8*srtt which is stored so that not to have to multiply/devide
it to compute the RTT variance (at least). This is where there was a bug in quic_loss_srtt_update():
each time ->srtt must be used, it must be devided by 8 or right shifted by 3.
This bug had a very bad impact for network with non negligeable packet loss.
Reuse the idle timeout task to delay the acknowledgments. The time of the
idle timer expiration is for now on stored in ->idle_expire. The one to
trigger the acknowledgements is stored in ->ack_expire.
Add QUIC_FL_CONN_ACK_TIMER_FIRED new connection flag to mark a connection
as having its acknowledgement timer been triggered.
Modify qc_may_build_pkt() to prevent the sending of "ack only" packets and
allows the connection to send packet when the ack timer has fired.
It is possible that acks are sent before the ack timer has triggered. In
this case it is cancelled only if ACK frames are really sent.
The idle timer expiration must be set again when the ack timer has been
triggered or when it is cancelled.
Dump variables displayed by TRACE_ENTER() or TRACE_LEAVE() by calls to TRACE_PROTO().
No more variables are displayed by the two former macros. For now on, these information
are accessible from proto level.
Add new calls to TRACE_PROTO() at important locations in relation whith QUIC transport
protocol.
When relevant, try to prefix such traces with TX or RX keyword to identify the
concerned subpart (transmission or reception) of the protocol.
This callback was left as not implemented. It should at least display
the algorithm state, the control congestion window the slow start threshold
and the time of the current recovery period. Should be helpful to debug.
BUG/MINOR: quic: Missing max_idle_timeout initialization for the connection
This bug was revealed by handshakeloss interop tests (often with quiceh) where one
could see haproxy an Initial packet without TLS ClientHello message (only a padded
PING frame). In this case, as the ->max_idle_timeout was not initialized, the
connection was closed about three seconds later, and haproxy opened a new one with
a new source connection ID upon receipt of the next client Initial packet. As the
interop runner count the number of source connection ID used by the server to check
there were exactly 50 such IDs used by the server, it considered the test as failed.
So, the ->max_idle_timeout of the connection must be at least initialized
to the local "max_idle_timeout" transport parameter value to avoid such
a situation (closing connections too soon) until it is "negotiated" with the
client when receiving its TLS ClientHello message.
BUG/MINOR: quic: Wrong use of now_ms timestamps (newreno algo)
This patch is similar to the one for cubic algorithm:
"BUG/MINOR: quic: Wrong use of timestamps with now_ms variable (cubic algo)"
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore, to make the newreno congestion control algorithm more readable and easy
to maintain, add quic_cc_cubic_rp_cb() new callback for the "in recovery period"
state (QUIC_CC_ST_RP).
MINOR: quic: Add recovery related information to "show quic"
Add ->srtt, ->rtt_var, ->rtt_min and ->pto_count values from ->path->loss
struct to "show quic". Same thing for ->cwnd from ->path struct.
Also take the opportunity of this patch to dump the packet number
space information directly from ->pktns[] array in place of ->els[]
array. Indeed, ->els[QUIC_TLS_ENC_LEVEL_EARLY_DATA] and ->els[QUIC_TLS_ENC_LEVEL_APP]
have the same packet number space.
Must be backported to 2.7 where "show quic" implementation has alredy been
backported.
BUG/MINOR: quic: Wrong use of now_ms timestamps (cubic algo)
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore to make the cubic congestion control algorithm more readable and easy
to maintain, adding a new state ("in recovery period" QUIC_CC_ST_RP new enum) helps
in reaching this goal. Implement quic_cc_cubic_rp_cb() which is the callback for
this new state.
BUG/MINOR: ssl: ssl-(min|max)-ver parameter not duplicated for bundles in crt-list
If a bundle is used in a crt-list, the ssl-min-ver and ssl-max-ver
options were not taken into account in entries other than the first one
because the corresponding fields in the ssl_bind_conf structure were not
copied in crtlist_dup_ssl_conf.
This should fix GitHub issue #2069.
This patch should be backported up to 2.4.
BUG/MINOR: ssl: Fix potential leak in cli_parse_update_ocsp_response
In some extremely unlikely case (or even impossible for now), we might
exit cli_parse_update_ocsp_response without raising an error but with a
filled 'err' buffer. It was not properly free'd.
BUG/MINOR: ssl: Remove dead code in cli_parse_update_ocsp_response
This patch removes dead code from the cli_parse_update_ocsp_response
function. The 'end' label in only used in case of error so the check of
the 'errcode' variable and the errcode variable itself become useless.
This patch does not need to be backported.
It fixes GitHub issue #2077.
BUG/MINOR: backend: make be_usable_srv() consistent when stopping
When a proxy enters the STOPPED state, it will no longer accept new
connections.
However, it doesn't mean that it's completely inactive yet: it will
still be able to handle already pending / keep-alive connections,
thus finishing ongoing work before effectively stopping.
be_usable_srv(), which is used by nbsrv converter and sample fetch,
will return 0 if the proxy is either stopped or disabled.
nbsrv behaves this way since it was originally implemented in b7e7c4720
("MINOR: Add nbsrv sample converter").
(Since then, multiple refactors were performed around this area, but
the current implementation still follows the same logic)
It was found that if nbsrv is used in a proxy section to perform routing
logic, unexpected decisions are being made when nbsrv is used on a proxy
with STOPPED state, since in-flight requests will suffer from nbsrv
returning 0 instead of the current number of usable servers which may
still process existing connections.
For instance, this can happen during process soft-stop, or even when
stopping the proxy from the cli / lua.
To fix this: we now make sure be_usable_srv() always returns the
current number of usable servers, unless the proxy is explicitly
disabled (from the config, not at runtime)
This could be backported up to 2.6.
For older versions, the need for a backport should be evaluated first.
--
Note for 2.4: proxy flags did not exist, it was implemented with fd10ab5e
("MINOR: proxy: Introduce proxy flags to replace disabled bitfield")
For 2.2: STOPPED and DISABLED states were not separated, so we have no
easy way to apply the fix anyway.
BUG/MEDIUM: proxy/sktable: prevent watchdog trigger on soft-stop
During soft-stop, manage_proxy() (p->task) will try to purge
trashable (expired and not referenced) sticktable entries,
effectively releasing the process memory to leave some space
for new processes.
This is done by calling stktable_trash_oldest(), immediately
followed by a pool_gc() to give the memory back to the OS.
As already mentioned in dfe7925 ("BUG/MEDIUM: stick-table:
limit the time spent purging old entries"), calling
stktable_trash_oldest() with a huge batch can result in the function
spending too much time searching and purging entries, and ultimately
triggering the watchdog.
Lately, an internal issue was reported in which we could see
that the watchdog is being triggered in stktable_trash_oldest()
on soft-stop (thus initiated by manage_proxy())
According to the report, the crash seems to only occur since 5938021
("BUG/MEDIUM: stick-table: do not leave entries in end of window during purge")
This could be the result of stktable_trash_oldest() now working
as expected, and thus spending a large amount of time purging
entries when called with a large enough <to_batch>.
Instead of adding new checks in stktable_trash_oldest(), here we
chose to address the issue directly in manage_proxy().
Since the stktable_trash_oldest() function is called with
<to_batch> == <p->table->current>, it's pretty obvious that it could
cause some issues during soft-stop if a large table, assuming it is
full prior to the soft-stop, suddenly sees most of its entries
becoming trashable because of the soft-stop.
Moreover, we should note that the call to stktable_trash_oldest() is
immediately followed by a call to pool_gc():
We know for sure that pool_gc(), as it involves malloc_trim() on
glibc, is rather expensive, and the more memory to reclaim,
the longer the call.
We need to ensure that both stktable_trash_oldest() + consequent
pool_gc() call both theoretically fit in a single task execution window
to avoid contention, and thus prevent the watchdog from being triggered.
To do this, we now allocate a "budget" for each purging attempt.
budget is maxed out to 32K, it means that each sticktable cleanup
attempt will trash at most 32K entries.
32K value is quite arbitrary here, and might need to be adjusted or
even deducted from other parameters if this fails to properly address
the issue without introducing new side-effects.
The goal is to find a good balance between the max duration of each
cleanup batch and the frequency of (expensive) pool_gc() calls.
If most of the budget is actually spent trashing entries, then the task
will immediately be rescheduled to continue the purge.
This way, the purge is effectively batched over multiple task runs.
This may be slowly backported to all stable versions.
[Please note that this commit depends on 6e1fe25 ("MINOR: proxy/pool:
prevent unnecessary calls to pool_gc()")]
Martin DOLEZ [Tue, 28 Mar 2023 13:06:05 +0000 (09:06 -0400)]
MINOR: http_fetch: Add case-insensitive argument for url_param/urlp_val
This commit adds a new optional argument to smp_fetch_url_param
and smp_fetch_url_param_val that makes the parameter key comparison
case-insensitive.
Now users can retrieve URL parameters regardless of their case,
allowing to match parameters in case insensitive application.
Doc was updated.
Martin DOLEZ [Tue, 28 Mar 2023 13:06:05 +0000 (09:06 -0400)]
MINOR: http_fetch: add case insensitive support for smp_fetch_url_param
This commit adds a new argument to smp_fetch_url_param
that makes the parameter key comparison case-insensitive.
Several levels of callers were modified to pass this info.
Martin DOLEZ [Tue, 28 Mar 2023 13:49:53 +0000 (09:49 -0400)]
MINOR: http_fetch: Add support for empty delim in url_param
In prevision of adding a third parameter to the url_param
sample-fetch function we need to make the second parameter optional.
User can now pass a empty 2nd argument to keep the default delimiter.
DOC/MINOR: reformat configuration.txt's "quoting and escaping" table
The table in section 2.2 ("Quoting and escaping") was formated in a way
which is not recognized by haproxy-dconv, breaking it, and cutting off
the entire section.
This commit fix that by formatting the table in way which allows the
converter to produce the correct HTML.
CLEANUP: proxy: remove stop_time related dead code
Since eb77824 ("MEDIUM: proxy: remove the deprecated "grace" keyword"),
stop_time is never set, so the related code in manage_proxy() is not
relevant anymore.
Removing code that refers to p->stop_time, since it was probably
overlooked.
MINOR: proxy/pool: prevent unnecessary calls to pool_gc()
Under certain soft-stopping conditions (ie: sticktable attached to proxy
and in-progress connections to the proxy that prevent haproxy from
exiting), manage_proxy() (p->task) will wake up every second to perform
a cleanup attempt on the proxy sticktable (to purge unused entries).
However, as reported by TimWolla in GH #2091, it was found that a
systematic call to pool_gc() could cause some CPU waste, mainly
because malloc_trim() (which is rather expensive) is being called
for each pool_gc() invocation.
As a result, such soft-stopping process could be spending a significant
amount of time in the malloc_trim->madvise() syscall for nothing.
Example "strace -c -f -p `pidof haproxy`" output (taken from
Tim's report):
To prevent this, we now only call pool_gc() when some memory is
really expected to be reclaimed as a direct result of the previous
stick table cleanup.
This is pretty straightforward since stktable_trash_oldest() returns
the number of trashed sticky sessions.
BUG/MINOR: quic: Missing padding in very short probe packets
This bug arrived with this commit:
MINOR: quic: Send PING frames when probing Initial packet number space
This may happen when haproxy needs to probe the peer with very short packets
(only one PING frame). In this case, the packet must be padded. There was clearly
a case which was removed by the mentionned commit above. That said, there was
an extra byte which was added to the PADDING frame before the mentionned commit
above. This is no more the case with this patch.
Thank you to @tatsuhiro-t (ngtcp2 manager) for having reported this issue which
was revealed by the keyupdate test (on client side).
BUG/MEDIUM: mux-h2: Be able to detect connection error during handshake
When a backend H2 connection is waiting the connection is fully established,
nothing is sent. However, it remains useful to detect connection error at
this stage. It is especially important to release H2 connection on connect
error. Be able to set H2_CF_ERR_PENDiNG or H2_CF_ERROR flags when the
underlying connection is not fully established will exclude the H2C to be
inserted in a idle list in h2_detach().
Without this fix, an H2C in PREFACE state and relying on a connection in
error can be inserted in the safe list. Of course, it will be purged if not
reused. But in the mean time, it can be reused. When this happens, the
connection remains in error and nothing happens. At the end a connection
error is returned to the client. On low traffic, we can imagine a scenario
where this dead connection is the only idle connection. If it is always
reused before being purged, no connection to the server is possible.
In addition, h2c_is_dead() is updated to declare as dead any H2 connection
with a pending error if its state is PREFACE or SETTINGS1 (thus if no
SETTINGS frame was received yet).
This patch should fix the issue #2092. It must be backported as far as 2.6.
BUG/MINOR: stats: Don't replace sc_shutr() by SE_FL_EOS flag yet
In commit c2c043ed4 ("BUG/MEDIUM: stats: Consume the request except when
parsing the POST payload"), a change about applet was pushed too early. The
applet must still call cf_shutr() when the response is fully sent. It is
planned to rely on SE_FL_EOS flag, just like connections. But it is not
possible for now.
However, at first glance, this bug has no visible effect.
Willy Tarreau [Tue, 28 Mar 2023 11:58:56 +0000 (13:58 +0200)]
[RELEASE] Released version 2.8-dev6
Released version 2.8-dev6 with the following main changes :
- BUG/MEDIUM: mux-pt: Set EOS on error on sending path if read0 was received
- MINOR: ssl: Change the ocsp update log-format
- MINOR: ssl: Use ocsp update task for "update ssl ocsp-response" command
- BUG/MINOR: ssl: Fix double free in ocsp update deinit
- MINOR: ssl: Accept certpath as param in "show ssl ocsp-response" CLI command
- MINOR: ssl: Add certificate path to 'show ssl ocsp-response' output
- BUG/MEDIUM: proxy: properly stop backends on soft-stop
- BUG/MEDIUM: resolvers: Properly stop server resolutions on soft-stop
- DEBUG: cli/show_fd: Display connection error code
- DEBUG: ssl-sock/show_fd: Display SSL error code
- BUG/MEDIUM: mux-h1: Don't block SE_FL_ERROR if EOS is not reported on H1C
- BUG/MINOR: tcp_sample: fix a bug in fc_dst_port and fc_dst_is_local sample fetches
- BUG/MINOR: quic: Missing STREAM frame length updates
- BUG/MEDIUM: connection: Preserve flags when a conn is removed from an idle list
- BUG/MINOR: mux-h2: make sure the h2c task exists before refreshing it
- MINOR: buffer: add br_count() to return the number of allocated bufs
- MINOR: buffer: add br_single() to check if a buffer ring has more than one buf
- BUG/MEDIUM: mux-h2: only restart sending when mux buffer is decongested
- BUG/MINOR: mux-h2: set CO_SFL_STREAMER when sending lots of data
- BUG/MINOR: quic: Missing STREAM frame data pointer updates
- MINOR: stick-table: add sc-add-gpc() to http-after-response
- MINOR: doc: missing entries for sc-add-gpc()
- BUG/MAJOR: qpack: fix possible read out of bounds in static table
- OPTIM: mux-h1: limit first read size to avoid wrapping
- MINOR: mux-h2: set CO_SFL_MSG_MORE when sending multiple buffers
- MINOR: ssl-sock: pass the CO_SFL_MSG_MORE info down the stack
- MINOR: quic: Stop stressing the acknowledgments process (RX ACK frames)
- BUG/MINOR: quic: Dysfunctional 01RTT packet number space probing
- BUG/MEDIUM: stream: do not try to free a failed stream-conn
- BUG/MEDIUM: mux-h2: do not try to free an unallocated h2s->sd
- BUG/MEDIUM: mux-h2: erase h2c->wait_event.tasklet on error path
- BUG/MEDIUM: stconn: don't set the type before allocation succeeds
- BUG/MINOR: stconn: fix sedesc memory leak on stream allocation failure
- MINOR: dynbuf: set POOL_F_NO_FAIL on buffer allocation
- MINOR: pools: preset the allocation failure rate to 1% with -dMfail
- BUG/MEDIUM: mux-h1: properly destroy a partially allocated h1s
- BUG/MEDIUM: applet: only set appctx->sedesc on successful allocation
- BUG/MINOR: quic: wake up MUX on probing only for 01RTT
- BUG/MINOR: quic: ignore congestion window on probing for MUX wakeup
- BUILD: thread: implement thread_harmless_end_sig() for threadless builds
- BUILD: thread: silence a build warning when threads are disabled
- MINOR: debug: support dumping the libs addresses when running in verbose mode
- BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
- BUG/MINOR: trace: fix hardcoded level for TRACE_PRINTF
- BUG/MEDIUM: mux-quic: release data from conn flow-control on qcs reset
- MINOR: mux-quic: complete traces for qcs emission
- MINOR: mux-quic: adjust trace level for MAX_DATA/MAX_STREAM_DATA recv
- MINOR: mux-quic: add flow-control info to minimal trace level
- MINOR: pools: make sure 'no-memory-trimming' is always used
- MINOR: pools: intercept malloc_trim() instead of trying to plug holes
- MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
- MINOR: pools: export trim_all_pools()
- MINOR: pattern: use trim_all_pools() instead of a conditional malloc_trim()
- MINOR: tools: relax dlopen() on malloc/free checks
- MEDIUM: tools: further relax dlopen() checks too consider grouped symbols
- BUG/MINOR: pools: restore detection of built-in allocator
- MINOR: pools: report a replaced memory allocator instead of just malloc_trim()
- BUG/MINOR: h3: properly handle incomplete remote uni stream type
- BUG/MINOR: mux-quic: prevent CC status to be erased by shutdown
- MINOR: mux-quic: interrupt qcc_recv*() operations if CC scheduled
- MINOR: mux-quic: ensure CONNECTION_CLOSE is scheduled once per conn
- MINOR: mux-quic: close on qcs allocation failure
- MINOR: mux-quic: close on frame alloc failure
- BUG/MINOR: syslog: Request for more data if message was not fully received
- BUG/MEDIUM: stats: Consume the request except when parsing the POST payload
- DOC: config: set-var() dconv rendering issues
- BUG/MEDIUM: mux-h1: Wakeup H1C on shutw if there is no I/O subscription
- BUG/MINOR: applet/new: fix sedesc freeing logic
- BUG/MINOR: quic: Missing STREAM frame type updated
- BUILD: da: extends CFLAGS to support API v3 from 3.1.7 and onwards.
- BUG/MINOR: ssl: Stop leaking `err` in ssl_sock_load_ocsp()
David Carlier [Tue, 28 Mar 2023 05:24:49 +0000 (06:24 +0100)]
BUILD: da: extends CFLAGS to support API v3 from 3.1.7 and onwards.
Minor build update to still both support the v2 and v3 api from
the 3.1.7 release which supports a cache but would need a shift
in the HAProxy build not necessary at the moment.
In the second half of the year and for the next major HAProxy release
branch, v2 could be dropped altogether thus the next HAProxy 2.9
major release will contain more changes towards the v3 support
and reminder for the v2 EOL.
BUG/MINOR: quic: Missing STREAM frame type updated
This patch follows this commit which was not sufficient:
BUG/MINOR: quic: Missing STREAM frame data pointer updates
Indeed, after updating the ->offset field, the bit which informs the
frame builder of its presence must be systematically set.
This bug was revealed by the following BUG_ON() from
quic_build_stream_frame() :
bug condition "!!(frm->type & 0x04) != !!stream->offset.key" matched at src/quic_frame.c:515
This should fix the last crash occured on github issue #2074.
Since 465a6c8 ("BUG/MEDIUM: applet: only set appctx->sedesc on
successful allocation"), sedesc is attached to the appctx after the
task is successfully allocated.
If the task fails to allocate: current sedesc cleanup is performed
on appctx->sedesc which still points to NULL so sedesc won't be
freed.
This is fine when sedesc is provided as argument (!=NULL), but leads
to memory leaks if sedesc is allocated locally.
It was shown in GH #2086 that if sedesc != NULL when passed as
argument, it shouldn't be freed on error paths. This is what 465a6c8
was trying to address.
In an attempt to fix both issues at once, we do as Christopher
suggested: that is moving sedesc allocation attempt at the
end of the function, so that we don't have to free it in case
of error, thus removing the ambiguity.
(We won't risk freeing a sedesc that does not belong to us)
If we fail to allocate sedesc, then the task that was previously
created locally is simply destroyed.
This needs to be backported to 2.6 with 465a6c8 ("BUG/MEDIUM: applet:
only set appctx->sedesc on successful allocation")
[Copy pasting the original backport note from Willy:
In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.]
BUG/MEDIUM: mux-h1: Wakeup H1C on shutw if there is no I/O subscription
This old bug was revealed because of the commit 407210a34 ("BUG/MEDIUM:
stconn: Don't rearm the read expiration date if EOI was reached"). But it is
still possible to hit it if there is no server timeout. At first glance, the
2.8 is not affected. But the fix remains valid.
When a shutdown for writes if performed the H1 connection must be notified
to be released. If it is subscribed for any I/O events, it is not an
issue. But, if there is no subscription, no I/O event is reported to the H1
connection and it remains alive. If the message was already fully received,
nothing more happens.
On my side, I was able to trigger the bug by freezing the session. Some
users reported a spinning loop on process_stream(). Not sure how to trigger
the loop. To freeze the session, the client timeout must be reached while
the server response was already fully received. On old version (< 2.6), it
only happens if there is no server timeout.
To fix the issue, we must wake up the H1 connection on shutdown for writes
if there is no I/O subscription.
This patch must be backported as far as 2.0. It should fix the issue #2090
and #2088.
Since <cond> optional argument support was added to set-var() and friends
in 2.6 with 164726c ("DOC: vars: Add documentation about the set-var
conditions"), dconv is having a hard time rendering related keywords.
Everywhere `[,<cond> ...]` was inserted, html formatting is now broken.
Removing the space between <cond> and '...' allows dconv to properly parse
the token thus restores proper formatting without changing the meaning.
This was discovered when discussing about var() sample fetch doc issues
in GH #2087
BUG/MEDIUM: stats: Consume the request except when parsing the POST payload
The stats applet is designed to consume the request at the end, when it
finishes to send the response. And during the response forwarding, because
the request is not consumed, the applet states it will not consume
data. This avoid to wake the applet up in loop. When it finishes to send the
response, the request is consumed.
For POST requests, there is no issue because the response is small
enough. It is sent in one time and must be processed by HTTP analyzers. Thus
the forwarding is not performed by the applet itself. The applet is always
able to consume the request, regardless the payload length.
But for other requests, it may be an issue. If the response is too big to be
sent in one time and if the requests is not fully received when the response
headers are sent, the applet may be blocked infinitely, not consuming the
request. Indeed, in the case the applet will be switched in infinite forward
mode, the request will not be consumed immediately. At the end, the request
buffer is flushed. But if some data must still be received, the applet is not
woken up because it is still in a "not-consuming" mode.
So, to fix the issue, we must take care to re-enable data consuming when the
end of the response is reached.
BUG/MINOR: syslog: Request for more data if message was not fully received
In the syslog applet, when a message was not fully received, we must request
for more data by calling appctx_need_more_data() and not by setting
CF_READ_DONTWAIT flag on the request channel. Indeed, this flag is only used
to only try a read at once.
This patch could be backported as far as 2.4. On 2.5 and 2.4,
applet_need_more_data() must be replaced by si_cant_get().
Replace all BUG_ON() on frame allocation failure by a CONNECTION_CLOSE
sending with INTERNAL_ERROR code. This can happen for the following
cases :
* sending of MAX_STREAM_DATA
* sending of MAX_DATA
* sending of MAX_STREAMS_BIDI
In other cases (STREAM, STOP_SENDING, RESET_STREAM), an allocation
failure will only result in the current operation to be interrupted and
retried later. However, it may be desirable in the future to replace
this with a simpler CONNECTION_CLOSE emission to recover better under a
memory pressure issue.
Emit a CONNECTION_CLOSE with INTERNAL_ERROR code each time qcs
allocation fails. This can happen in two cases :
* when creating a local stream through application layer
* when instantiating a remote stream through qcc_get_qcs()
In both cases, error paths are already in place to interrupt the current
operation and a CONNECTION_CLOSE will be emitted soon after.
MINOR: mux-quic: ensure CONNECTION_CLOSE is scheduled once per conn
Add BUG_ON() statements to ensure qcc_emit_cc()/qcc_emit_cc_app() is not
called more than one time for each connection. This should improve code
resilience of MUX-QUIC and H3 and it will ensure that a scheduled
CONNECTION_CLOSE is not overwritten by another one with a different
error code.
This commit relies on the previous one to ensure all QUIC operations are
not conducted as soon as a CONNECTION_CLOSE has been prepared :
commit d7fbf458f8a4c5b09cbf0da0208fbad70caaca33
MINOR: mux-quic: interrupt most operations if CONNECTION_CLOSE scheduled
MINOR: mux-quic: interrupt qcc_recv*() operations if CC scheduled
Ensure that external MUX operations are interrupted if a
CONNECTION_CLOSE is scheduled. This was already the cases for some
functions. This is extended to the qcc_recv*() family for
MAX_STREAM_DATA, RESET_STREAM and STOP_SENDING.
Also, qcc_release_remote_stream() is skipped in qcs_destroy() if a
CONNECTION_CLOSE is already scheduled.
All of this will ensure we only proceed to minimal treatment as soon as
a CONNECTION_CLOSE is prepared. Indeed, all sending and receiving is
stopped as soon as a CONNECTION_CLOSE is emitted so only internal
cleanup code should be necessary at this stage.
This should prevent a registered CONNECTION_CLOSE error status to be
overwritten by an error in a follow-up treatment.
Amaury Denoyelle [Mon, 20 Mar 2023 16:34:22 +0000 (17:34 +0100)]
BUG/MINOR: mux-quic: prevent CC status to be erased by shutdown
HTTP/3 graceful shutdown operation is used to emit a GOAWAY followed by
a CONNECTION_CLOSE with H3_NO_ERROR status. It is used for every
connection on release which means that if a CONNECTION_CLOSE was already
registered for a previous error, its status code is overwritten.
To fix this, skip shutdown operation if a CONNECTION_CLOSE is already
registered at the MUX level. This ensures that the correct error status
is reported to the peer.
This should be backported up to 2.6. Note that qc_shutdown() does not
exists on 2.6 so modification will have to be made directly in
qc_release() as followed :
BUG/MINOR: h3: properly handle incomplete remote uni stream type
A H3 unidirectional stream is always opened with its stream type first
encoded as a QUIC variable integer. If the STREAM frame contains some
data but not enough to decode this varint, haproxy would crash due to an
ABORT_NOW() statement.
To fix this, ensure we support an incomplete stream type. In this case,
h3_init_uni_stream() returns 0 and the buffer content is not cleared.
Stream decoding will resume when new data are received for this stream
which should be enough to decode the stream type varint.
This bug has never occured on production because standard H3 stream types
are small enough to be encoded on a single byte.
Willy Tarreau [Wed, 22 Mar 2023 17:01:41 +0000 (18:01 +0100)]
MINOR: pools: report a replaced memory allocator instead of just malloc_trim()
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).
The now unused is_trim_enabled() function was finally dropped.
Willy Tarreau [Wed, 22 Mar 2023 16:52:05 +0000 (17:52 +0100)]
BUG/MINOR: pools: restore detection of built-in allocator
The runtime detection of the default memory allocator was broken by
Commit d8a97d8f6 ("BUG/MINOR: illegal use of the malloc_trim() function
if jemalloc is used") due to a misunderstanding of its role. The purpose
is not to detect whether we're on non-jemalloc but whether or not the
allocator was changed from the one we booted with, in which case we must
be extra cautious and absolutely refrain from calling malloc_trim() and
its friends.
This was done only to drop the message saying that malloc_trim() is
supported, which will be totally removed in another commit, and could
possibly be removed even in older versions if this patch would get
backported since in the end it provides limited value.
Willy Tarreau [Wed, 22 Mar 2023 16:01:32 +0000 (17:01 +0100)]
MEDIUM: tools: further relax dlopen() checks too consider grouped symbols
There's a recurring issue regarding shared library loading from Lua. If
the imported library is linked with a different version of openssl but
doesn't use it, the check will trigger and emit a warning. In practise
it's not necessarily a problem as long as the API is the same, because
all symbols are replaced and the library will use the included ssl lib.
It's only a problem if the library comes with a different API because
the dynamic linker will only replace known symbols with ours, and not
all. Thus the loaded lib may call (via a static inline or a macro) a
few different symbols that will allocate or preinitialize structures,
and which will then pass them to the common symbols coming from a
different and incompatible lib, exactly what happens to users of Lua's
luaossl when building haproxy with quictls and without rebuilding
luaossl.
In order to better address this situation, we now define groups of
symbols that must always appear/disappear in a consistent way. It's OK
if they're all absent from either haproxy or the lib, it means that one
of them doesn't use them so there's no problem. But if any of them is
defined on any side, all of them must be in the exact same state on the
two sides. The symbols are represented using a bit in a mask, and the
mask of the group of other symbols they're related to. This allows to
check 64 symbols, this should be OK for a while. The first ones that
are tested for now are symbols whose combination differs between
openssl versions 1.0, 1.1, and 3.0 as well as quictls. Thus a difference
there will indicate upcoming trouble, but no error will mean that we're
running on a seemingly compatible API and that all symbols should be
replaced at once.
The same mechanism could possibly be used for pcre/pcre2, zlib and the
few other optional libs that may occasionally cause runtime issues when
used by dependencies, provided that discriminatory symbols are found to
distinguish them. But in practice such issues are pretty rare, mainly
because loading standard libs via Lua on a custom build of haproxy is
not pretty common.
In the event that further symbol compatibility issues would be reported
in the future, backporting this patch as well as the following series
might be an acceptable solution given that the scope of changes is very
narrow (the malloc stuff is needed so that the malloc/free tests can be
dropped):
BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
MINOR: pools: make sure 'no-memory-trimming' is always used
MINOR: pools: intercept malloc_trim() instead of trying to plug holes
MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
MINOR: pools: export trim_all_pools()
MINOR: pattern: use trim_all_pools() instead of a conditional malloc_trim()
MINOR: tools: relax dlopen() on malloc/free checks
Willy Tarreau [Wed, 22 Mar 2023 14:47:51 +0000 (15:47 +0100)]
MINOR: tools: relax dlopen() on malloc/free checks
Now that we can provide a safe malloc_trim() we don't need to detect
anymore that some dependencies use a different set of malloc/free
functions than ours because they will use the same as those we're
seeing, and we control their use of malloc_trim(). The comment about
the incompatibility with DEBUG_MEM_STATS is not true anymore either
since the feature relies on macros so we're now OK.
This will stop catching libraries linked against glibc's allocator
when haproxy is natively built with jemalloc. This was especially
annoying since dlopen() on a lib depending on jemalloc() tends to
fail on TLS issues.
Willy Tarreau [Wed, 22 Mar 2023 14:32:58 +0000 (15:32 +0100)]
MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
We already have some generic code in trim_all_pools() to implement the
equivalent of malloc_trim() on jemalloc and macos. Instead of keeping the
logic there, let's just move it to our own malloc_trim() implementation
so that we can unify the mechanism and the logic. Now any low-level code
calling malloc_trim() will either be disabled by haproxy's config if the
user decides to, or will be mapped to the equivalent mechanism if malloc()
was intercepted by a preloaded jemalloc.
Trim_all_pools() preserves the benefit of serializing threads (which we
must not impose to other libs which could come with their own threads).
It means that our own code should mostly use trim_all_pools() instead of
calling malloc_trim() directly.
Willy Tarreau [Wed, 22 Mar 2023 13:55:25 +0000 (14:55 +0100)]
MINOR: pools: intercept malloc_trim() instead of trying to plug holes
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover, it's dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.
The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
Willy Tarreau [Wed, 22 Mar 2023 13:52:30 +0000 (14:52 +0100)]
MINOR: pools: make sure 'no-memory-trimming' is always used
The global option 'no-memory-trimming' was added in 2.6 with commit c4e56dc58 ("MINOR: pools: add a new global option "no-memory-trimming"")
but there were some cases left where it was not considered. Let's make
is_trim_enabled() also consider it.
Amaury Denoyelle [Wed, 22 Mar 2023 10:58:32 +0000 (11:58 +0100)]
MINOR: mux-quic: complete traces for qcs emission
Add traces for _qc_send_qcs() function. Most notably, traces have been
added each time a qc_stream_desc buffer allocation fails and when stream
or connection flow-level is reached. This should improve debugging for
emission issues.
Amaury Denoyelle [Wed, 22 Mar 2023 10:17:59 +0000 (11:17 +0100)]
BUG/MEDIUM: mux-quic: release data from conn flow-control on qcs reset
Connection flow-control level calculation is a bit complicated. To
ensure it is never exceeded, each time a transfer occurs from a
qcs.tx.buf to its qc_stream_desc buffer it is accounted in
qcc.tx.offsets at the connection level. This value is not decremented
even if the corresponding STREAM frame is rejected by the quic-conn
layer as its emission will be retried later.
In normal cases this works as expected. However there is an issue if a
qcs instance is removed with prepared data left. In this case, its data
is still accounted in qcc.tx.offsets despite being removed which may
block other streams. This happens every time a qcs is reset with
remaining data which will be discarded in favor of a RESET_STREAM frame.
To fix this, if a stream has prepared data in qcc_reset_stream(), it is
decremented from qcc.tx.offsets. A BUG_ON() has been added to ensure
qcs_destroy() is never called for a stream with prepared data left.
This bug can cause two issues :
* transfer freeze as data unsent from closed streams still count on the
connection flow-control limit and will block other streams. Note that
this issue was not reproduced so it's unsure if this really happens
without the following issue first.
* a crash on a BUG_ON() statement in qc_send() loop over
qc_send_frames(). Streams may remained in the send list with nothing
to send due to connection flow-control limit. However, limit is never
reached through qcc_streams_sent_done() so QC_CF_BLK_MFCTL flag is not
set which will allow the loop to continue.
The last case was reproduced after several minutes of testing using the
following command :
Miroslav Zagorac [Wed, 22 Mar 2023 11:52:19 +0000 (12:52 +0100)]
BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
..
Support for malloc_trim() is enabled.
..
It's not so much a problem as it is that malloc_trim() is called in the
pat_ref_purge_range() function without any checking.
This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function and before calling
malloc_trim() it is checked whether the function should be called.
Willy Tarreau [Wed, 22 Mar 2023 10:37:54 +0000 (11:37 +0100)]
MINOR: debug: support dumping the libs addresses when running in verbose mode
Starting haproxy with -dL helps enumerate the list of libraries in use.
But sometimes in order to go further we'd like to see their address
ranges. This is already supported on the CLI's "show libs" but not on
the command line where it can sometimes help troubleshoot startup issues.
Let's dump them when in verbose mode. This way it doesn't change the
existing behavior for those trying to enumerate libs to produce an archive.
Willy Tarreau [Wed, 22 Mar 2023 09:28:50 +0000 (10:28 +0100)]
BUILD: thread: silence a build warning when threads are disabled
When threads are disabled, the compiler complains that we might be
accessing tg->abs[] out of bounds since the array is of size 1. It
cannot know that the condition to do this is never met, and given
that it's not in a fast path, we can make it more obvious.
Willy Tarreau [Wed, 22 Mar 2023 09:26:58 +0000 (10:26 +0100)]
BUILD: thread: implement thread_harmless_end_sig() for threadless builds
Building without thread support was broken in 2.8-dev2 with commit 7e70bfc8c ("MINOR: threads: add a thread_harmless_end() version that
doesn't wait") that forgot to define the function for the threadless
cases. No backport is needed.
Amaury Denoyelle [Tue, 21 Mar 2023 10:39:24 +0000 (11:39 +0100)]
BUG/MINOR: quic: ignore congestion window on probing for MUX wakeup
qc_notify_send() is used to wake up the MUX layer for sending. This
function first ensures that all sending condition are met to avoid to
wake up the MUX for unnecessarily.
One of this condition is to check if there is room in the congestion
window. However, when probe packets must be sent due to a PTO
expiration, RFC 9002 explicitely mentions that the congestion window
must be ignored which was not the case prior to this patch.
This commit fixes this by first setting <pto_probe> of 01RTT packet
space before invoking qc_notify_send(). This ensures that congestion
window won't be checked anymore to wake up the MUX layer until probing
packets are sent.
This commit replaces the following one which was not sufficient :
commit e25fce03ebe3307bc104d1f81356108e271d2bc3
BUG/MINOR: quic: Dysfunctional 01RTT packet number space probing
Amaury Denoyelle [Mon, 20 Mar 2023 17:29:36 +0000 (18:29 +0100)]
BUG/MINOR: quic: wake up MUX on probing only for 01RTT
On PTO probe timeout expiration, a probe packet must be emitted.
quic_pto_pktns() is used to determine for which packet space the timer
has expired. However, if MUX is already subscribed for sending, it is
woken up without checking first if this happened for the 01RTT packet
space.
It is unsure that this is really a bug as in most cases, MUX is
established only after Initial and Handshake packet spaces are removed.
However, the situation is not se clear when 0-RTT is used. For this
reason, adjust the code to explicitely check for the 01RTT packet space
before waking up the MUX layer.
This should be backported up to 2.6. Note that qc_notify_send() does not
exists in 2.6 so it should be replaced by the explicit block checking
(qc->subs && qc->subs->events & SUB_RETRY_SEND).
Willy Tarreau [Tue, 21 Mar 2023 09:50:51 +0000 (10:50 +0100)]
BUG/MEDIUM: applet: only set appctx->sedesc on successful allocation
If appctx_new_on() fails to allocate a task, it will not remove the
freshly allocated sedesc from the appctx despite freeing it, causing
a UAF. Let's only assign appctx->sedesc upon success.
This needs to be backported to 2.6. In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.
Willy Tarreau [Tue, 21 Mar 2023 09:44:44 +0000 (10:44 +0100)]
BUG/MEDIUM: mux-h1: properly destroy a partially allocated h1s
In h1c_frt_stream_new() and h1c_bck_stream_new(), if we fail to completely
initialize the freshly allocated h1s, typically because sc_attach_mux()
fails, we must use h1s_destroy() to de-initialize it. Otherwise it stays
attached to the h1c when released, causing use-after-free upon the next
wakeup. This can be triggered upon memory shortage.
Willy Tarreau [Tue, 21 Mar 2023 08:24:53 +0000 (09:24 +0100)]
MINOR: pools: preset the allocation failure rate to 1% with -dMfail
Using -dMfail alone does nothing unless tune.fail-alloc is set, which
renders it pretty useless as-is, and is not intuitive. Let's change
this so that the filure rate is preset to 1% when the option is set on
the command line. This allows to inject failures without having to edit
the configuration.
Willy Tarreau [Tue, 21 Mar 2023 08:15:13 +0000 (09:15 +0100)]
MINOR: dynbuf: set POOL_F_NO_FAIL on buffer allocation
b_alloc() is used to allocate a buffer. We can provoke fault injection
based on forced memory allocation failures using -dMfail on the command
line, but we know that the buffer_wait list is a bit weak and doesn't
always recover well. As such, submitting buffer allocation to such a
treatment seriously limits the usefulness of -dMfail which cannot really
be used for other purposes. Let's just disable it for buffers for now.
Willy Tarreau [Mon, 20 Mar 2023 18:53:14 +0000 (19:53 +0100)]
BUG/MINOR: stconn: fix sedesc memory leak on stream allocation failure
If we fail to allocate a new stream in sc_new_from_endp(), and the call
to sc_new() allocated the sedesc itself (which normally doesn't happen),
then it doesn't get released on the failure path. Let's explicitly
handle this case so that it's not overlooked and avoids some head
scratching sessions.
Willy Tarreau [Mon, 20 Mar 2023 18:45:41 +0000 (19:45 +0100)]
BUG/MEDIUM: stconn: don't set the type before allocation succeeds
There's an occasional crash that can be triggered in sc_detach_endp()
when calling conn->mux->detach() upon memory allocation error. The
problem in fact comes from sc_attach_mux(), which doesn't reset the
sc type flags upon tasklet allocation failure, leading to an attempt
at detaching an incompletely initialized stconn. Let's just attach
the sc after the tasklet allocation succeeds, not before.
Willy Tarreau [Mon, 20 Mar 2023 18:16:04 +0000 (19:16 +0100)]
BUG/MEDIUM: mux-h2: erase h2c->wait_event.tasklet on error path
On the allocation error path in h2_init() we may check if
h2c->wait_event.tasklet needs to be released but it has not yet been
zeroed. Let's do this before jumping to the freeing location.
This needs to be backported to all maintained versions.
Willy Tarreau [Mon, 20 Mar 2023 18:14:47 +0000 (19:14 +0100)]
BUG/MEDIUM: mux-h2: do not try to free an unallocated h2s->sd
In h2s_close() we may dereference h2s->sd to get the sc, but this
function may be called on allocation error paths, so we must check
for this specific condition. Let's also update the comment to make
it explicitly permitted.
Willy Tarreau [Mon, 20 Mar 2023 18:11:08 +0000 (19:11 +0100)]
BUG/MEDIUM: stream: do not try to free a failed stream-conn
In stream_free() if we fail to allocate s->scb() we go to the path where
we try to free it, and it doesn't like being called with a null at all.
It's easily reproducible with -dMfail,no-cache and "tune.fail-alloc 10"
in the global section.
BUG/MINOR: quic: Dysfunctional 01RTT packet number space probing
This bug arrived with this commit:
"MINOR: quic: implement qc_notify_send()".
The ->tx.pto_probe variable was no more set when qc_processt_timer() the timer
task for the connection responsible of detecting packet loss and probing upon
PTO expiration leading to interrupted stream transfers. This was revealed by
blackhole interop failed tests where one could see that qc_process_timer()
was wakeup without traces as follows in the log file:
"needs to probe 01RTT packet number space"
Must be backported to 2.7 and to 2.6 if the commit mentionned above
is backported to 2.6 in the meantime.
MINOR: quic: Stop stressing the acknowledgments process (RX ACK frames)
The ACK frame range of packets were handled from the largest to the smallest
packet number, leading to big number of ebtree insertions when the packet are
handled in the inverse way they are sent. This was detected a long time ago
but left in the code to stress our implementation. It is time to be more
efficient and process the packet so that to avoid useless ebtree insertions.
Modify qc_ackrng_pkts() responsible of handling the acknowledged packets from an
ACK frame range of acknowledged packets.
Willy Tarreau [Fri, 17 Mar 2023 15:13:05 +0000 (16:13 +0100)]
MINOR: ssl-sock: pass the CO_SFL_MSG_MORE info down the stack
Despite having replaced the SSL BIOs to use our own raw_sock layer, we
still didn't exploit the CO_SFL_MSG_MORE flag which is pretty useful to
avoid sending incomplete packets. It's particularly important for SSL
since the extra overhead almost guarantees that each send() will be
followed by an incomplete (and often odd-sided) segment.
We already have an xprt_st set of flags to pass info to the various
layers, so let's just add a new one, SSL_SOCK_SEND_MORE, that is set
or cleared during ssl_sock_from_buf() to transfer the knowledge of
CO_SFL_MSG_MORE. This way we can recover this information and pass
it to raw_sock.
This alone is sufficient to increase by ~5-10% the H2 bandwidth over
SSL when multiple streams are used in parallel.
Willy Tarreau [Fri, 17 Mar 2023 15:09:14 +0000 (16:09 +0100)]
MINOR: mux-h2: set CO_SFL_MSG_MORE when sending multiple buffers
Traces show that sendto() rarely has MSG_MORE on H2 despite sending
multiple buffers. The reason is that the loop iterating over the buffer
ring doesn't have this info and doesn't pass it down.
But now we know how many buffers are left to be sent, so we know whether
or not the current buffer is the last one. As such we can set this flag
for all buffers but the last one.
Willy Tarreau [Fri, 17 Mar 2023 11:30:38 +0000 (12:30 +0100)]
OPTIM: mux-h1: limit first read size to avoid wrapping
Before muxes were used, we used to refrain from reading past the
buffer's reserve. But with muxes which have their own buffer, this
rule was a bit forgotten, resulting in an extraneous read to be
performed just because the rx buffer cannot be entirely transferred
to the stream layer:
Here the server sends 16kB of payload after a headers block, the mux reads
16320 into the ibuf, and the stream layer consumes 15360 from the first
h1_rcv_buf(), which leaves 960 into the buffer and releases a few indexes.
The buffer cannot be realigned due to these remaining data, and a subsequent
read is made on 16 bytes, then again on 182 bytes.
By avoiding to read too much on the first call, we can avoid needlessly
filling this buffer:
This is much more efficient and uses less RAM since the first buffer that
was emptied can now be released.
Note that a further improvement (tested) consists in reading even less
(typically 1kB) so that most of the data are transferred in zero-copy, and
are not read until process_stream() is scheduled. This patch doesn't do that
for now so that it can be backported without any obscure impact.
Willy Tarreau [Fri, 17 Mar 2023 15:40:09 +0000 (16:40 +0100)]
BUG/MAJOR: qpack: fix possible read out of bounds in static table
CertiK Skyfall Team reported that passing an index greater than
QPACK_SHT_SIZE in a qpack instruction referencing a literal field
name with name reference or and indexed field line will cause a
read out of bounds that may crash the process, and confirmed that
this fix addresses the issue.
When sc-add-gpc() action was implemented in 5a72d03 ("MINOR:
stick-table: implement the sc-add-gpc() action"), its usage was
only documented for "http-request", but according to the code it
now applies everywhere sc-inc-gpc() is mentioned.
Adding the missing entries in the doc everywhere the action may
be used.
The issue was detected by the haproxy-controller bot and was reported
by Pratik Mohanty and Marko Juraga.
No backport needed, unless 5a72d03 ("MINOR: stick-table: implement the
sc-add-gpc() action") is being backported.
MINOR: stick-table: add sc-add-gpc() to http-after-response
sc-add-gpc() was implemented in 5a72d03 ("MINOR: stick-table:
implement the sc-add-gpc() action")
This new action was exposed everywhere sc-inc-gpc() is available,
except for http-after-response.
But there doesn't seem to be a technical constraint that prevents us from
exposing it in http-after-response.
It was probably overlooked, let's add it.
No backport needed, unless 5a72d03 ("MINOR: stick-table: implement the
sc-add-gpc() action") is being backported.
BUG/MINOR: quic: Missing STREAM frame data pointer updates
This patch follows this one which was not sufficient:
"BUG/MINOR: quic: Missing STREAM frame length updates"
Indeed, it is not sufficient to update the ->len and ->offset member
of a STREAM frame to move it forward. The data pointer must also be updated.
This is not done by the STREAM frame builder.
Willy Tarreau [Thu, 16 Mar 2023 16:30:30 +0000 (17:30 +0100)]
BUG/MINOR: mux-h2: set CO_SFL_STREAMER when sending lots of data
Emeric noticed that h2 bit-rate performance was always slightly lower
than h1 when the CPU is saturated. Strace showed that we were always
data in 2kB chunks, corresponding to the max_record size. What's
happening is that when this mechanism of dynamic record size was
introduced, the STREAMER flag at the stream level was relied upon.
Since all this was moved to the muxes, the flag has to be passed as
an argument to the snd_buf() function, but the mux h2 did not use it
despite a comment mentioning it, probably because before the multi-buf
it was not easy to figure the status of the buffer.
The solution here consists in checking if the mbuf is congested or not,
by checking if it has more than one buffer allocated. If so we set the
CO_SFL_STREAMER flag, otherwise we don't. This way moderate size
exchanges continue to be made over small chunks, but downloads will
be able to use the large ones.
While it could be backported to all supported versions, it would be
better to limit it to the last LTS, so let's do it for 2.7 and 2.6 only.
This patch requires previous commit "MINOR: buffer: add br_single() to
check if a buffer ring has more than one buf".
Willy Tarreau [Thu, 16 Mar 2023 15:47:44 +0000 (16:47 +0100)]
BUG/MEDIUM: mux-h2: only restart sending when mux buffer is decongested
During performance tests, Emeric faced a case where the wakeups of
sc_conn_io_cb() caused by h2_resume_each_sending_h2s() was multiplied
by 5-50 and a lot of CPU was being spent doing this for apparently no
reason.
The culprit is h2_send() not behaving well with congested buffers and
small SSL records. What happens when the output is congested is that
all buffers are full, and data are emitted in 2kB chunks, which are
sufficient to wake all streams up again to ask them to send data again,
something that will obviously only work for one of them at best, and
waste a lot of CPU in wakeups and memcpy() due to the small buffers.
When this happens, the performance can be divided by 2-2.5 on large
objects.
Here the chosen solution against this is to keep in mind that as long
as there are still at least two buffers in the ring after calling
xprt->snd_buf(), it means that the output is congested and there's
no point trying again, because these data will just be placed into
such buffers and will wait there. Instead we only mark the buffer
decongested once we're back to a single allocated buffer in the ring.
By doing so we preserve the ability to deal with large concurrent
bursts while not causing a thundering herd by waking all streams for
almost nothing.
This needs to be backported to 2.7 and 2.6. Other versions could
benefit from it as well but it's not strictly necessary, and we can
reconsider this option if some excess calls to sc_conn_io_cb() are
faced.
Note that this fix depends on this recent commit:
MINOR: buffer: add br_single() to check if a buffer ring has more than one buf
Willy Tarreau [Thu, 16 Mar 2023 16:30:04 +0000 (17:30 +0100)]
MINOR: buffer: add br_single() to check if a buffer ring has more than one buf
It's cheaper and cleaner than using br_count()==1 given that it just compares
two indexes, and that a ring having a single buffer is in a special case where
it is between empty and used up-to-1. In other words it's not congested.