BUG/MEDIUM: server/log: "mode log" after server keyword causes crash
In 9a74a6c ("MAJOR: log: introduce log backends"), a mistake was made:
it was assumed that the proxy mode was already known during server
keyword parsing in parse_server() function, but this is wrong.
Indeed, "mode log" can be declared late in the proxy section. Due to this,
a simple config like this will cause the process to crash:
|backend test
|
| server name 127.0.0.1:8080
| mode log
In order to fix this, we relax some checks in _srv_parse_init() and store
the address protocol from str2sa_range() in server struct, then we set-up
a postparsing function that is to be called after config parsing to
finish the server checks/initialization that depend on the proxy mode
to be known. We achieve this by checking the PR_CAP_LB capability from
the parent proxy to know if we're in such case where the effective proxy
mode is not yet known (it is assumed that other proxies which are implicit
ones don't provide this possibility and thus don't suffer from this
constraint).
Only then, if the capability is not found, we immediately perform the
server checks that depend on the proxy mode, else the check is postponed
and it will automatically be performed during postparsing thanks to the
REGISTER_POST_SERVER_CHECK() hook.
Note that we remove the SRV_PARSE_IN_LOG_BE flag because it was introduced
in the above commit and it is no longer relevant.
No backport needed unless 9a74a6c gets backported.
Willy Tarreau [Wed, 25 Oct 2023 09:43:01 +0000 (11:43 +0200)]
DEBUG: mux-h2/flags: fix list of h2c flags used by the flags decoder
The two recent commits below each added one flag to h2c but omitted to
update the __APPEND_FLAG macro used by dev/flags so they are not
properly decoded:
3dd963b35 ("BUG/MINOR: mux-h2: fix http-request and http-keep-alive timeouts again") 68d02e5fa ("BUG/MINOR: mux-h2: make up other blocked streams upon removal from list")
Amaury Denoyelle [Wed, 25 Oct 2023 08:10:14 +0000 (10:10 +0200)]
MINOR: backend: refactor insertion in avail conns tree
Define a new function srv_add_to_avail_list(). This function is used to
centralize connection insertion in available tree. It reuses a BUG_ON()
statement to ensure the connection is not present in the idle list.
Amaury Denoyelle [Tue, 24 Oct 2023 16:31:55 +0000 (18:31 +0200)]
BUG/MAJOR: backend: fix idle conn crash under low FD
Since the following commit, idle conns are stored in a list as secondary
storage to retrieve them in usage order : 5afcb686b93c3811bd859a331efd6a8341a61218
MAJOR: connection: purge idle conn by last usage
The list usage has been extended wherever connections lookup are done
both on idle and safe trees. This reduced the code size by replacing a
two tree loops by a single list loop.
LIST_ELEM() is used in this context to retrieve the first idle list
element from the server list head. However, macro usage was wrong due to
an extra '&' operator which returns an invalid connection reference.
This will most of the time caused a crash on conn_delete_from_tree() or
affiliated functions.
This bug only occurs if the FD pool is exhausted and some idle
connections are selected to be killed.
It can be reproduced using the following config and h2load command :
$ h2load -t 8 -c 800 -m 10 -n 800 "http://127.0.0.1:21080/?s=10k"
listen li
bind :21080 proto h2
server nginx 127.99.0.1:30080 proto h1
This bug has been introduced by the above commit. Thus no need to
backport this fix.
Note that LIST_ELEM() macro usage was slightly adjusted also in
srv_migrate_conns_to_remove(). The function used toremove_list instead
of idle_list connection list element. This is not a bug as they are
stored in the same union. However, the new code is clearer as it intends
to move connection from the idle_list only into the toremove_list
mt-list.
Amaury Denoyelle [Tue, 24 Oct 2023 16:31:28 +0000 (18:31 +0200)]
BUG/MINOR: backend: fix wrong BUG_ON for avail conn
Idle connections are both stored in an idle/safe tree and in an idle
list. The list is used as a secondary storage to be able to retrieve
them by usage order.
If a connection is moved into the available tree, it must not be present
in the idle list. A BUG_ON() was written to check this but was placed at
the wrong code section. Fix this by removing the misplaced one and write
new ones for avail_conns tree insertion and lookup.
The impact of this bug is minor as the misplaced BUG_ON() did not seem
to be triggered.
Tristan [Mon, 23 Oct 2023 12:35:40 +0000 (13:35 +0100)]
MINOR: lua: change tune.lua.log.stderr default from 'on' to 'auto'
After making it configurable in previous commit "MINOR: lua: Add flags
to configure logging behaviour", this patch changes the default value
of tune.lua.log.stderr from 'on' (unconditionally forward LUA logs to
stderr) to 'auto' (only forward LUA logs to stderr if logging via a
standard logger is disabled, or none is configured for the current context)
Since this is a change in behaviour, it shouldn't be backported
Tristan [Mon, 23 Oct 2023 12:07:39 +0000 (13:07 +0100)]
MINOR: lua: Add flags to configure logging behaviour
Until now, messages printed from LUA log functions were sent both to
the any logger configured for the current proxy, and additionally to
stderr (in most cases)
This introduces two flags to configure LUA log handling:
- tune.lua.log.loggers to use standard loggers or not
- tune.lua.log.stderr to use stderr, or not, or only conditionally
This addresses github feature request #2316
This can be backported to 2.8 as it doesn't change previous behaviour.
Willy Tarreau [Fri, 20 Oct 2023 19:36:47 +0000 (21:36 +0200)]
[RELEASE] Released version 2.9-dev8
Released version 2.9-dev8 with the following main changes :
- MINOR: ssl: add an explicit error when 'ciphersuites' are not supported
- BUILD: ssl: enable 'ciphersuites' for WolfSSL
- BUILD: ssl: add 'ssl_c_r_dn' fetch for WolfSSL
- BUILD: ssl: add 'secure_memcmp' converter for WolfSSL and awslc
- BUILD: ssl: enable keylog for awslc
- CLEANUP: ssl: remove compat functions for openssl < 1.0.0
- BUILD: ssl: enable keylog for WolfSSL
- REGTESTS: pki: add a pki for SSL tests
- REGTESTS: ssl: update common.pem with the new pki
- REGTESTS: ssl: disable ssl_dh.vtc for WolfSSL
- REGTESTS: wolfssl: temporarly disable some failing reg-tests
- CI: ssl: add wolfssl to build-ssl.sh
- CI: ssl: add git id support for wolfssl download
- CI: github: add a wolfssl entry to the CI
- CI: github: update wolfssl to git revision d83f2fa
- CI: github: add awslc 1.16.0 to the push CI
- BUG/MINOR: quic: Avoid crashing with unsupported cryptographic algos
- REORG: quic: cleanup traces definition
- BUG/MINOR: quic: reject packet with no frame
- BUG/MEDIUM: mux-quic: fix RESET_STREAM on send-only stream
- BUG/MINOR: mux-quic: support initial 0 max-stream-data
- BUG/MINOR: h3: strengthen host/authority header parsing
- CLEANUP: connection: drop an uneeded leftover cast
- BUG/MAJOR: connection: make sure to always remove a connection from the tree
- BUG/MINOR: quic: fix qc.cids access on quic-conn fail alloc
- BUG/MINOR: quic: fix free on quic-conn fail alloc
- BUG/MINOR: mux-quic: fix free on qcs-new fail alloc
- BUG/MEDIUM: quic-conn: free unsent frames on retransmit to prevent crash
- MEDIUM: tree-wide: logsrv struct becomes logger
- MEDIUM: log: introduce log target
- DOC: config: log <address> becomes log <target> in "log" related doc
- MEDIUM: sink/log: stop relying on AF_UNSPEC for rings
- MINOR: log: support explicit log target as argument in __do_send_log()
- MINOR: log: remove the logger dependency in do_send_log()
- MEDIUM: log/sink: simplify log header handling
- MEDIUM: sink: inherit from caller fmt in ring_write() when rings didn't set one
- MINOR: sink: add sink_new_from_srv() function
- MAJOR: log: introduce log backends
- MINOR: log/balance: support for the "sticky" lb algorithm
- MINOR: log/balance: support for the "random" lb algorithm
- MINOR: lbprm: support for the "none" hash-type function
- MINOR: lbprm: compute the hash avalanche in gen_hash()
- MINOR: sample: add sample_process_cnv() function
- MEDIUM: log/balance: support for the "hash" lb algorithm
- REGTEST: add a test for log-backend used as a log target
- MINOR: server: introduce "log-bufsize" kw
- BUG/MEDIUM: stconn: Report a send activity everytime data were sent
- BUG/MEDIUM: applet: Report a send activity everytime data were sent
- BUG/MINOR: mux-h1: Send a 400-bad-request on shutdown before the first request
- MINOR: support for http-response set-timeout
- BUG/MINOR: mux-h2: make up other blocked streams upon removal from list
- DEBUG: pool: store the memprof bin on alloc() and update it on free()
- BUG/MEDIUM: quic_conn: let the scheduler kill the task when needed
- CLEANUP: hlua: Remove dead-code on error path in hlua_socket_new()
- BUG/MEDIUM: mux-h1: do not forget TLR/EOT even when no data is sent
- BUG/MINOR: htpp-ana/stats: Specify that HTX redirect messages have a C-L header
- BUG/MEDIUM: mux-h2: Don't report an error on shutr if a shutw is pending
- MEDIUM: stconn/channel: Move pipes used for the splicing in the SE descriptors
- MINOR: stconn: Start to introduce mux-to-mux fast-forwarding notion
- MINOR: stconn: Extend iobuf to handle a buffer in addition to a pipe
- MINOR: connection: Add new mux callbacks to perform data fast-forwarding
- MINOR: stconn: Temporarily remove kernel splicing support
- MINOR: mux-pt: Temporarily remove splicing support
- MINOR: mux-h1: Temporarily remove splicing support
- MINOR: connection: Remove mux callbacks about splicing
- MEDIUM: stconn: Add mux-to-mux fast-forward support
- MINOR: mux-h1: Use HTX extra field only for responses with known length
- MEDIUM: mux-h1: Properly handle state transitions of chunked outgoing messages
- MEDIUM: raw-sock: Specifiy amount of data to send via snd_pipe callback
- MINOR: mux-h1: Add function to add size of a chunk to an outgoind message
- MEDIUM: mux-h1: Simplify zero-copy on sending path
- MEDIUM: mux-h1: Simplify payload formatting based on HTX blocks on sending path
- MEDIUM: mux-h1: Add fast-forwarding support
- MINOR: h2: Set the BODYLESS_RESP flag on the HTX start-line if necessary
- MEDIUM: mux-h2: Add consumer-side fast-forwarding support
- MEDIUM: channel: don't look at iobuf to report an empty channel
- MINOR: tree-wide: Only rely on co_data() to check channel emptyness
- REGTESTS: Reenable HTTP tests about splicing
- CLEAN: mux-h1: Remove useless __maybe_unused attribute on h1_make_chunk()
- MEDIUM: mux-pt: Add fast-forwarding support
- MINOR: global: Add an option to disable the zero-copy forwarding
- BUILD: mux-h1: Fix build without kernel splicing support
- REORG: stconn/muxes: Rename init step in fast-forwarding
- MINOR: dgram: allow to set rcv/sndbuf for dgram sockets as well
- BUG/MINOR: mux-h2: fix http-request and http-keep-alive timeouts again
- BUG/MINOR: trace: fix trace parser error reporting
- BUG/MEDIUM: peers: Be sure to always refresh recconnect timer in sync task
- BUG/MEDIUM: peers: Fix synchro for huge number of tables
- MINOR: cfgparse: forbid mixing reverse and standard listeners
- MINOR: listener: add nbconn kw for reverse connect
- MINOR: server: convert @reverse to rev@ standard format
- MINOR: cfgparse: rename "rev@" prefix to "rhttp@"
- REGTESTS: remove maxconn from rhttp bind line
- MINOR: listener: forbid most keywords for reverse HTTP bind
- MINOR: sample: Added support for Arrays in sample_conv_json_query in sample.c
- MINOR: mux-h2/traces: explicitly show the error/refused stream states
- MINOR: mux-h2/traces: clarify the "rejected H2 request" event
- BUG/MINOR: mux-h2: commit the current stream ID even on reject
- BUG/MINOR: mux-h2: update tracked counters with req cnt/req err
Willy Tarreau [Fri, 20 Oct 2023 16:38:34 +0000 (18:38 +0200)]
BUG/MINOR: mux-h2: update tracked counters with req cnt/req err
Originally H2 would transfer everything to H1 and parsing errors were
handled there, so that if there was a track-sc rule in effect, the
counters would be updated as well. As we started to add more and more
HTTP-compliance checks at the H2 layer, then switched to HTX, we
progressively lost this ability. It's a bit annoying because it means
we will not maintain accurate error counters for a given source, for
example.
This patch adds the calls to session_inc_http_req_ctr() and
session_inc_http_err_ctr() when needed (i.e. when failing to parse
an HTTP request since all other cases are handled by the stream),
just like mux-h1 does. The same should be done for mux-h3 by the
way.
This can be backported to recent stable versions. It's not exactly a
bug, rather a missing feature in that we had never updated this counter
for H2 till now, but it does make sense to do it especially based on
what the doc says about its usage.
Willy Tarreau [Fri, 20 Oct 2023 15:51:12 +0000 (17:51 +0200)]
BUG/MINOR: mux-h2: commit the current stream ID even on reject
The H2 spec says that a HEADERS frame turns an idle stream to the open
state, and it may then turn to half-closed(remote) on ES, then to close,
all at once, if we respond with RST (e.g. on error). Due to the fact that
we process a complete frame at once since h2_dec_hdrs() may reassemble
CONTINUATION frames until everything is complete, the state was only
committed after the frame was completley valid (otherwise multiple passes
could result in subsequent frames being rejected as the stream ID would
be equal to the highest one).
However this is not correct because it means that a client may retry on
the same ID as a previously failed one, which technically is forbidden
(for example the client couldn't know which of them a WINDOW_UPDATE or
RST_STREAM frame is for).
In practice, due to the error paths, this would only be possible when
failing to decode HPACK while leaving the HPACK stream intact, thus
when the valid decoded HPACK stream cannot be turned into a valid HTTP
representation, e.g. when the resulting headers are too large for example.
The solution to avoid this consists in committing the stream ID on this
error path as well. h2spec continues to be happy.
Thanks to Annika Wickert and Tim Windelschmidt for reporting this issue.
This fix must be backported to all stable versions.
Willy Tarreau [Fri, 20 Oct 2023 15:47:33 +0000 (17:47 +0200)]
MINOR: mux-h2/traces: clarify the "rejected H2 request" event
In h2_frt_handle_headers() all failures lead to a generic message saying
"rejected H2 request". It's quite inexpressive while there are a few
distinct tests that are made before jumping there:
Let's emit the traces from these call points instead so that we get more
info about what happened. Since these are user-level messages, we take
care of keeping them aligned as much as possible.
Willy Tarreau [Fri, 20 Oct 2023 15:32:13 +0000 (17:32 +0200)]
MINOR: mux-h2/traces: explicitly show the error/refused stream states
Sometimes it's unclear whether a stream is still open or closed when
certain traces are emitted, for example when the stream was refused,
because the reported pointer and ID in fact correspond to the refused
stream. And for closed streams, no pointer/name is printed, leaving
some confusion about the state. This patch makes the situation easier
to analyse by explicitly reporting "h2s=CLO" on closed/error/refused
streams so that we don't waste time comparing pointers and we instantly
know the stream is closed. Now instead of emitting:
[03|h2|5|mux_h2.c:2874] h2c_frt_handle_headers(): leaving on error : h2c=0x7fdfa8026820(F,FRE) dsi=201 h2s=0x9fdb60(0,CLO)
It will emit:
[03|h2|5|mux_h2.c:2874] h2c_frt_handle_headers(): leaving on error : h2c=0x7fdfa8026820(F,FRE) dsi=201 h2s=CLO
MINOR: sample: Added support for Arrays in sample_conv_json_query in sample.c
Method now returns the content of Json Arrays, if it is specified in
Json Path as String. The start and end character is a square bracket. Any
complex object in the array is returned as Json, so that you might get Arrays
of Array or objects. Only recommended for Arrays of simple types (e.g.,
String or int) which will be returned as CSV String. Also updated
documentation and fixed issue with parenthesis and other changes from
comments.
This patch was discussed in issue #2281.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
Amaury Denoyelle [Fri, 20 Oct 2023 14:49:03 +0000 (16:49 +0200)]
MINOR: listener: forbid most keywords for reverse HTTP bind
Reverse HTTP bind is very specific in that in rely on a server to
initiate connection. All connection settings are defined on the server
line and ignored from the bind line.
Before this patch, most of keywords were silently ignored. This could
result in a configuration from doing unexpected things from the user
point of view. To improve this situation, add a new 'rhttp_ok' field in
bind_kw structure. If not set, the keyword is forbidden on a reverse
bind line and will cause a fatal config error.
For the moment, only the following keywords are usable with reverse bind
'id', 'name' and 'nbconn'.
This change is safe as it's already forbidden to mix reverse and
standard addresses on the same bind line.
Amaury Denoyelle [Fri, 20 Oct 2023 15:26:24 +0000 (17:26 +0200)]
REGTESTS: remove maxconn from rhttp bind line
The maxconn keyword is not used anymore for reverse HTTP bind. It has
been replaced recently by the new keyword nbconn. As it's default value
is 1, it can be safely removed from the regtest without affecting its
behavior.
Amaury Denoyelle [Thu, 19 Oct 2023 07:25:20 +0000 (09:25 +0200)]
MINOR: listener: add nbconn kw for reverse connect
Previously, maxconn keyword was reused for a specific usage on reverse
HTTP binds to specify the number of active connect to proceed. To avoid
confusion, introduce a new dedicated keyword 'nbconn' which is specific
to reverse HTTP bind.
This new keyword is forbidden for non-reverse listener. A fatal error is
emitted during config parsing if this rule is not respected. It's safe
because it's also forbidden to mix standard and reverse addresses on the
same bind line.
Internally, nbconn value will be reassigned to 'maxconn' member of
bind_conf structure. This ensures that listener layer will automatically
reenable the preconnect task each time a connection is closed.
Amaury Denoyelle [Thu, 19 Oct 2023 10:05:31 +0000 (12:05 +0200)]
MINOR: cfgparse: forbid mixing reverse and standard listeners
Reverse HTTP listeners are very specific and share only a very limited
subset of keywords with other listeners. As such, it is probable
meaningless to mix standard and reverse addresses on the same bind line.
This patch emits a fatal error during configuration parsing if this is
the case.
BUG/MEDIUM: peers: Fix synchro for huge number of tables
The number of updates sent at once was limited to not loop too long to emit
updates when the buffer size is huge or when the number of sync tables is
huge. The limit can be configured and is set to 200 by default. However,
this fix introduced a bug. It is impossible to syncrhonize two peers if the
number of tables is higher than this limit. Thus by default, it is not
possible to sync two peers if there are more than 200 tables to sync.
Technically speacking, a teaching process is finished if we loop on all tables
with no new update messages sent. Because we are limited at each call, the loop
is splitted on several calls. However the restart point for the next loop is
always the last table for which we emitted an update message. Thus with more
tables than the limit, the loop never reachs the end point.
Worse, in conjunction with the bug fixed by "BUG/MEDIUM: peers: Be sure to
always refresh recconnect timer in sync task", it is possible to trigger the
watchdog because the applets may be woken up in loop and leave requesting
more room while its buffer is empty.
To fix the issue, restart conditions for a teaching loop were changed. If
the teach process is interrupted, we now save the restart point, called
stop_local_table. It is the last evaluated table on the previous loop. This
restart point is reset when the teach process is finished.
In additionn, the updates_sent variable in peer_send_msgs() was renamed to
updates to avoid ambiguities. Indeed, the variable is incremented, whether
messages were sent or not.
BUG/MEDIUM: peers: Be sure to always refresh recconnect timer in sync task
A sync task used to manage reconnect, sessions creation or shutdown and data
synchronization is responsible to refresh reconnect and heartbeat timers for
each remote peers and trigger applets wakeup. These timers are used to
refresh the sync task timeer itself. Thus it is important to take care to
always properly refresh them.
However, when there are some data to push, the reconnect timer is not
checked. It may be expired and not refreshed. In this case, an expired timer
may be used to the sync task, leading to a storm of wakeups. The sync task
is woken up in loop because its timer is in the past, waking up Peer applets
at each time.
To fix the issue, the peer's reconnect timer is now refresh to the default
reconnect timeout, if necessary, when there are some data to push.
This patch must be backported to all stable versions.
Since traces were adapted to support being declared in the global section
in 2.7 with commit c11f1cdf4 ("MINOR: trace: split the CLI "trace" parser
in CLI vs statement"), the method used to return the error message was
unreliable. For example an invalid sink name in the global section would
produce:
[ALERT] (26685) : config : parsing [test-trace.cfg:51] : 'trace': No such sink
[ALERT] (26685) : config : parsing [test-trace.cfg:51] : (null)
[ALERT] (26685) : config : Error(s) found in configuration file : test-trace.cfg
[ALERT] (26685) : config : Fatal errors found in configuration.
The reason is that the trace is emitted manually using ha_error() in
cfg_parse_trace() and -1 is returned without setting the message, and
the caller also prints the empty message. That's quite awkward given
that the API originally comes from the CLI which does support dynamic
strings and that config keywords do as well.
This commit modifies both cli_parse_trace() and cfg_parse_trace() to
return a dynamically allocated message instead, and adapts the central
function trace_parse_statement() to do the same, replacing a few direct
assignments with strdup() or memprintf(). This way the alert is no
longer emitted by the parser function, it just passes the message to
the caller.
A few of the static messages switching to memprintf() also took this
opportunity to report the faulty word:
[ALERT] (26772) : config : parsing [test-trace.cfg:51] : No such trace sink 'stduot'
[ALERT] (26772) : config : Error(s) found in configuration file : test-trace.cfg
[ALERT] (26772) : config : Fatal errors found in configuration.
Willy Tarreau [Wed, 18 Oct 2023 09:39:43 +0000 (11:39 +0200)]
BUG/MINOR: mux-h2: fix http-request and http-keep-alive timeouts again
Stefan Behte reported that since commit f279a2f14 ("BUG/MINOR: mux-h2:
refresh the idle_timer when the mux is empty"), the http-request and
http-keep-alive timeouts don't work anymore on H2. Before this patch,
and since 3e448b9b64 ("BUG/MEDIUM: mux-h2: make sure control frames do
not refresh the idle timeout"), they would only be refreshed after stream
frames were sent (HEADERS or DATA) but the patch above that adds more
refresh points broke these so they don't expire anymore as long as
there's some activity.
We cannot just revert the fix since it also addressed an isse by which
sometimes the timeout would trigger too early and provoque truncated
responses. The right approach here is in fact to only use refresh the
idle timer when the mux buffer was flushed from any such stream frames.
In order to achieve this, we're now setting a flag on the connection
whenever we write a stream frame, and we consider that flag when deciding
to refresh the buffer after it's emptied. This way we'll only clear that
flag once the buffer is empty and there were stream data in it, not if
there were no such stream data. In theory it remains possible to leave
the flag on if some control data is appended after the buffer and it's
never cleared, but in practice it's not a problem as a buffer will always
get sent in large blocks when the window opens. Even a large buffer should
be emptied once in a while as control frames will not fill it as much as
data frames could.
Given the patch above was backported as far as 2.6, this patch should
also be backported as far as 2.6.
Willy Tarreau [Tue, 3 Oct 2023 13:33:46 +0000 (15:33 +0200)]
MINOR: dgram: allow to set rcv/sndbuf for dgram sockets as well
tune.rcvbuf.client and tune.rcvbuf.server are not suitable for shared
dgram sockets because they're per connection so their units are not the
same. However, QUIC's listener and log servers are not connected and
take per-thread or per-process traffic where a socket log buffer might
be too small, causing undesirable packet losses and retransmits in the
case of QUIC. This essentially manifests in listener mode with new
connections taking a lot of time to set up under heavy traffic due to
the small queues causing delays. Let's add a few new settings allowing
to set these shared socket sizes on the frontend and backend side (which
reminds that these are per-front/back and not per client/server hence
not per connection).
REORG: stconn/muxes: Rename init step in fast-forwarding
Instead of speaking of an initialisation stage for each data
fast-forwarding, we now use the negociate term. Thus init_ff/init_fastfwd
functions were renamed nego_ff/nego_fastfwd.
BUILD: mux-h1: Fix build without kernel splicing support
Data fast-forwarding does not build without the kernel splicing support
because counters about splicing don't exist. To make the code more readable,
all code about splicing is disabled if kernel splicing is not supported.
MINOR: global: Add an option to disable the zero-copy forwarding
The zero-copy forwarding or the mux-to-mux forwarding is a way to
fast-forward data without using the channels buffers. Data are transferred
from a mux to the other one. The kernel splicing is an optimization of the
zero-copy forwarding. But it can also use normal buffers (but not channels
ones). This way, it could be possible to fast-forward data with muxes not
supporting the kernel splicing (H2 and H3 muxes) but also with applets.
However, this mode can introduce regressions or bugs in future (just like
the kernel splicing). Thus, It could be usefull to disable this optim. To do
so, in configuration, the global tune settting
'tune.disable-zero-copy-forwarding' may be set in a global section or the
'-dZ' command line parameter may be used to start HAProxy. Of course, this
also disables the kernel splicing.
The PT multiplexer now implements callbacks function to produce and consume
fast-forwarded data. Only splicing is support because the mux-pt does not
use its own buffers.
MINOR: tree-wide: Only rely on co_data() to check channel emptyness
Because channel_is_empty() function does now only check the channel's
buffer, we can remove it and rely on co_data() instead. Of course, all tests
must be inverted.
MEDIUM: channel: don't look at iobuf to report an empty channel
It is important to split channels and I/O buffers. When data are pushed in
an I/O buffer, we consider them as forwarded. The channel never sees
them. Fast-forwarded data are now handled in the SE only.
MEDIUM: mux-h2: Add consumer-side fast-forwarding support
The H2 multiplexer now implements callbacks to consume fast-forwarded
data. It is the most usful case: A H2 client getting data from a H1
server. It is also the easiest case to implement. The producer side is
trickier because of multiplexing. It is not obvious this case would be
improved with data fast-forwarding.
MINOR: h2: Set the BODYLESS_RESP flag on the HTX start-line if necessary
When message headers are parsed and an HTX start-line is created, if we
detect the response must not have any payload, a specific flag must be set
on the HTX start-line. It happens for instance for response to HEAD
requests. This flag is useb by the multiplexers to know response payload, if
any, must be silently skipped.
This was not performed when h2 HEADERS frames were decoded. This HTX flag
was specifically added to fix a bug when the splicing is inuse. Thus the H2
multiplexer was not concerned. Because the mux-to-mux fast-forwarding will
be introduced, it is important handle this flag in the H2 multiplexer too.
MEDIUM: mux-h1: Simplify payload formatting based on HTX blocks on sending path
Just like for the zero-copy, this patch tries to simplify the code
responsible to format the message payload before sending it. But here, we
take care to simplify the loop on the HTX blocks. The result should be
less errorrpone.
MEDIUM: mux-h1: Simplify zero-copy on sending path
In h1_make_data(), the function responsible to format the message payload
before sending it, the code dealing with zero-copy was slighly simplified
(at least for me :).
There is no real change but there is a better split between messages with a
content-length and cunked messages.
MINOR: mux-h1: Add function to add size of a chunk to an outgoind message
This function should be used to send the chunk size, before appending the
chunk payload. It also takes care to add a CRLF to finish a previous chunk,
if necessary. This function will be used to fix the splicing for re-chunk
responses with an unknown length.
MEDIUM: raw-sock: Specifiy amount of data to send via snd_pipe callback
When data were sent using the kernel splicing, we tried to send all data
with no restriction. Most of time it is valid. However, because the payload
representation may differ between the producer and the consumer, it is
important to be able to specify how must data to send via the splicing.
Of course, for performance reason, it is important to maximize amount of
data send via splicing at each call. However, on edge-cases, this now can be
limited.
MEDIUM: mux-h1: Properly handle state transitions of chunked outgoing messages
On the sending path, there are 3 states for chunked payload in H1:
* H1_MSG_CHUNK_SIZE: the chunk size must be emitted
* H1_MSH_CHUNK_CRLF: The end of the chunk must be emitted
* H1_MSG_DATA: Chunked data must be emitted
However, some shortcuts were used on the sending path to avoid some
transitions. Especially, outgoing messages were never switched in
H1_MSG_CHUNK_SIZE state.
However, it will be necessary to properly handle all transitions on the payload
to implement mux-to-mux forwarding, to be sure to always known when the chunk
size or the end of the chunk must be emitted.
MINOR: mux-h1: Use HTX extra field only for responses with known length
For now, it is not an issue, but it is safer to explicitly ignore HTX extra
field for responses with unknown length. This will be mandatory to future
fixes, to be able to re-chunk responses with an unknown length..
MEDIUM: stconn: Add mux-to-mux fast-forward support
Now the kernel splicing support was removed, we can add mux-to-mux
fast-forward support. Of course, the splicing support will be reintroduced
in the muxes themselves but this will be transparent.
Changes are mainly located into sc_conn_recv() and sc_conn_send().
MINOR: connection: Remove mux callbacks about splicing
The kernel splicing support was totally remove waiting for the mux-to-mux
fast-forward implementation. So corresponding mux callbacks can be removed
now.
MINOR: mux-h1: Temporarily remove splicing support
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the H1 multiplexer. It will be replaced by the mux-to-mux data
fast-forwarding.
MINOR: mux-pt: Temporarily remove splicing support
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the passthough multiplexer. It will be replaced by the
mux-to-mux data fast-forwarding.
MINOR: stconn: Temporarily remove kernel splicing support
mux-to-mux fast-forwarding will be added. To avoid mix with the splicing and
simplify the commits, the kernel splicing support is removed from the
stconn. CF_KERN_SPLICING flag is removed and the support is no longer tested
in process_stream().
In the stconn part, rcv_pipe() callback function is no longer called.
Reg-tests scripts testing the kernel splicing are temporarly marked as
broken.
MINOR: connection: Add new mux callbacks to perform data fast-forwarding
To perform the mux-to-mux data fast-forwarding, 4 new callbacks were added
into the mux_ops structure. 2 callbacks will be used from the stconn for
fast-forward data. The 2 other callbacks will be used by the endpoint to
request an iobuf to the opposite endpoint.
* fastfwd() callback function is used by a producer to forward data
* resume_fastfwd() callback function is used by a consumer if some data are
blocked in the iobuf, to resume the data forwarding.
* init_fastfwd() must be used by an endpoint (the producer one), inside the
fastfwd() callback to request an iobuf to the opposite side (the consumer
one).
* done_fastfwd() must be used by an endpoint (the producer one) at the end
of fastfwd() to notify the opposite endpoint (the consumer one) if data
were forwarded or not.
This API is still under development, so it may evolved. Especially when the
fast-forward will be extended to applets.
2 helper functions were also added into the SE api to wrap init_fastfwd()
and done_fastfwd() callback function of the underlying endpoint.
For now, this API is unsed and not implemented at all in muxes.
MINOR: stconn: Extend iobuf to handle a buffer in addition to a pipe
It is unused for now, but the iobuf structure now owns a pointer to a
buffer. This buffer will be used to perform mux-to-mux fast-forwarding when
splicing is not supported or unusable. This pointer should be filled by an
endpoint to let the opposite one forward data.
Extra fields, in addition to the buffer, are mandatory because the buffer
may already contains some data. the ".offset" field may be used may be used
as the position to start to copy data. Finally, the amount of data copied in
this buffer must be saved in ".data" field.
Some flags are also added to prepare next changes. And helper stconn
fnuctions are updated to also count data in the buffer. For a first
implementation, it is not planned to handle data in the buffer and in the
pipe in same time. But it will be possible to do so.
MINOR: stconn: Start to introduce mux-to-mux fast-forwarding notion
Instead of talking about kernel splicing at stconn/sedesc level, we now try
to talk about mux-to-mux fast-forwarding. To do so, 2 functions were added
to know if there are fast-forwarded data and to retrieve this amount of
data. Of course, for now, there is only data in a pipe.
In addition, some flags were renamed to reflect this notion. Note the
channel's documentation was not updated yet.
MEDIUM: stconn/channel: Move pipes used for the splicing in the SE descriptors
The pipes used to put data when the kernel splicing is in used are moved in
the SE descriptors. For now, it is just a simple remplacement but there is a
major difference with the pipes in the channel. The data are pushed in the
consumer's pipe while it was pushed in the producer's pipe. So it means the
request data are now pushed in the pipe of the backend SE descriptor and
response data are pushed in the pipe of the frontend SE descriptor.
The idea is to hide the pipe from the channel/SC side and to be able to
handle fast-forwading in pipe but also in buffer. To do so, the pipe is
inside a new entity, called iobuf. This entity will be extended.
BUG/MEDIUM: mux-h2: Don't report an error on shutr if a shutw is pending
If a shutw is blocked because the mux is full or busy, we must defer the
shutr. In this case, the H2 stream is not in H2_SS_CLOSED state because the
shutw is also deferred. If the shutr is performed, this will lead to a
error.
Concretly, when the mux is unblocked, a RST_STREAM is sent while in some
cases, an empty DATA frame with ES flag set could be sent.
This patch should be backported to all stable versions.
BUG/MINOR: htpp-ana/stats: Specify that HTX redirect messages have a C-L header
Redirect responses sent during the HTTP analysis have no payload. However
there is still a "Content-Length" header. It is important to set the
corresponding flag on the HTX start-line to be sure to preserve this header
when the reponse is sent to the client. The same is true with the stats
applet, when it returns a redirect responses.
It is especially important because we no ignore in-fly modifications of
"Content-Length" or "Transfer-Encoding" headers without updating the HTX
start-line flags.
This patch may be backported to all stable versions but it is probably
useless because only the 2.9-dev is affected by the bug.
BUG/MEDIUM: mux-h1: do not forget TLR/EOT even when no data is sent
Since commit 723c73f8a ("MEDIUM: mux-h1: Split h1_process_mux() to make code
more readable"), outgoing H1 chunked messages with no data at all get
delayed by 200ms. It is due to the fact that we end processing too early and
we don't have the opportunity to process trailers in this case.
This fix addresses it by verifying if it's required to emit EOT or trailers,
if any, when retruning from h1_make_data()
CLEANUP: hlua: Remove dead-code on error path in hlua_socket_new()
Since last fixes about the lua cosocket, the appctx is no longer initialized
in hlua_socket_new(). The code to deal with error at this stage can be
removed.
Willy Tarreau [Tue, 17 Oct 2023 15:00:10 +0000 (17:00 +0200)]
BUG/MEDIUM: quic_conn: let the scheduler kill the task when needed
The two timer handlers qc_process_timer() and qc_idle_timer_task() would
inadvertently return NULL when they don't want to be requeued, instead
of just returning the task itself. The effect of returning NULL for the
scheduler is that it considers the task as freed, so it must not touch
it anymore. As such, the TASK_F_RUNNING flag is never removed from these
tasks, and when quic_conn_release() later tries to release these tasks
using task_destroy(), the latter sees the RUNNING flag and just sets
->process to NULL, hoping that the scheduler will kill them on return,
but there's no longer being executed so this never happens and they are
leaked.
Interestingly, this doesn't seem to happen as much when multi-queue is
set to off, but it's likely because the tasks are being replaced and the
first ones have already been woken up and leaked, while the latter might
only trigger on a timeout or timer renewal.
This should address github issue #2310. Thanks to @hpn0t0ad for the
numerous traces that helped understand this sequence.
This must be backported to 2.7 at least, and adapted for 2.6
(qc_idle_timer_task must return t there).
Willy Tarreau [Tue, 17 Oct 2023 09:13:00 +0000 (11:13 +0200)]
DEBUG: pool: store the memprof bin on alloc() and update it on free()
When looking at "show pools", it's often difficult to know which alloc()
corresponds to which free() since it's not often 1:1. But sometimes we
have all elements available to maintain a link between alloc and free.
Indeed, when the caller is recorded in the allocated area, we can store
the pointer to the just created bin instead of the caller address itself,
since the caller address is already in the memprof bin. By doing so, we
permit the pool_free() call to locate the allocator bin and update its
free count when caller tracing is enabled. This for example allows to
produce outputs like this on "show profiling" and a process started with
-dMcaller:
Willy Tarreau [Tue, 17 Oct 2023 06:25:19 +0000 (08:25 +0200)]
BUG/MINOR: mux-h2: make up other blocked streams upon removal from list
An interesting issue was met when testing the mux-to-mux forwarding code.
In order to preserve fairness, in h2_snd_buf() if other streams are waiting
in send_list or fctl_list, the stream that is attempting to send also goes
to its list, and will be woken up by h2_process_mux() or h2_send() when
some space is released. But on rare occasions, there are only a few (or
even a single) streams waiting in this list, and these streams are just
quickly removed because of a timeout or a quick h2_detach() that calls
h2s_destroy(). In this case there's no even to wake up the other waiting
stream in its list, and this will possibly resume processing after some
client WINDOW_UPDATE frames or even new streams, so usually it doesn't
last too long and it not much noticeable, reason why it was left that
long. In addition, measures have shown that in heavy network-bound
benchmark, this exact situation happens on less than 1% of the streams
(reached 4% with mux-mux).
The fix here consists in replacing these LIST_DEL_INIT() calls on
h2s->list with a function call that checks if other streams were queued
to the send_list recently, and if so, which also tries to resume them
by calling h2_resume_each_sending_h2s(). The detection of late additions
is made via a new flag on the connection, H2_CF_WAIT_INLIST, which is set
when a stream is queued due to other streams being present, and which is
cleared when this is function is called.
It is particularly difficult to reproduce this case which is particularly
timing-dependent, but in a constrained environment, a test involving 32
conns of 20 streams each, all downloading a 10 MB object previously
showed a limitation of 17 Gbps with lots of idle CPU time, and now
filled the cable at 25 Gbps.
This should be backported to all versions where it applies.
BUG/MINOR: mux-h1: Send a 400-bad-request on shutdown before the first request
Except if we must silently ignore empty connections by enabling
http-ignore-probes or dontlognull options, when a client connection is
closed before the first request, a 400-bad-request response must be sent
with the corresponding log message. However, that is broken since the commit fc473a6453 ("MEDIUM: mux-h1: Rely on the H1C to deal with shutdown for
reads").
The bug is subtle. Parsing errors are no longer reported on connection errors
before the first request while it should be.
This patch must be backported where the above commit is (as far as 2.7).
BUG/MEDIUM: applet: Report a send activity everytime data were sent
In the same way than for stream-connectors (see "BUG/MEDIUM: stconn: Report
a send activity everytime data were sent" for details), we now report a send
activity everytime something was consumed by an applet, even if some output
data remains blocked into the channel's buffer.
BUG/MEDIUM: stconn: Report a send activity everytime data were sent
When read/write timeouts were refactored in 2.8, we decided to change when a
send activity had to be reported. Before, everytime some data were sent a
send activity were reported. At this time, the channel's wex timer were
updated. During the refactoring, we decided to limit send activity to sends
that ampty te channel's buffer, consuming all outgoing data. Idea behind
this change was to protect haproxy against clients consumming data very
slowly.
However, it is too strict. Some congested muxes but still active can hit the
client or the server timeout. It seems a bit unfair. It is especially
visible with QUIC/H3 but it is probably also possible with H2 if the window
size is small.
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
REGTEST: add a test for log-backend used as a log target
This regtest declares and uses 3 log backends, one of which has TCP syslog
servers declared in it and other ones UDP syslog servers.
Some tests aims at testing log distribution reliability by leveraging the
log-balance hash algorithm with a key extracted from the request URL, and
the dummy vtest syslog servers ensure that messages are sent to the
correct endpoint. Overall this regtest covers essential parts of the log
message distribution and log-balancing logic involved with log backends.
It also leverages the log-forward section to perform the TCP->UDP
translation required to test UDP endpoints since vtest syslog servers
work in UDP mode.
Finally, we have some tests to ensure that the server queuing/dequeuing
and failover (backup) logics work properly.
MEDIUM: log/balance: support for the "hash" lb algorithm
hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.
The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
split sample_process() in 2 parts in order to be able to only process
the converter part of a sample expression from an existing input sample
struct passed as parameter.
MINOR: lbprm: compute the hash avalanche in gen_hash()
Instead of systematically computing the avalanche hash right after the
gen_hash() call, do it inside the gen_hash() function directly to ensure
avalanche setting is always considered.
MINOR: lbprm: support for the "none" hash-type function
Allow the use of the "none" hash-type function so that the key resulting
from the sample expression is directly used as the hash.
This can be useful to do the hashing manually using available hashing
converters, or even custom ones, and then inform haproxy that it can
directly rely on the sample expression result which is explictly handled
as an integer in this case.
MINOR: log/balance: support for the "random" lb algorithm
In this patch we add basic support for the random algorithm:
random algorithm picks a random server using the result of the
statistical_prng() function as if it was a hash key to then compute the
related server ID.
There is no support for the <draw> parameter (which is implemented for
tcp/http load-balancing), because we don't have the required metrics to
evaluate server's load in log backends for the moment. Plus it would add
more complexity to the __do_send_log_backend() function so we'll keep it
this way for now but this might be needed in the future.
MINOR: log/balance: support for the "sticky" lb algorithm
sticky algorithm always tries to send log messages to the first server in
the farm. The server will stay in front during queue and dequeue
operations (no other server can steal its place), unless it becomes
unavailable, in which case it will be replaced by another server from
the tree.
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
MEDIUM: sink: inherit from caller fmt in ring_write() when rings didn't set one
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.
This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)
In ring_write(), we check if the sink->fmt is specified:
- defined: we use it since it is the most precise format
(ie: for named rings)
- undefined: then we fallback to the format from the logger
With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
MEDIUM: sink/log: stop relying on AF_UNSPEC for rings
Since a5b325f92 ("MINOR: protocol: add a real family for existing FDs"),
we don't rely anymore on AF_UNSPEC for buffer rings in do_send_log.
But we kept it as a parsing hint to differentiate between implicit and
named rings during ring buffer postparsing.
However it is still a bit confusing and forces us to systematically rely
on target->addr, even for named buffer rings where it doesn't make much
sense anymore.
Now that target->addr was made a pointer in a recent commit, we can
choose not to initialize it when not needed (i.e.: named rings) and use
this as a hint to distinguish implicit rings during init since they rely
on the addr struct to temporarily store the ring's address until the ring
is actually created during postparsing step.
DOC: config: log <address> becomes log <target> in "log" related doc
This is a follow up of the previous commit to emphasize that "log"
directive allows to provide a log target which may directly be a server
address but may also be a log transport facility such as rings. Thus we
use the term "target" instead of "address" to make it more generic.
log targets were immediately embedded in logger struct (previously
named logsrv) and could not be used outside of this context.
In this patch, we're introducing log_target type with the associated
helper functions so that it becomes possible to declare and use log
targets outside of loggers scope.
When 'log' directive was implemented, the internal representation was
named 'struct logsrv', because the 'log' directive would directly point
to the log target, which used to be a (UDP) log server exclusively at
that time, hence the name.
But things have become more complex, since today 'log' directive can point
to ring targets (implicit, or named) for example.
Indeed, a 'log' directive does no longer reference the "final" server to
which the log will be sent, but instead it describes which log API and
parameters to use for transporting the log messages to the proper log
destination.
So now the term 'logsrv' is rather confusing and prevents us from
introducing a new level of abstraction because they would be mixed
with logsrv.
So in order to better designate this 'log' directive, and make it more
generic, we chose the word 'logger' which now replaces logsrv everywhere
it was used in the code (including related comments).
This is internal rewording, so no functional change should be expected
on user-side.
Amaury Denoyelle [Thu, 12 Oct 2023 16:15:01 +0000 (18:15 +0200)]
BUG/MEDIUM: quic-conn: free unsent frames on retransmit to prevent crash
Since the following patch :
commit 33c49cec987c1dcd42d216c6d075fb8260058b16
MINOR: quic: Make qc_dgrams_retransmit() return a status.
retransmission process is interrupted as soon as a fatal send error has
been encounted. However, this may leave frames in local list. This cause
several issues : a memory leak and a potential crash.
The crash happens because leaked frames are duplicated of an origin
frame via qc_dup_pkt_frms(). If an ACK arrives later for the origin
frame, all duplicated frames are also freed. During qc_frm_free(),
LIST_DEL_INIT() operation is invalid as it still references the local
list used inside qc_dgrams_retransmit().
This bug was reproduced using the following injection from another
machine :
$ h2load --npn-list h3 -t 8 -c 10000 -m 1 -n 2000000000 \
https://<host>:<port>/?s=4m
Haproxy was compiled using ASAN. The crash resulted in the following
trace :
==332748==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff82bf9d78 at pc 0x556facd3b95a bp 0x7fff82bf8b20 sp 0x7fff82bf8b10
WRITE of size 8 at 0x7fff82bf9d78 thread T0
#0 0x556facd3b959 in qc_frm_free include/haproxy/quic_frame.h:273
#1 0x556facd59501 in qc_release_frm src/quic_conn.c:1724
#2 0x556facd5a07f in quic_stream_try_to_consume src/quic_conn.c:1803
#3 0x556facd5abe9 in qc_treat_acked_tx_frm src/quic_conn.c:1866
#4 0x556facd5b3d8 in qc_ackrng_pkts src/quic_conn.c:1928
#5 0x556facd60187 in qc_parse_ack_frm src/quic_conn.c:2354
#6 0x556facd693a1 in qc_parse_pkt_frms src/quic_conn.c:3203
#7 0x556facd7531a in qc_treat_rx_pkts src/quic_conn.c:4606
#8 0x556facd7a528 in quic_conn_app_io_cb src/quic_conn.c:5059
#9 0x556fad3284be in run_tasks_from_lists src/task.c:596
#10 0x556fad32a3fa in process_runnable_tasks src/task.c:876
#11 0x556fad24a676 in run_poll_loop src/haproxy.c:2968
#12 0x556fad24b510 in run_thread_poll_loop src/haproxy.c:3167
#13 0x556fad24e7ff in main src/haproxy.c:3857
#14 0x7fae30ddd0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
#15 0x556facc9375d in _start (/opt/haproxy-quic-2.8/haproxy+0x1ea75d)
Address 0x7fff82bf9d78 is located in stack of thread T0 at offset 40 in frame
#0 0x556facd74ede in qc_treat_rx_pkts src/quic_conn.c:4580
Amaury Denoyelle [Wed, 11 Oct 2023 15:32:04 +0000 (17:32 +0200)]
BUG/MINOR: mux-quic: fix free on qcs-new fail alloc
qcs_new() allocates several elements in intermediary steps. All elements
must first be properly initialized to be able to free qcs instance in
case of an intermediary failure.
Previously, qc_stream_desc allocation was done in the middle of
qcs_new() before some elements initializations. In case this fails, a
crash can happened as some elements are left uninitialized.
To fix this, move qc_stream_desc allocation at the end of qcs_new().
This ensures that all qcs elements are initialized first.
Amaury Denoyelle [Wed, 11 Oct 2023 14:04:35 +0000 (16:04 +0200)]
BUG/MINOR: quic: fix free on quic-conn fail alloc
qc_new_conn() allocates several elements in intermediary steps. If one
of the fails, a global free is done on the quic_conn and its elements.
This requires that most elements are first initialized to NULL or
equivalent to ensure freeing operation is done only on proper values.
Once of this element is qc.tx.cc_buf_area. It was initialized too late
which could caused crashes. This is introduced by 9f7cfb0a56352188854bdaef9617ca836c2a30c9
MEDIUM: quic: Allow the quic_conn memory to be asap released.
Amaury Denoyelle [Wed, 11 Oct 2023 13:40:38 +0000 (15:40 +0200)]
BUG/MINOR: quic: fix qc.cids access on quic-conn fail alloc
CIDs tree is now allocated dynamically since the following commit : 276697438d50456f92487c990f20c4d726dfdb96
MINOR: quic: Use a pool for the connection ID tree.
This can caused a crash if qc_new_conn() is interrupted due to an
intermediary failed allocation. When freeing all connection members,
free_quic_conn_cids() is used. However, this function does not support a
NULL cids.
To fix this, simply check that cids is NULL during free_quic_conn_cids()
prologue.
Willy Tarreau [Thu, 12 Oct 2023 12:01:49 +0000 (14:01 +0200)]
BUG/MAJOR: connection: make sure to always remove a connection from the tree
Since commit 5afcb686b ("MAJOR: connection: purge idle conn by last usage")
in 2.9-dev4, the test on conn->toremove_list added to conn_get_idle_flag()
in 2.8 by commit 3a7b539b1 ("BUG/MEDIUM: connection: Preserve flags when a
conn is removed from an idle list") becomes misleading. Indeed, now both
toremove_list and idle_list are shared by a union since the presence in
these lists is mutually exclusive. However, in conn_get_idle_flag() we
check for the presence in the toremove_list to decide whether or not to
delete the connection from the tree. This test now fails because instead
it sees the presence in the idle or safe list via the union, and concludes
the element must not be removed. Thus the element remains in the tree and
can be found later after the connection is released, causing crashes that
Tristan reported in issue #2292.
The following config is sufficient to reproduce it with 2 threads:
With Amaury we analyzed the conditions in which the function is called
in order to figure a better condition for the test and concluded that
->toremove_list is never filled there so we can safely remove that part
from the test and just move the flag retrieval back to what it was prior
to the 2.8 patch above. Note that the patch is not reverted though, as
the parts that would drop the unexpected flags removal are unchanged.
This patch must NOT be backported. The code in 2.8 works correctly, it's
only the change in 2.9 that makes it misbehave.
Willy Tarreau [Thu, 12 Oct 2023 12:14:20 +0000 (14:14 +0200)]
CLEANUP: connection: drop an uneeded leftover cast
In conn_delete_from_tree() there remains a cast of the toremove_list
to struct list while the introduction of the union precisely was to
avoid this cast. It's a leftover from the first version of patch 5afcb686b ("MAJOR: connection: purge idle conn by last usage") merged
into in 2.9-dev4, let's fix that.
HTTP/3 specification has several requirement when parsing authority or
host header inside a request. However, it was until then only partially
implemented.
This commit fixes this by ensuring the following :
* reject an empty authority/host header
* reject a host header if an authority was found with a different value
* no authority neither host header present
BUG/MINOR: mux-quic: support initial 0 max-stream-data
Support stream opening with an initial max-stream-data of 0.
In normal case, QC_SF_BLK_SFCTL is set when a qcs instance cannot
transfer more data due to flow-control. This flag is set when
transfering data from MUX to quic-conn instance.
However, it's possible to define an initial value of 0 for
max-stream-data. In this case, qcs instance is blocked despite
QC_SF_BLK_SFCTL not set. No STREAM frame is prepared for this stream as
it's not possible to emit any byte, so QC_SF_BLK_SFCTL flag is never
set.
This behavior should cause no harm. However, this can cause a BUG_ON()
crash on qcc_io_send(). Indeed, when sending is retried, it ensures that
only qcs instance waiting for a new qc_stream_buf or with
QC_SF_BLK_SFCTL set is present in the send_list.
To fix this, initialize qcs with 0 value for msd and QC_SF_BLK_SFCTL.
The flag is removed only if transport parameter msd value is non null.
BUG/MEDIUM: mux-quic: fix RESET_STREAM on send-only stream
When receiving a RESET_STREAM on a send-only stream, it is mandatory to
close the connection with an error STREAM_STATE error. However, this was
badly implemented as this caused two invocation of qcc_set_error() which
is forbidden by the mux-quic API.
To fix this, rely on qcc_get_qcs() to properly detect the error. Remove
qcc_set_error() usage from qcc_recv_reset_stream() instead.
RFC 9000 indicates that a QUIC packet with no frame must trigger a
connection closure with PROTOCOL_VIOLATION error code. Implement this
via an early return inside qc_parse_pkt_frms().
Move all QUIC trace definitions from quic_conn.h to quic_trace-t.h. Also
remove multiple definition trace_quic macro definition into
quic_trace.h. This forces all QUIC source files who relies on trace to
include it while reducing the size of quic_conn.h.
BUG/MINOR: quic: Avoid crashing with unsupported cryptographic algos
This bug was detected when compiling haproxy against aws-lc TLS stack
during QUIC interop runner tests. Some algorithms could be negotiated by haproxy
through the TLS stack but not fully supported by haproxy QUIC implentation.
This leaded tls_aead() to return NULL (same thing for tls_md(), tls_hp()).
As these functions returned values were never checked, they could triggered
segfaults.
To fix this, one closes the connection as soon as possible with a
handshake_failure(40) TLS alert. Note that as the TLS stack successfully
negotiates an algorithm, it provides haproxy with CRYPTO data before entering
->set_encryption_secrets() callback. This is why this callback
(ha_set_encryption_secrets() on haproxy side) is modified to release all
the CRYPTO frames before triggering a CONNECTION_CLOSE with a TLS alert. This is
done calling qc_release_pktns_frms() for all the packet number spaces.
Modify some quic_tls_keys_hexdump to avoid crashes when the ->aead or ->hp EVP_CIPHER
are NULL.
Modify qc_release_pktns_frms() to do nothing if the packet number space passed
as parameter is not intialized.
This bug does not impact the QUIC TLS compatibily mode (USE_QUIC_OPENSSL_COMPAT).
Thank you to @ilia-shipitsin for having reported this issue in GH #2309.