BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown
When a shutdown is reported to the mux (shutdown for reads or writes), the
connexion is immediately fully closed if the mux detects the connexion is
closed in both directions. Only the passthrough multiplexer is able to
perform this action at this stage because there is no stream and no internal
data. Other muxes perform a full connection close during the mux's release
stage. It was working quite well since recently. But, in theory, the bug is
quite old.
In fact, it seems possible for the lower layer to report an error on the
connection in same time a shutdown is performed on the mux. Depending on how
events are scheduled, the following may happen:
1. An connection error is detected at the fd layer and a wakeup is
scheduled on the mux to handle the event.
2. A shutdown for writes is performed on the mux. Here the mux decides to
fully close the connexion. If the xprt is not used to log info, it is
released.
3. The mux is finally woken up. It tries to retrieve data from the xprt
because it is not awayre there was an error. This leads to a crash
because of a NULL-deref.
By reading the code, it is not obvious. But it seems possible with SSL
connection when the handshake is rearmed. It happens when a
SSL_ERROR_WANT_WRITE is reported on a SSL_read() attempt or a
SSL_ERROR_WANT_READ on a SSL_write() attempt.
This bug is only visible if the XPRT is not used to log info. So it is no so
common.
This patch should fix the 2nd crash reported in the issue #2656. It must
first be backported as far as 2.9 and then slowly to all stable versions.
MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit
There is no reason to acquire a write-lock on the sticky session when a
shared limit is applied because only the frequency is updated. The sticky
session itself is not modified. We must just take care it is not removed in
the mean time. So a read-lock may be used instead.
MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bits limit of the
frequency counters. Thanks to this factor, it is possible to have bytes
rates beyond the 4GB. Instead of counting each bytes, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limit exceeding the 4GB/s.
For now, this parameter must be in the range ]0-1024].
BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)
This bug follows this patch:
MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
where a new third variable was added to be dumped from QUIC_EV_CONN_IO_CB trace
event. The quic_trace() code did not reveal there was already another variable
passed as third argument but not dumped. This leaded to crash when dereferencing
a point to an int in place of a point to an SSL object.
This issue was reproduced only by handshakecorruption aws-lc interop test with
s2n-quic as client.
Note that this patch must be backported with this one:
BUG/MEDIUM: quic: always validate sender address on 0-RTT
which depends on the commit mentionned above.
Aperence [Mon, 26 Aug 2024 09:50:27 +0000 (11:50 +0200)]
MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.
Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.
To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.
This attached patch adds MPTCP per address support, to be used with:
mptcp{,4,6}@<address>[:port1[-port2]]
MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.
These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.
MPTCP is both supported for the frontend and backend sides.
Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.
Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.
Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.
This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.
Aperence [Mon, 26 Aug 2024 09:50:26 +0000 (11:50 +0200)]
MEDIUM: sock: use protocol when creating socket
Use the protocol configured for a connection when creating the socket,
instead of always using 0.
This change is needed to allow new protocol to be used when creating
the sockets, such as MPTCP. Note however that this patch won't change
anything for now, as the only other value that proto->sock_prot could
hold is IPPROTO_TCP, which has the same behavior as 0 when passed to
socket.
Willy Tarreau [Fri, 30 Aug 2024 16:52:33 +0000 (18:52 +0200)]
BUILD: quic: fix build errors on FreeBSD since recent GSO changes
The following commits broke the build on FreeBSD when QUIC is enabled:
35470d518 ("MINOR: quic: activate UDP GSO for QUIC if supported") 448d3d388 ("MINOR: quic: add GSO parameter on quic_sock send API")
Indeed, it turns out that netinet/udp.h requires sys/types.h to be
included before. Let's just change the includes order to fix the build.
No backport is needed.
BUG/MEDIUM: quic: always validate sender address on 0-RTT
It has been reported by Wedl Michael, a student at the University of Applied
Sciences St. Poelten, a potential vulnerability into haproxy as described below.
An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker has not even to send a application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests.
To mitigate this vulnerability, one decided to use a token which can be provided
to the client each time it successfully managed to connect to haproxy. These
tokens may be reused for future connections to validate the address/path of the
remote peer as this is done with the Retry token which is used for the current
connection, not the next one. Such tokens are transported by NEW_TOKEN frames
which was not used at this time by haproxy.
So, each time a client connect to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such a token is presented by the client,
haproxy checks if the session is a 0-RTT one, so with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without token presented by the client.
quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.
quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secre()
and modified to be reused and derive the secret for the new token implementation.
quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.
When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done calling qc_ssl_eary_data_accepted()
new function. The secret TLS handshake is interrupted as soon as possible returnin
0 from ha_quic_add_handshake_data(). The connection is also marked as
requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from
ha_quic_add_handshake_data(). The the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.
About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.
MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
Dump the early data status from QUIC_EV_CONN_IO_CB trace event.
This is very helpful to know if the QUIC server has accepted the
early data received from clients.
This function is a wrapper around SSL_get_early_data_status() for
OpenSSL derived stack and SSL_early_data_accepted() boringSSL derived
stacks like AWS-LC. It returns true for a TLS server if it has
accepted the early data received from a client.
Also implement quic_ssl_early_data_status_str() which is dedicated to be used
for debugging purposes (traces). This function converts the enum returned
by the two function mentionned above to a human readable string.
Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.
BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
quic_build_new_token_frame() is the function which is called to build
a NEW_TOKEN frame into a buffer. The position pointer for this buffer
was not updated, leading the NEW_TOKEN frame to be malformed.
MINOR: quic: Token for future connections implementation.
There exist two sorts of token used by QUIC. They are both used to validate
the peer address (path validation). Retry are used for the current
connection the client want to open. This patch implement the other
sort of tokens which after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).
The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens for the Retry token, followed
by a 32bits timestamps. As this part is ciphered with AEAD as cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, this is the client IP address
which is used also as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.
This is function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.
This patch renames quic_tls_derive_retry_token_secret() to a more
and reuses its code to produce a more generic one: quic_do_tls_derive_token_secret().
Two arguments are added to this latter to produce both quic_tls_derive_retry_token_secret()
and quic_tls_derive_token_secret() new function which calls
quic_do_tls_derive_token_secret().
Nicolas CARPi [Tue, 27 Aug 2024 19:51:26 +0000 (21:51 +0200)]
CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.
BUG/MINIR: proxy: Match on 429 status when trying to perform a L7 retry
Support for 429 was recently added to L7 retries (0d142e075 "MINOR: proxy:
Add support of 429-Too-Many-Requests in retry-on status"). But the
l7_status_match() function was not properly updated. The switch statement
must match the 429 status to be able to perform a L7 retry.
This patch must be backported if the commit above is backported. It is
related to #2687.
BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready
If an early error occurred on the client connection, we must prevent any
multiplexer upgrades. Indeed, it is unexpected for a mux to be initialized
with no xprt. On a normal workflow it is impossible. So it is not an
issue. But if a mux upgrade is performed at the stream level, an early error
on the connection may have already been handled by the previous mux and the
connection may be already fully closed. If the mux upgrade is still
performed, a crash can be experienced.
It is possible to have a crash with an implicit TCP>HTTP upgrade if there is no
data in the input buffer. But it is also possible to get a crash with an
explicit "switch-mode http" rule.
It must be backported to all stable versions. In 2.2, the patch must be
applied directly in stream_set_backend() function.
BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding
When DATA frames are sent via the 0-copy data forwarding, we must take care
to set the ES flag on the last DATA frame. It should be performed in
h2_done_ff() when IOBUF_FL_EOI flag was set by the producer. This flag is
here to know when the producer has reached the end of input. When this
happens, the h2s state is also updated. It is switched to "half-closed
local" or "closed" state depending on its previous state.
It is mainly an issue on uploads because the server may be blocked waiting
for the end of the request. A workaround is to disable the 0-copy forwarding
support the the H2 by setting "tune.h2.zero-copy-fwd-send" directive to off
in your global section.
This patch should fix the issue #2665. It must be backported as far as 2.9.
MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
MEDIUM: ssl: capture the supported_versions extension from Client Hello
Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
BUILD: quic: 32bits build broken by wrong integer conversions for printf()
Since these commits the 32bits build is broken due to several errors as follow:
CC src/quic_cli.o
src/quic_cli.c: In function ‘dump_quic_full’:
src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’,
but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
285 | chunk_appendf(&trash, " [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n",
| ~~^
| |
| long int
| %lld
286 | pktns->rx.arngs.sz, pktns->tx.in_flight,
287 | pktns->tx.in_flight * 100 / qc->path->cwnd);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| uint64_t {aka long long unsigned int}
Replace several %ld by %llu with ull as printf conversion in quic_clic.c and a
%ld by %lld with (long long) as printf conversion in quic_cc_cubic.c.
Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689.
BUG/MINOR: haproxy: free init_env in deinit only if allocated
This fixes 7b78e1571 (" MINOR: mworker: restore initial env before wait
mode").
In cases, when haproxy starts without any configuration, for example:
'haproxy -vv', init_env array to backup env variables is never allocated. So,
we need to check in deinit(), when we free its memory, that init_env is not a
NULL ptr.
MINOR: mworker: restore initial env before wait mode
This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).
In order to avoid unexpected behaviour in master-worker mode during the process
reload with a new configuration, when the old one has contained '*env' keywords,
let's backup its initial environment before calling parse_cfg() and let's clean
and restore it in the context of master process, just before it enters in a wait
polling loop.
This will garantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration,
when it's in a wait-mode.
MINOR: tools: add helpers to backup/clean/restore env
'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could
modify the process runtime environment. In case of master-worker mode this
creates a problem, as the configuration is read only once before the forking a
worker and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherited the previous unchanged environment from the
master in wait mode, thus it won't benefit the changes in configuration,
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.
So, let's add a helper to backup all process env variables just before it will
read its configuration. And let's also add helpers to clean up the current
runtime environment and to restore it to its initial state (as it was before
parsing the config).
Amaury Denoyelle [Wed, 21 Aug 2024 13:30:17 +0000 (15:30 +0200)]
MINOR: mux-quic: add buf_in_flight to QCC debug infos
Dump <buf_in_flight> QCC field both in QUIC MUX traces and "show quic".
This could help to detect if MUX does not allocate enough buffers
compared to quic_conn current congestion window.
Willy Tarreau [Wed, 21 Aug 2024 15:50:03 +0000 (17:50 +0200)]
[RELEASE] Released version 3.1-dev6
Released version 3.1-dev6 with the following main changes :
- BUG/MINOR: proto_tcp: delete fd from fdtab if listen() fails
- BUG/MINOR: proto_tcp: keep error msg if listen() fails
- MINOR: proto_tcp: tcp_bind_listener: copy errno in errmsg
- MINOR: channel: implement ci_insert() function
- BUG/MEDIUM: mworker/cli: fix pipelined modes on master CLI
- REGTESTS: mcli: test the pipelined commands on master CLI
- MINOR: cfgparse: load_cfg_in_mem: fix null ptr dereference reported by coverity
- MINOR: startup: fix unused value reported by coverity
- BUG/MINOR: mux-quic: do not send too big MAX_STREAMS ID
- BUG/MINOR: proto_uxst: delete fd from fdtab if listen() fails
- BUG/MINOR: cfgparse: parse_cfg: fix null ptr dereference reported by coverity
- MINOR: proto_uxst: copy errno in errmsg for syscalls
- MINOR: mux-quic: do not trace error in qcc_send_frames() on empty list
- BUG/MINOR: h3: properly reject too long header responses
- CLEANUP: mworker/cli: clean up the mode handling
- BUG/MINOR: tools: make fgets_from_mem() stop at the end of the input
- BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity
- BUG/MINOR: pattern: pat_ref_set: return 0 if err was found
- CI: keep logs for failed QIUC Interop jobs
- BUG/MINOR: release-estimator: fix relative scheme in CHANGELOG URL
- MINOR: release-estimator: add requirements.txt
- MINOR: release-estimator: add installation steps in README.md
- MINOR: release-estimator: fix the shebang of the python script
- DOC: config: correct the table for option tcplog
- MEDIUM: log: relax some checks and emit diag warnings instead in lf_expr_postcheck()
- MINOR: log: "drop" support for log-profile steps
- CI: QUIC Interop LibreSSL: document chacha20 test status
- CI: modernize codespell action, switch to node 16
- CI: QUIC Interop AWS-LC: enable chrome client
- DOC: lua: fix incorrect english in lua.txt
- MINOR: Implements new log format of option tcplog clf
- MINOR: cfgparse: limit file size loaded via /dev/stdin
- BUG/MINOR: stats: fix color of input elements in dark mode
- CLEANUP: stats: use modern DOCTYPE tag
- BUG/MINOR: stats: add lang attribute to html tag
- DOC: quic: fix default minimal value for max window size
- DOC: quic: document nocc debug congestion algorithm
- MINOR: quic: extract config window-size parsing
- MINOR: quic: define max-window-size config setting
- MINOR: quic: allocate stream txbuf via qc_stream_desc API
- MINOR: mux-quic: account stream txbuf in QCC
- MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
- MINOR: h3: mark control stream as metadata
- MINOR: mux-quic: define buf_in_flight
- MAJOR: mux-quic: allocate Tx buffers based on congestion window
- MINOR: quic/config: adapt settings to new conn buffer limit
- MINOR: quic: define sbuf pool
- MINOR: quic: support sbuf allocation in quic_stream
- MEDIUM: h3: allocate small buffers for headers frames
- MINOR: mux-quic: retry after small buf alloc failure
- BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser
- BUG/MINOR: cfgparse-global: clean common_kw_list
- BUG/MINOR: cfgparse-global: remove redundant goto
- MINOR: cfgparse-global: move 'pidfile' in global keywords list
- MINOR: cfgparse-global: move 'expose-*' in global keywords list
- MINOR: cfgparse-global: move tune options in global keywords list
- MINOR: cfgparse-global: move unsupported keywords in global list
- BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list
- MINOR: quic: store the lost packets counter in the quic_cc_event element
- MINOR: quic: support a tolerance for spurious losses
- MINOR: protocol: properly assign the sock_domain and sock_family
- MINOR: protocol: add a family lookup
- MEDIUM: socket: always properly use the sock_domain for requested families
- MINOR: protocol: add the real address family to the protocol
- MINOR: socket: don't ban all custom families from reuseport
- MINOR: protocol: always initialize the receivers list on registration
- CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers
Willy Tarreau [Fri, 9 Aug 2024 19:32:29 +0000 (21:32 +0200)]
MINOR: protocol: always initialize the receivers list on registration
Till now, protocols were required to self-initialize their receivers
list head, which is not very convenient, and is quite error prone.
Indeed, it's too easy to copy-paste a protocol definition and forget
to update the .receivers field to point to itself, resulting in mixed
lists. Let's just do that in protocol_register(). And while we're at
it, let's also zero the nb_receivers entry that works with it, so that
the protocol definition isn't required to pre-initialize stuff related
to internal book-keeping.
Willy Tarreau [Fri, 9 Aug 2024 19:02:24 +0000 (21:02 +0200)]
MINOR: socket: don't ban all custom families from reuseport
The test on ss_family >= AF_MAX is too strict if we want to support new
custom families, let's apply this to the real_family instead so that we
check that the underlying socket supports reuseport.
Willy Tarreau [Fri, 9 Aug 2024 18:13:10 +0000 (20:13 +0200)]
MINOR: protocol: add the real address family to the protocol
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.
Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
Willy Tarreau [Fri, 9 Aug 2024 17:37:44 +0000 (19:37 +0200)]
MEDIUM: socket: always properly use the sock_domain for requested families
Now we make sure to always look up the protocol's domain for an address
family. Previously we would use it as-is, which prevented from properly
using custom addresses (which is when they differ).
This removes some hard-coded tests such as in log.c where UNIX vs UDP
was explicitly checked for example. It requires a bit of care, however,
so as to properly pass value 1 in the 3rd arg of the protocol_lookup()
for DGRAM stuff. Maybe one day we'll change these for defines or enums
to limit mistakes.
Willy Tarreau [Fri, 9 Aug 2024 18:30:31 +0000 (20:30 +0200)]
MINOR: protocol: add a family lookup
At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.
Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.
Willy Tarreau [Fri, 9 Aug 2024 16:59:46 +0000 (18:59 +0200)]
MINOR: protocol: properly assign the sock_domain and sock_family
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).
This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.
This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
MINOR: quic: support a tolerance for spurious losses
Tests performed between a 1 Gbps connected server and a 100 mbps client,
distant by 95ms showed that:
- we need 1.1 MB in flight to fill the link
- rare but inevitable losses are sufficient to make cubic's window
collapse fast and long to recover
- a 100 MB object takes 69s to download
- tolerance for 1 loss between two ACKs suffices to shrink the download
time to 20-22s
- 2 losses go to 17-20s
- 4 losses reach 14-17s
At 100 concurrent connections that fill the server's link:
- 0 loss tolerance shows 2-3% losses
- 1 loss tolerance shows 3-5% losses
- 2 loss tolerance shows 10-13% losses
- 4 loss tolerance shows 23-29% losses
As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.
MINOR: quic: store the lost packets counter in the quic_cc_event element
Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.
BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list
Remove tune.fast-forward from common_kw_list. It was replaced by
'tune.disable-fast-forward' and it's no longer present in "if..else if.."
parser from cfg_parse_global(). Otherwise, it may be shown as the best-match
keyword for some tune options, which is now wrong.
MINOR: cfgparse-global: move unsupported keywords in global list
Following the previous commits and in order to clean up cfg_parse_global let's
move unsupported keywords in the global list and let's add for them a dedicated
parser.
MINOR: cfgparse-global: move tune options in global keywords list
In order to clean up cfg_parse_global() and to add the support of the new
MODE_DISCOVERY in configuration parsing, let's move the keywords related to
tune options into the global keywords list and let's add for them two dedicated
parsers. Tune options keywords are sorted between two parsers in dependency of
parameters number, which a given tune option needs.
tune options parser is called by section parser and follows the common API, i.e.
it returns -1 on failure, 0 on success and 1 on recoverable error. In case of
recoverable error we've previously returned ERR_ALERT (0x10) and we have emitted
an alert message at startup. Section parser treats all rc > 0 as ERR_WARN. So in
case, if some tune option was set twice in the global section, tune
options parser will return 1 (in order to respect the common API), section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of alert, as it was before.
MINOR: cfgparse-global: move 'expose-*' in global keywords list
Following the previous commit let's also move 'expose-*' keywords in the global
cfg_kws list and let's add for them a dedicated parser. This will simplify the
configuration parsing in the new MODE_DISCOVERY, which allows to read only the
keywords, needed at the early start of haproxy process (i.e. modes, pidfile,
chosen poller).
MINOR: cfgparse-global: move 'pidfile' in global keywords list
This commit cleans up cfg_parse_global() and prepares the config parser to
support MODE_DISCOVERY. This step is needed in early starting stage, just to
figura out in which mode the process was started, to set some necessary
parameteres needed for this mode and to continue the initialization
stage.
'pidfile' makes part of such common keywords, which are needed to be parsed
very early and which are used almost in all process modes (except the
foreground, '-d').
'pidfile' keyword parser is called by section parser and follows the common
API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error. In
case of recoverable error we've previously returned ERR_ALERT (0x10) and we have
emitted an alert message at startup. Section parser treats all rc > 0 as
ERR_WARN. So in case, if pidfile was already specified via command line, the
keyword parser will return 1 (in order to respect the common API), section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of alert, as it was before.
In the case, when the given keyword was found in the global 'cfg_kws' list, we
go to 'out' label anyway, after testing rc returned by the keyword's parser. So
there is not a much gain if we perform 'goto out' jump specifically when rc > 0.
This patch fixes commits 118ac11ce
("MINOR: cfgparse-global: move mode's keywords in cfg_kw_list") and 83ff4db18
(MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list).
'common_kw_list' serves to show the best-match keyword in cfg_parse_global(), if
the given keyword was not parsed in "if..else if.." cases. cfg_parse_global()
is still used as a parser for some keywords from the global section.
Mode-specific and no<poller_name> keywords now have their own parsers. They no
longer take place in the "if..else if.." from cfg_parse_global() and they are
registered in the 'cfg_kws' list. So, there is no longer need to duplicate
them in the 'common_kw_list'. Otherwise, they will be shown twice in parser
error message.
BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser
This patch fixes the commit 118ac11ce
("cfgparse-global: move mode's keywords in cfg_kw_list"). Error message
delivered by keyword parser in **err is always shown with ha_alert() by the
caller cfg_parse_global(). The caller always supplies these alerts with the
filename and the line number.
MINOR: mux-quic: retry after small buf alloc failure
Previous commit switch to small buffers for HTTP/3 HEADERS emission.
This ensures that several parallel streams can allocate their own buffer
without hitting the connection buffer limit based now on the congestion
window size.
However, this prevents the transmission of responses with uncommonly
large headers. Indeed, if all headers cannot be encoded in a single
buffer, an error is reported which cause the whole connection closure.
Adjust this by implementing a realloc API exposed by QUIC MUX. This
allows application layer to switch from a small to a default buffer and
restart its processing. This guarantees that again headers not longer
than bufsize can be properly transferred.
Amaury Denoyelle [Wed, 14 Aug 2024 09:07:44 +0000 (11:07 +0200)]
MEDIUM: h3: allocate small buffers for headers frames
A major change was recently implemented to change QUIC MUX Tx buffer
allocation limit, which is now based on the current connection
congestion window size. As this size may be smaller than the previous
static value, it is likely that the limit will be reached more
frequently.
When using HTTP/3, the majority of requests streams are used for small
object exchanges. Every responses start with a HEADERS frames which
should be much smaller in size than the default buffer. But as the whole
buffer size is accounted against the congestion window, a single stream
can block others even if only emitting a single HEADERS frame which is
suboptimal for bandwith usage, if the congestion window is small enough.
To adapt to this new situation, rely on the newly available small
buffers to transfer HEADERS frame response. This at least guarantee that
several parallel streams could allocate their own buffer for the first
part of the response, even with a small congestion window.
The situation could be further improve to use various indication on the
data size and select a small buffer if sufficient. This could be done
for example via the Content-length value or HTX extra field. However
this must be the subject of a dedicated patch.
Amaury Denoyelle [Thu, 13 Jun 2024 13:26:51 +0000 (15:26 +0200)]
MINOR: quic: support sbuf allocation in quic_stream
This patch extends qc_stream_desc API to be able to allocate small
buffers. QUIC MUX API is similarly updated as ultimatly each application
protocol is responsible to choose between a default or a smaller buffer.
Internally, the type of allocated buffer is remembered via qc_stream_buf
instance. This is mandatory to ensure that the buffer is released in the
correct pool, in particular as small and standard buffers can be
configured with the same size.
This commit is purely an API change. For the moment, small buffers are
not used. This will changed in a dedicated patch.
Amaury Denoyelle [Tue, 13 Aug 2024 07:34:28 +0000 (09:34 +0200)]
MINOR: quic: define sbuf pool
Define a new buffer pool reserved to allocate smaller memory area. For
the moment, its usage will be restricted to QUIC, as such it is declared
in quic_stream module.
Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize to avoid unexpected effects.
Amaury Denoyelle [Wed, 14 Aug 2024 09:07:13 +0000 (11:07 +0200)]
MINOR: quic/config: adapt settings to new conn buffer limit
QUIC MUX buffer allocation limit is now directly based on the underlying
congestion window size. previous static limit based on conn-tx-buffers
is now unused. As such, this commit adds a warning to users to prevent
that it is now obsolete.
Secondly, update max-window-size setting. It is now the main entrypoint
to limit both the maximum congestion window size and the number of QUIC
MUX allocated buffer on emission. Remove its special value '0' which was
used to automatically adjust it on now unused conn-tx-buffers.
Amaury Denoyelle [Thu, 13 Jun 2024 15:06:40 +0000 (17:06 +0200)]
MAJOR: mux-quic: allocate Tx buffers based on congestion window
Each QUIC MUX may allocate buffers for MUX stream emission. These
buffers are then shared with quic_conn to handle ACK reception and
retransmission. A limit on the number of concurrent buffers used per
connection has been defined statically and can be updated via a
configuration option. This commit replaces the limit to instead use the
current underlying congestion window size.
The purpose of this change is to remove the artificial static buffer
count limit, which may be difficult to choose. Indeed, if a connection
performs with minimal loss rate, the buffer count would limit severely
its throughput. It could be increase to fix this, but it also impacts
others connections, even with less optimal performance, causing too many
extra data buffering on the MUX layer. By using the dynamic congestion
window size, haproxy ensures that MUX buffering corresponds roughly to
the network conditions.
Using QCC <buf_in_flight>, a new buffer can be allocated if it is less
than the current window size. If not, QCS emission is interrupted and
haproxy stream layer will subscribe until a new buffer is ready.
One of the criticals parts is to ensure that MUX layer previously
blocked on buffer allocation is properly woken up when sending can be
retried. This occurs on two occasions :
* after an already used Tx buffer is cleared on ACK reception. This case
is already handled by qcc_notify_buf() via quic_stream layer.
* on congestion window increase. A new qcc_notify_buf() invokation is
added into qc_notify_send().
Finally, remove <avail_bufs> QCC field which is now unused.
This commit is labelled MAJOR as it may have unexpected effect and could
cause significant behavior change. For example, in previous
implementation QUIC MUX would be able to buffer more data even if the
congestion window is small. With this patch, data cannot be transferred
from the stream layer which may cause more streams to be shut down on
client timeout. Another effect may be more CPU consumption as the
connection limit would be hit more often, causing more streams to be
interrupted and woken up in cycle.
Amaury Denoyelle [Tue, 13 Aug 2024 06:51:46 +0000 (08:51 +0200)]
MINOR: mux-quic: define buf_in_flight
Define a new QCC counter named <buf_in_flight>. Its purpose is to
account the current sum of all allocated stream buffer size used on
emission.
For this moment, this counter is updated and buffer allocation and
deallocation. It will be used to replace <avail_bufs> once congestion
window is used as limit for buffer allocation in a future commit.
Amaury Denoyelle [Mon, 19 Aug 2024 08:28:40 +0000 (10:28 +0200)]
MINOR: h3: mark control stream as metadata
A current work is performed to change QUIC MUX buffer allocation limit
from a configurable static value to use the size of the congestion
window instead. This change may cause the buffer allocation limit to be
triggered more frequently.
To ensure HTTP/3 control emission is not perturbed by this change, mark
the stream with qcc_send_metadata(). This ensures that buffer allocation
for this stream won't be subject to the connection limit. This is
necessary to guarantee that SETTINGS and GOAWAY frames are emitted.
Amaury Denoyelle [Mon, 19 Aug 2024 08:22:02 +0000 (10:22 +0200)]
MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark
streams which are not subject to the connection limit on allocated MUX
stream buffer.
The purpose is to simplify handling of QUIC MUX streams which do not
transfer data and as such are not driven by haproxy layer, for example
HTTP/3 control stream. These streams interacts synchronously with QUIC
MUX and cannot retry emission in case of temporary failure.
This commit will be useful once connection buffer allocation limit is
reimplemented to directly rely on the congestion window size. This will
probably cause the buffer limit to be reached more frequently, maybe
even on QUIC MUX initialization. As such, it will be possible to mark
control streams and prevent them to be subject to the buffer limit.
QUIC MUX expose a new function qcs_send_metadata(). It can be used by an
application protocol to specify which streams are used for control
exchanges. For the moment, no such stream use this mechanism.
Amaury Denoyelle [Tue, 13 Aug 2024 09:57:50 +0000 (11:57 +0200)]
MINOR: mux-quic: account stream txbuf in QCC
A limit per connection is put on the number of buffers allocated by QUIC
MUX for emission accross all its streams. This ensures memory
consumption remains under control. This limit is simply explained as a
count of buffers which can be concurrently allocated for each
connection.
As such, quic_conn structure was used to account currently allocated
buffers. However, a quic_conn nevers allocates new stream buffers. This
is only done at QUIC MUX layer. As such, this commit moves buffer
accounting inside QCC structure. This simplifies the API, most notably
qc_stream_buf_alloc() usage.
Note that this commit inverts the accounting. Previously, it was
initially set to 0 and increment for each allocated buffer. Now, it is
set to the maximum value and decrement for each buf usage. This is
considered as clearer to use.
Amaury Denoyelle [Tue, 13 Aug 2024 09:08:08 +0000 (11:08 +0200)]
MINOR: quic: allocate stream txbuf via qc_stream_desc API
This commit simply adjusts QUIC stream buffer allocation. This operation
is conducted by QUIC MUX using qc_stream_desc layer. Previously,
qc_stream_buf_alloc() would return a qc_stream_buf instance and QUIC MUX
would finalized the buffer area allocation. Change this to perform the
buffer allocation directly into qc_stream_buf_alloc().
This patch clarifies the interaction between QUIC MUX and
qc_stream_desc. It is cleaner to allocate the buffer via qc_stream_desc
as it is already responsible to free the buffer.
It also ensures that connection buffer accounting is only done after the
whole qc_stream_buf and its buffer are allocated. Previously, the
increment operation was performed between the two steps. This was not an
issue, as this kind of error triggers the whole connection closure.
However, if in the future this is handled as a stream closure instead,
this commit ensures that the buffer remains valid in all cases.
Define a new global keyword tune.quic.frontend.max-window-size. This
allows to set globally the maximum congestion window size for each QUIC
frontend connections.
The default value is 0. It is a special value which automatically derive
the size from the configured QUIC connection buffer limit. This is
similar to the previous "quic-cc-algo" behavior, which can be used to
override the maximum window size per bind line.
Amaury Denoyelle [Wed, 14 Aug 2024 16:30:34 +0000 (18:30 +0200)]
MINOR: quic: extract config window-size parsing
quic-cc-algo is a bind line keyword which allow to select a QUIC
congestion algorithm. It can take an optional integer to specify the
maximum window size. This value is an integer and support the suffixes
'k', 'm' and 'g' to specify respectively kilobytes, megabytes and
gigabytes.
Extract the maximum window size parsing in a dedicated function named
parse_window_size(). It accepts as input an integer value with an
optional suffix, 'k', 'm' or 'g'. The first invalid character is
returned by the function to the caller.
No functional change. This commit will allow to quickly implement a new
keyword to configure a default congestion window size in the global
section.
Document nocc congestion algorithm as an entry of quic-cc-algo.
Highlight the fact that it is reserved for debugging and should not be
used outside of this use case.
Amaury Denoyelle [Wed, 14 Aug 2024 16:25:01 +0000 (18:25 +0200)]
DOC: quic: fix default minimal value for max window size
It is possible to override the default QUIC congestion algorithm on a
bind line. With the same setting, it is also possible to specify the
maximum congestion window size.
The parser rejects values outside of the range between 10k and 4g. This
is in contradiction with the documentation which specify 1k as the lower
value. Correct this value in the documentation.
Nicolas CARPi [Tue, 20 Aug 2024 13:20:10 +0000 (15:20 +0200)]
BUG/MINOR: stats: add lang attribute to html tag
The "html" element of the stats page was missing a "lang" attribute.
This change specifies the "en" value, which corresponds to english
language.
It is also a required element for WCAG Success Criterion 3.1.1, which
renders the web more accessible through a set of requirements. In this
case it allows assistive technologies such as screen readers to
determine the language of the page.
MDN page: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/lang
HTML standard: https://html.spec.whatwg.org/multipage/dom.html#attr-lang
WCAG criterion: https://www.w3.org/WAI/WCAG22/Understanding/language-of-page.html
Nicolas CARPi [Tue, 20 Aug 2024 13:12:23 +0000 (15:12 +0200)]
CLEANUP: stats: use modern DOCTYPE tag
Switching the stats page doctype to the modern standard is shorter and
less complex, and is the recommended doctype by current HTML standard.
It makes it clear that we do not want to run in quirks mode. More information below.
Quirks mode: https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
HTML Standard: https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that process has consumed all available RAM. In case of containers and
virtualized environments it's not very good.
So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will
limit the size of input supplied via /dev/stdin.
Nathan Wehrman [Tue, 13 Aug 2024 18:25:38 +0000 (11:25 -0700)]
MINOR: Implements new log format of option tcplog clf
Some systems require log formats in the CLF format and that meant that I
could not send my logs for proxies in mode tcp to those servers. This
implements a format that uses log variables that are compatble with TCP
mode frontends and replaces traditional HTTP values in the CLF format
to make them stand out. Instead of logging method and URI like this
"GET /example HTTP/1.1" it will log "TCP " and for a response code I
used "000" so it would be easy to separate from legitimate HTTP
traffic. Now your log servers that require a CLF format can see the
timings for TCP traffic as well as HTTP.
Ilia Shipitsin [Tue, 13 Aug 2024 19:11:30 +0000 (21:11 +0200)]
CI: modernize codespell action, switch to node 16
The following actions uses node12 which is deprecated and will be forced
to run on node16: codespell-project/codespell-problem-matcher@v1. For
more info:
https://github.blog/changelog/2023-06-13-github-actions-all-actions-will-run-on-node16-instead-of-node12-by-default/
It is now possible to use "drop" keyword for "on" lines under a
log-profile section to specify that no log at all should be emitted for
the specified step (setting an empty format was not sufficient to do so
because only the log payload would be empty, not the log header, thus the
log would still be emitted).
It may be useful to selectively disable logging at specific steps for a
given log target (since the log profile may be set on log directives):
log-profile myprof
on request format "blabla" sd "custom sd"
on response drop
New testcase was added to reg-tests/log/log_profiles.vtc
MEDIUM: log: relax some checks and emit diag warnings instead in lf_expr_postcheck()
With 7a21c3a ("MAJOR: log: implement proper postparsing for logformat
expressions") which finally made postparsing checks reliable, we started
to get report from users that couldn't start haproxy 3.0 with configs that
used to work in the past. The current situation is described in GH #2642.
While the checks are mostly relevant, it turns out there are not strictly
needed anymore from a technical point of view. Most of them were useful in
early logformat implementation to prevent runtime bugs due to the use of
an alias or fetch at runtime from an incompatible proxy. It's been a few
versions already that the code handling fetches and log aliases is robust
enough to support fetches/aliases used from the wrong context: all it
does is that the fetch/alias will silently fail if it's not available.
This can be proved by the fact that even if the postparsing checks were
partially broken in the past, it didn't cause runtime issues (at least
on recent haproxy versions).
Most of these checks can now be seen as configuration hints: when a check
triggers, it will indicate a configuration inconsistency in most cases,
but they are some corner cases where it is not possible to know at config
time if the conditions will be met for the alias/fetch to work properly..
so instead of failing with a hard error like we did so far, let's just be
more permissive and report our findings using "diag_warning": such
warnings are only emitted when haproxy is started with '-dD' cli option.
We also took this opportunity to improve messages clarity and make them
more precise (report the offending item instead of complaining about the
whole expression because of a single element).
With this patch, configs that used to start before 7a21c3a shouldn't
trigger hard errors anymore.
MINOR: release-estimator: fix the shebang of the python script
Fix the shebang of the python script to use /usr/bin/env, allowing to
call the script directly from a virtualenv with `./release-estimator.py`
without using the python3 install of the system.
BUG/MINOR: release-estimator: fix relative scheme in CHANGELOG URL
The CHANGELOG URL which is parsed in the HTML now have a relative
scheme, which is incompatible with requests. This patch adds an https
scheme to the URL.
Ilia Shipitsin [Wed, 7 Aug 2024 11:02:29 +0000 (13:02 +0200)]
CI: keep logs for failed QIUC Interop jobs
it might be useful to investigate logs of failed tests. to keep
artifacts small the following actions are taken
- only failed logs are kept
- logs retention is 6 days
BUG/MINOR: pattern: pat_ref_set: return 0 if err was found
pat_ref_set_elt() returns 0, if we are run out of memory or can't parse a new
map value. Any arror message emitted by pat_ref_set_elt() is saved in err
buffer, if its provided by caller. These error messages are cumulated during
the loop.
pat_ref_set() is used to update values in map, referred to the same given key.
If during the update pat_ref_set_elt() fails, let's retun 0 to caller
immediately. We have the same non-unique key and the same new value in each
loop. So it seems quite odd to cumulate the same error messages and print it in
CLI:
BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity
memprintf() performs realloc and updates then the pointer to an output buffer,
where it has written the data. So free() is called on the previous buffer
address, if it was provided.
pat_ref_set_elt() uses memprintf() to write its error message as well as
pat_ref_set(). So, when we re-enter into the while loop the second time and
pat_ref_set_elt() has returned, the *err ptr (previous value of *merr) is
already freed by memprintf() from pat_ref_set_el().
'if (!found)' condition is false at this point, because we've found a node at
the first loop. So, the second memprintf(), in order to write error messages,
does again free(*err).
BUG/MINOR: h3: properly reject too long header responses
When encoding HTX to HTTP/3 headers on the response path, a bunch of
ABORT_NOW() where used when buffer room was not enough. In most cases
this is safe as output buffer has just been allocated and so is empty at
the start of the function. However, with a header list longer than a
whole buffer, this would cause an unexpected crash.
Fix this by removing ABORT_NOW() statement with proper error return
path. For the moment, this would cause the whole connection to be close
rather than the stream only. This may be further improved in the future.
Also remove ABORT_NOW() when encoding frame length at the end of headers
or trailers encoding. Buffer room is sufficient as it was already
checked prior in the same function.
This should be backported up to 2.6. Special care should be handled
however as this code path has changed frequently :
* for 2.9 and older, the extra following statement must be inserted
prior each newly added goto statement :
h3c->err = H3_INTERNAL_ERROR;
* for 2.6, trailers support is not implemented. As such, related chunks
should just be ignored when backporting.
MINOR: mux-quic: do not trace error in qcc_send_frames() on empty list
qcc_send_frames() can be called with an empty list and returns
immediately with an error code. This is convenience to be able to call
it in a while loop.
Remove the trace with "error" when this is the case and replacing it
with a less alarming "leaving on..." message. This should help debugging
when traces are active.
BUG/MINOR: cfgparse: parse_cfg: fix null ptr dereference reported by coverity
This commit fixes potential null ptr dereferences reported by coverity, see
more details about it in the issues #2676 and #2668.
'outline' ptr, which is initialized to NULL explicitly as a temporary buffer to
store split keywords may be in theory implicitly dereferenced in some corner
cases (which we haven't encountered yet with real world configurations) in
'if (!**args)'. parse_line() code, called before under some conditions
assigns: args[arg] = outline + outpos and outpos initial value is 0.