Store a pointer to the expression (struct pattern_expr) into the data structure
used to chain/store the map element references (struct pat_ref_elt) , e.g. the
struct pattern_tree when stored into an ebtree or struct pattern_list when
chained to a list.
Modify pat_ref_set_elt() to stop inspecting all the expressions attached to a map
and to look for the <elt> element passed as parameter to retrieve the sample data
to be parsed. Indeed, thanks to the pointer added above to each pattern tree nodes
or list elements, they all can be inspected directly from the <elt> passed as
parameter and its ->tree_head and ->list_head member: the pattern tree nodes are
stored into elt->tree_head, and the pattern list elements are chained to
elt->list_head list. This inspection was also the job of pattern_find_smp() which
is no more useful. This patch removes the code of this function.
Organize reference to pattern element of map (struct pat_ref_elt) into an ebtree:
- add an eb_root member to the map (pat_ref struct) and an ebpt_node to its
element (pat_ref_elt struct),
- modify the code to insert these nodes into their ebtrees each time they are
allocated. This is done in pat_ref_append().
Note that ->head member (struct list) of map (struct pat_ref) is not removed
could have been removed. This is not the case because still necessary to dump
the map contents from the CLI in the order the map elememnts have been inserted.
This patch also modifies http_action_set_map() which is the callback at least
used by "set-map" action. The pat_ref_elt element returned by pat_ref_find_elt()
is no more ignored, but reused if not NULL by pat_ref_set() as first element to
lookup from. This latter is also modified to use the ebtree attached to the map
in place of the ->head list attached to each map element (pat_ref_elt struct).
Also modify pat_ref_find_elt() to makes it use ->eb_root map ebtree added to the
map by this patch in place of inspecting all the elements with a strcmp() call.
Willy Tarreau [Mon, 21 Aug 2023 06:45:35 +0000 (08:45 +0200)]
BUG/MINOR: ssl_sock: fix possible memory leak on OOM
That's the classical realloc() issue: if it returns NULL, the old area
is not freed but we erase the pointer. It was brought by commit e18d4e828
("BUG/MEDIUM: ssl: backend TLS resumption with sni and TLSv1.3"), and
should be backported where this commit was backported.
"converter" was used in place of "action" as a result of a copy-paste
error probably.
Also, rephrasing the "actions" keyword explanation to prevent confusion
between action name (which is the name of new action about to be created)
and action facilities where we want to expose the new action.
This could be backported to every stable versions.
As reported by Coverity in GH #2253, stktable_data_ptr() usage in
hlua_stktable_dump() func is potentially unsafe because
stktable_data_ptr() may return NULL and the returned value is
dereferenced as-is without precautions.
In practise, this should not happen because some error checking was
already performed prior to calling stktable_data_ptr(). But since we're
using the safe stktable_data_ptr() function, all the error checking is
already done within the function, thus all we need to do is check ptr
against NULL instead to protect against NULL dereferences.
This should be backported in every stable versions.
Amaury Denoyelle [Thu, 24 Aug 2023 15:32:50 +0000 (17:32 +0200)]
BUG/MINOR: h2: fix reverse if no timeout defined
h2c.task is not allocated in h2_init() if timeout client/server is not
defined depending on the connection side. This caused crash on
connection reverse due to systematic requeuing of h2c.task in
h2_conn_reverse().
To fix this, check h2c.task in h2_conn_reverse(). If old timeout was
undefined but new one is, h2c.task must be allocated as it was not in
h2_init(). On the opposite situation, if old timeout was defined and new
one is not, h2c.task is freed. In this case, or if neither timeout are
defined, skip the task requeuing.
This bug is easily reproduced by using reverse bind or server with
undefined timeout client/server depending on the connection reverse
direction.
This bug has been introduced by reverse connect support.
No need to backport it.
Amaury Denoyelle [Mon, 14 Aug 2023 09:10:06 +0000 (11:10 +0200)]
REGTESTS: write a full reverse regtest
This test instantiates two haproxy instances :
* first one uses a reverse server with two bind pub and priv
* second one uses a reverse bind to initiate connection to priv endpoint
On startup, only first haproxy instance is up. A client send a request
to pub endpoint and should receive a HTTP 503 as no connection are
available on the reverse server.
Second haproxy instance is started. A delay of 3 seconds is inserted to
wait for the connection between the two LBs. Then a client retry the
request and this time should receive a HTTP 200 reusing the bootstrapped
connection.
Amaury Denoyelle [Mon, 14 Aug 2023 14:38:23 +0000 (16:38 +0200)]
MEDIUM: h2: prevent stream opening before connection reverse completed
HTTP/2 demux must be handled with care for active reverse connection.
Until accept has been completed, it should be forbidden to handle
HEADERS frame as session is not yet ready to handle streams.
To implement this, use the flag H2_CF_DEM_TOOMANY which blocks demux
process. This flag is automatically set just after conn_reverse()
invocation. The flag is removed on rev_accept_conn() callback via a new
H2 ctl enum. H2 tasklet is woken up to restart demux process.
As a side-effect, reporting in H2 mux may be blocked as demux functions
are used to convert error status at the connection level with
CO_FL_ERROR. To ensure error is reported for a reverse connection, check
h2c_is_dead() specifically for this case in h2_wake(). This change also
has its own side-effect : h2c_is_dead() conditions have been adjusted to
always exclude !h2c->conn->owner condition which is always true for
reverse connection or else H2 mux may kill them unexpectedly.
Only minimal steps are done here : HTTP version session counters are
incremented on the listener instance. Also, the connection is inserted
in the mux_stopping_list to ensure it will be actively closed on process
shutdown/listener suspend.
Amaury Denoyelle [Mon, 14 Aug 2023 08:52:50 +0000 (10:52 +0200)]
MINOR: proto_reverse_connect: handle early error before reversal
An error can occured on a reverse connection before accept is completed.
In this case, no parent session can be notified. Instead, wake up the
receiver task on conn_create_mux().
As a counterpart to this, receiver task is extended to match CO_FL_ERROR
flag on pending connection. In this case, the onnection is freed. The
task is then requeued with a 1 second delay to start a new reverse
connection attempt.
Amaury Denoyelle [Wed, 23 Aug 2023 15:16:07 +0000 (17:16 +0200)]
MEDIUM: proto_reverse_connect: bootstrap active reverse connection
Implement active reverse connection initialization. This is done through
a new task stored in the receiver structure. This task is instantiated
via bind callback and first woken up via enable callback.
Task handler is separated into two halves. On the first step, a new
connection is allocated and stored in <pend_conn> member of the
receiver. This new client connection will proceed to connect using the
server instance referenced in the bind_conf.
When connect has successfully been executed and HTTP/2 connection is
ready for exchange after SETTINGS, reverse_connect task is woken up. As
<pend_conn> is still set, the second halve is executed which only
execute listener_accept(). This will in turn execute accept_conn
callback which is defined to return the pending connection.
The task is automatically requeued inside accept_conn callback if bind
maxconn is not yet reached. This allows to specify how many connection
should be opened. Each connection is instantiated and reversed serially
one by one until maxconn is reached.
conn_free() has been modified to handle failure if a reverse connection
fails before being accepted. In this case, no session exists to notify
about the failure. Instead, reverse_connect task is requeud with a 1
second delay, giving time to fix a possible network issue. This will
allow to attempt a new connection reverse.
Note that for the moment connection rebinding after accept is disabled
for simplicity. Extra operations are required to migrate an existing
connection and its stack to a new thread which will be implemented
later.
Amaury Denoyelle [Wed, 23 Aug 2023 16:02:51 +0000 (18:02 +0200)]
MINOR: connection: prepare init code paths for active reverse
When an active reverse connection is initialized, it has no stream-conn
attached to it contrary to other backend connections. This forces to add
extra check on stream existence in conn_create_mux() and h2_init().
There is also extra checks required for session_accept_fd() after
reverse and accept is done. This is because contrary to other frontend
connections, reversed connections have already initialized their mux and
transport layers. This forces us to skip the majority of
session_accept_fd() initialization part.
Finally, if session_accept_fd() is interrupted due to an early error, a
reverse connection cannot be freed directly or else mux will remain
alone. Instead, the mux destroy callback is used to free all connection
elements properly.
MINOR: proto_reverse_connect: parse rev@ addresses for bind
Implement parsing for "rev@" addresses on bind line. On config parsing,
server name is stored on the bind_conf.
Several new callbacks are defined on reverse_connect protocol to
complete parsing. listen callback is used to retrieve the server
instance from the bind_conf server name. If found, the server instance
is stored on the receiver. Checks are implemented to ensure HTTP/2
protocol only is used by the server.
MINOR: connection: extend conn_reverse() for active reverse
Implement active reverse support inside conn_reverse(). This is used to
transfer the connection from the backend to the frontend side.
A new flag is defined CO_FL_REVERSED which is set just after this
transition. This will be used to identify connections which were
reversed but not yet accepted.
Amaury Denoyelle [Tue, 22 Aug 2023 15:01:57 +0000 (17:01 +0200)]
REGTESTS: provide a reverse-server test with name argument
This regtest is similar to the previous one, except the optional name
argument is specified.
An extra haproxy instance is used as a gateway for clear/TLS as vtest
does not support TLS natively.
A first request is done by specifying a name which does not match the
idle connection SNI. This must result in a HTTP 503. Then the correct
name is used which must result in a 200.
Amaury Denoyelle [Tue, 22 Aug 2023 14:58:55 +0000 (16:58 +0200)]
MINOR: connection: use attach-srv name as SNI reuse parameter on reverse
On connection passive reverse from frontend to backend, its hash node is
calculated to be able to select it from the idle server pool. If
attach-srv rule defined an associated name, reuse it as the value for
SNI prehash.
This change allows a client to select a reverse connection by its name
by configuring its server line with a SNI to permit this.
MINOR: tcp-act: define optional arg name for attach-srv
Add an optional argument 'name' for attach-srv rule. This contains an
expression which will be used as an identifier inside the server idle
pool after reversal. To match this connection for a future transfer
through the server, the SNI server parameter must match this name. If no
name is defined, match will only occur with an empty SNI value.
For the moment, only the parsing step is implemented. An extra check is
added to ensure that the reverse server uses SSL with a SNI. Indeed, if
name is defined but server does not uses a SNI, connections will never
be selected on reused after reversal due to a hash mismatch.
Test support for reverse server. This can be test without the opposite
haproxy reversal support though a combination of VTC clients used to
emit HTTP/2 responses after connection.
This test ensures that first we get a 503 when connecting on a reverse
server with no idle connection. Then a dummy VTC client is connected to
act as as server. It is then expected that the same request is achieved
with a 200 this time.
Create a new tcp-request session rule 'attach-srv'.
The parsing handler is used to extract the server targetted with the
notation 'backend/server'. The server instance is stored in the act_rule
instance under the new union variant 'attach_srv'.
Extra checks are implemented in parsing to ensure attach-srv is only
used for proxy in HTTP mode and with listeners/server with no explicit
protocol reference or HTTP/2 only.
The action handler itself is really simple. It assigns the stored server
instance to the 'reverse' member of the connection instance. It will be
used in a future patch to implement passive reverse-connect.
Amaury Denoyelle [Fri, 18 Aug 2023 14:02:35 +0000 (16:02 +0200)]
MINOR: backend: only allow reuse for reverse server
A reverse server relies solely on its pool of idle connection to
transfer requests which will be populated through a new tcp-request rule
'attach-srv'.
Several changes are required on connect_server() to implement this.
First, reuse mode is forced to always for this type of server. Then, if
no idle connection is found, the request will be aborted. This results
with a 503 HTTP error code, similarly to when no server is available.
Implement reverse-connect server. This server type cannot instantiate
its own connection on transfer. Instead, it can only reuse connection
from its idle pool. These connections will be populated using the future
'tcp-request session attach-srv' rule.
A reverse-connect has no address. Instead, it uses a new custom server
notation with '@' character prefix. For the moment, only '@reverse' is
defined. An extra check is implemented to ensure server is used in a
HTTP proxy.
MEDIUM: h2: reverse connection after SETTINGS reception
Reverse connection after SETTINGS reception if it was set as reversable.
This operation is done in a new function h2_conn_reverse(). It regroups
common changes which are needed for both reversal direction :
H2_CF_IS_BACK is set or unset and timeouts are inverted.
For the moment, only passive reverse is fully implemented. Once done,
the connection instance is directly inserted in its targetted server
pool. It can then be used immediately for future transfers using this
server.
Define a new method conn_reverse(). This method is used to reverse a
connection from frontend to backend or vice-versa depending on its
initial status.
For the moment, passive reverse only is implemented. This covers the
transition from frontend to backend side. The connection is detached
from its owner session which can then be freed. Then the connection is
linked to the server instance.
only for passive connection on
frontend to transfer them on the backend side. This requires to free the
connection session after detaching it from.
MINOR: connection: centralize init/deinit of backend elements
A connection contains extra elements which are only used for the backend
side. Regroup their allocation and deallocation in two new functions
named conn_backend_init() and conn_backend_deinit().
No functional change is introduced with this commit. The new functions
are reused in place of manual alloc/dealloc in conn_new() / conn_free().
This patch will be useful for reverse connect support with connection
conversion from backend to frontend side and vice-versa.
Several CLI handlers use a server argument specified with the format
'<backend>/<server>'. The parsing of this arguement is done in two
steps, first splitting the string with '/' delimiter and then use
get_backend_server() to retrieve the server instance.
Refactor this code sections with the following changes :
* splitting is reimplented using ist API
* get_backend_server() is removed. Instead use the already existing
proxy_be_by_name() then server_find_by_name() which contains
duplicated code with the now removed function.
No functional change occurs with this commit. However, it will be useful
to add new configuration options reusing the same '<backend>/<server>'
for reverse connect.
Willy Tarreau [Thu, 24 Aug 2023 10:01:06 +0000 (12:01 +0200)]
IMPORT: xxhash: update xxHash to version 0.8.2
Peter Varkoly reported a build issue on ppc64le in xxhash.h. Our version
(0.8.1) was the last one 9 months ago, and since then this specific issue
was addressed in 0.8.2, so let's apply the maintenance update.
Willy Tarreau [Tue, 22 Aug 2023 05:22:05 +0000 (07:22 +0200)]
MINOR: pattern: do not needlessly lookup the LRU cache for empty lists
If a pattern list is empty, there's no way we can find its elements in
the pattern cache, so let's avoid this expensive lookup. This can happen
for ACLs or maps loaded from files that may optionally be empty for
example. Doing so improves the request rate by roughly 10% for a single
such match for only 8 threads. That's normal because the LRU cache
pre-creates an entry that is about to be committed for the case the list
lookup succeeds after a miss, so we bypass all this.
BUG/MINOR: quic: allow-0rtt warning must only be emitted with quic bind
When built with USE_QUIC_OPENSSL_COMPAT, a warning is emitted when using
allow-0rtt. However this warning is emitted for every allow-0rtt
keywords on the bind line which is confusing, it must only be done in
case the bind is a quic one. Also this does not handle the case where
the allow-0rtt keyword is in the crt-list.
This patch moves the warning to ssl_quic_initial_ctx() in order to emit
the warning in every useful cases.
MINOR: quic+openssl_compat: Do not start without "limited-quic"
Add a check for limited-quic in check_config_validity() when compiled
with USE_QUIC_OPENSSL_COMPAT so that we prevent a config from starting
accidentally with limited QUIC support. If a QUIC listener is found
when using the compatibility mode and limited-quic is not set, an error
message is reported explaining that the SSL library is not compatible
and proposing the user to enable limited-quic if that's what they want,
and the startup fails.
This partially reverts commit 7c730803d ("MINOR: quic: Warning for
OpenSSL wrapper QUIC bindings without "limited-quic"") since a warning
was not sufficient.
Amaury Denoyelle [Thu, 17 Aug 2023 09:04:42 +0000 (11:04 +0200)]
BUILD/IMPORT: fix compilation with PLOCK_DISABLE_EBO=1
Compilation is broken due to missing __pl_wait_unlock_long() definition
when building with PLOCK_DISABLE_EBO=1. This has been introduced since
the following commit which activates the inlining version of
pl_wait_unlock_long() :
commit 071d689a514dac522ac3654f53bc22214b5716d0
MINOR: threads: inline the wait function for pthread_rwlock emulation
Add an extra check on PLOCK_DISABLE_EBO before choosing the inline or
default version of pl_wait_unlock_long() to fix this.
Willy Tarreau [Thu, 17 Aug 2023 07:04:35 +0000 (09:04 +0200)]
MINOR: pools: use EBO to wait for unlock during pool_flush()
pool_flush() could become a source of contention on the pool's free list
if there are many competing thread using that pool. Let's make sure we
use EBO and not just a simple CPU relaxation there, to avoid disturbing
them.
Willy Tarreau [Thu, 17 Aug 2023 06:59:15 +0000 (08:59 +0200)]
MINOR: atomic: make sure to always relax after a failed CAS
There were a few places left where we forgot to call __ha_cpu_relax()
after a failed CAS, in the HA_ATOMIC_UPDATE_{MIN,MAX} macros, and in
a few sync_* API macros (the same as above plus HA_ATOMIC_CAS and
HA_ATOMIC_XCHG). Let's add them now.
This could have been a cause of contention, particularly with
process_stream() calling stream_update_time_stats() which uses 8 of them
in a call (4 for the server, 4 for the proxy). This may be a possible
explanation for the high CPU consumption reported in GH issue #2251.
This should be backported at least to 2.6 as it's harmless.
Willy Tarreau [Wed, 16 Aug 2023 20:51:37 +0000 (22:51 +0200)]
MINOR: threads: inline the wait function for pthread_rwlock emulation
When using pthread_rwlock emulation, contention is reported on
pl_wait_unlock_long(). This is really not convenient to analyse what is
happening. Now plock supports inlining the wait call for just the lorw
functions by enabling PLOCK_LORW_INLINE_WAIT. Let's do this so that now
the wait time will be precisely reported as either pthread_rwlock_rdlock()
or pthread_rwlock_wrlock() depending on the contended function, but no
more on pl_wait_unlock_long(), which will still be reported for all
other locks.
Willy Tarreau [Wed, 16 Aug 2023 20:40:05 +0000 (22:40 +0200)]
IMPORT: lorw: support inlining the wait call
Now when PLOCK_LORW_INLINE_WAIT is defined, the pl_wait_unlock_long()
calls in pl_lorw_rdlock() and pl_lorw_wrlock() will be inlined so that
all the CPU time is accounted for in the calling function.
Willy Tarreau [Wed, 16 Aug 2023 20:29:45 +0000 (22:29 +0200)]
IMPORT: plock: always expose the inline version of the lock wait function
Doing so will allow to expose the time spent in certain highly
contended functions, which can be desirable for more accurate CPU
profiling. For example this could be done in locking functions that
are already not inlined so that they are the ones being reported as
those consuming the CPU instead of just pl_wait_unlock_long().
Willy Tarreau [Wed, 16 Aug 2023 20:25:34 +0000 (22:25 +0200)]
IMPORT: plock: also support inlining the int code
Commit 9db830b ("plock: support inlining exponential backoff code")
added an option to support inlining of the wait code for longs but
forgot to do it for ints. Let's do it now.
DEV: makefile: fix POSIX compatibility for "range" target
make "range" which was introduced with 06d34d4 ("DEV: makefile: add a
new "range" target to iteratively build all commits") does not work with
POSIX shells (namely: bourne shell), and will fail with this kind of
errors:
This is because arrays and arithmetic expressions which are used for the
"range" target are not supported by sh (unlike bash and other "modern"
interpreters).
However the make "all" target already complies with POSIX, so in this
commit we try to make "range" target POSIX compliant to ensure that the
makefile works as expected on systems where make uses /bin/sh as default
intepreter and where /bin/sh points to POSIX shell.
Willy Tarreau [Mon, 14 Aug 2023 11:03:46 +0000 (13:03 +0200)]
SCRIPTS: git-show-backports: automatic ref and base detection with -m
When running with -m (check for missing backports) we often have to fill
lots of information that can be determined automatically the vast majority
of the time:
- restart point (last cherry-picked ID from one of the last commits)
- current branch (HEAD)
- reference branch (the one that contains most of the last commits)
These elements are not that hard to determine, so let's make sure we
can fall back to them when running in missing mode.
The reference branch is guessed by looking at the upstream branch that
most frequently contains some of the last 10 commits. It can be inaccurate
if multiple branches exist with these commits, or when upstream changes
due to a non-LTS branch disappearing in the middle of the series, in which
case passing "-r" will help. But most of the time it works OK. It also gives
precedence to local branches over remote ones for such choices. A test in
2.4 at commit 793a4b520 correctly shows 2.6/master as the upstream despite
2.5 having been used for the early ones of the tag.
For the restart point, we assume that the most recent commit that was
backported serves as a reference (and not the most recently backported
commit). This means that the usual case where an old commit was found
to be missing will not fool the analysis. Commits are inspected from
2 commits before the last tag, and reordered from the parent's tree
to see which one is the last one.
With this, it's sufficient to issue "git-show-backports -q -m" to get
the list of backports from the upstream branch, restarting from the
last backported one.
Johannes Naab [Thu, 10 Aug 2023 12:10:37 +0000 (14:10 +0200)]
DOC: typo: fix sc-set-gpt references
Only sc-inc-gpc and sc-set-gpt do exist. The mix-up sc-inc-gpt crept in
in 71d189219 (DOC: config: Rework and uniformize how TCP/HTTP rules are
documented, 2021-10-14) and got copied in a92480462 (MINOR: http-rules:
Add missing actions in http-after-response ruleset, 2023-01-05).
BUG/MINOR: stktable: allow sc-add-gpc from tcp-request connection
Following the previous commit's logic, we enable the use of sc-add-gpc
from tcp-request connection since it was probably forgotten in the first
place for sc-set-gpt0, and since sc-add-gpc was inspired from it, it also
lacks its.
As sc-add-gpc was implemented in 5a72d03a58 ("MINOR: stick-table: implement
the sc-add-gpc() action"), this should only be backported to 2.8
BUG/MINOR: stktable: allow sc-set-gpt(0) from tcp-request connection
Both the documentation and original developer intents seem to suggest
that sc-set-gpt/sc-set-gpt0 actions should be available from
tcp-request connection.
Yet because it was probably forgotten when expr support was added to
sc-set-gpt0 in 0d7712dff0 ("MINOR: stick-table: allow sc-set-gpt0 to
set value from an expression") it doesn't work and will report this
kind of errors:
"internal error, unexpected rule->from=0, please report this bug!"
Fixing the code to comply with the documentation and the expected
behavior.
This must be backported to every stable versions.
[for < 2.5, as only sc-set-gpt0 existed back then, the patch must be
manually applied to skip irrelevant parts]
Willy Tarreau [Sat, 12 Aug 2023 17:59:27 +0000 (19:59 +0200)]
[RELEASE] Released version 2.9-dev3
Released version 2.9-dev3 with the following main changes :
- BUG/MINOR: ssl: OCSP callback only registered for first SSL_CTX
- BUG/MEDIUM: h3: Properly report a C-L header was found to the HTX start-line
- MINOR: sample: add pid sample
- MINOR: sample: implement act_conn sample fetch
- MINOR: sample: accept_date / request_date return %Ts / %tr timestamp values
- MEDIUM: sample: implement us and ms variant of utime and ltime
- BUG/MINOR: sample: check alloc_trash_chunk() in conv_time_common()
- DOC: configuration: describe Td in Timing events
- MINOR: sample: implement the T* timer tags from the log-format as fetches
- DOC: configuration: add sample fetches for timing events
- BUG/MINOR: quic: Possible crash when acknowledging Initial v2 packets
- MINOR: quic: Export QUIC traces code from quic_conn.c
- MINOR: quic: Export QUIC CLI code from quic_conn.c
- MINOR: quic: Move TLS related code to quic_tls.c
- MINOR: quic: Add new "QUIC over SSL" C module.
- MINOR: quic: Add a new quic_ack.c C module for QUIC acknowledgements
- CLEANUP: quic: Defined but no more used function (quic_get_tls_enc_levels())
- MINOR: quic: Split QUIC connection code into three parts
- CLEANUP: quic: quic_conn struct cleanup
- MINOR: quic; Move the QUIC frame pool to its proper location
- BUG/MINOR: chunk: fix chunk_appendf() to not write a zero if buffer is full
- BUG/MEDIUM: h3: Be sure to handle fin bit on the last DATA frame
- DOC: configuration: rework the custom log format table
- BUG/MINOR: quic+openssl_compat: Non initialized TLS encryption levels
- CLEANUP: acl: remove cache_idx from acl struct
- REORG: cfgparse: extract curproxy as a global variable
- MINOR: acl: add acl() sample fetch
- BUILD: cfgparse: keep a single "curproxy"
- BUG/MEDIUM: bwlim: Reset analyse expiration date when then channel analyse ends
- MEDIUM: stream: Reset response analyse expiration date if there is no analyzer
- BUG/MINOR: htx/mux-h1: Properly handle bodyless responses when splicing is used
- BUG/MEDIUM: quic: consume contig space on requeue datagram
- BUG/MINOR: http-client: Don't forget to commit changes on HTX message
- CLEANUP: stconn: Move comment about sedesc fields on the field line
- REGTESTS: http: Create a dedicated script to test spliced bodyless responses
- REGTESTS: Test SPLICE feature is enabled to execute script about splicing
- BUG/MINOR: quic: reappend rxbuf buffer on fake dgram alloc error
- BUILD: quic: fix wrong potential NULL dereference
- MINOR: h3: abort request if not completed before full response
- BUG/MAJOR: http-ana: Get a fresh trash buffer for each header value replacement
- CLEANUP: quic: Remove quic_path_room().
- MINOR: quic: Amplification limit handling sanitization.
- MINOR: quic: Move some counters from [rt]x quic_conn anonymous struct
- MEDIUM: quic: Send CONNECTION_CLOSE packets from a dedicated buffer.
- MINOR: quic: Use a pool for the connection ID tree.
- MEDIUM: quic: Allow the quic_conn memory to be asap released.
- MINOR: quic: Release asap quic_conn memory (application level)
- MINOR: quic: Release asap quic_conn memory from ->close() xprt callback.
- MINOR: quic: Warning for OpenSSL wrapper QUIC bindings without "limited-quic"
- REORG: http: move has_forbidden_char() from h2.c to http.h
- BUG/MAJOR: h3: reject header values containing invalid chars
- MINOR: mux-h2/traces: also suggest invalid header upon parsing error
- MINOR: ist: add new function ist_find_range() to find a character range
- MINOR: http: add new function http_path_has_forbidden_char()
- MINOR: h2: pass accept-invalid-http-request down the request parser
- REGTESTS: http-rules: add accept-invalid-http-request for normalize-uri tests
- BUG/MINOR: h1: do not accept '#' as part of the URI component
- BUG/MINOR: h2: reject more chars from the :path pseudo header
- BUG/MINOR: h3: reject more chars from the :path pseudo header
- REGTESTS: http-rules: verify that we block '#' by default for normalize-uri
- DOC: clarify the handling of URL fragments in requests
- BUG/MAJOR: http: reject any empty content-length header value
- BUG/MINOR: http: skip leading zeroes in content-length values
- BUG/MEDIUM: mux-h1: fix incorrect state checking in h1_process_mux()
- BUG/MEDIUM: mux-h1: do not forget EOH even when no header is sent
- BUILD: mux-h1: shut a build warning on clang from previous commit
- DEV: makefile: add a new "range" target to iteratively build all commits
- CI: do not use "groupinstall" for Fedora Rawhide builds
- CI: get rid of travis-ci wrapper for Coverity scan
- BUG/MINOR: quic: mux started when releasing quic_conn
- BUG/MINOR: quic: Possible crash in quic_cc_conn_io_cb() traces.
- MINOR: quic: Add a trace for QUIC conn fd ready for receive
- BUG/MINOR: quic: Possible crash when issuing "show fd/sess" CLI commands
- BUG/MINOR: quic: Missing tasklet (quic_cc_conn_io_cb) memory release (leak)
- BUG/MEDIUM: quic: fix tasklet_wakeup loop on connection closing
- BUG/MINOR: hlua: fix invalid use of lua_pop on error paths
- MINOR: hlua: add hlua_stream_ctx_prepare helper function
- BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread
- MAJOR: threads/plock: update the embedded library again
- MINOR: stick-table: move the task_queue() call outside of the lock
- MINOR: stick-table: move the task_wakeup() call outside of the lock
- MEDIUM: stick-table: change the ref_cnt atomically
- MINOR: stick-table: better organize the struct stktable
- MEDIUM: peers: update ->commitupdate out of the lock using a CAS
- MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
- MEDIUM: peers: only read-lock peer_send_teachmsgs()
- MEDIUM: stick-table: use a distinct lock for the updates tree
- MEDIUM: stick-table: touch updates under an upgradable read lock
- MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
- MINOR: stick-table: move the update lock into its own cache line
- CLEANUP: stick-table: slightly reorder the stktable struct
- BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
- MINOR: tools: make ptr_hash() support 0-bit outputs
- MINOR: tools: improve ptr hash distribution on 64 bits
- OPTIM: tools: improve hash distribution using a better prime seed
- OPTIM: pools: use exponential back-off on shared pool allocation/release
- OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
- MINOR: pools: introduce the use of multiple buckets
- MEDIUM: pools: spread the allocated counter over a few buckets
- MEDIUM: pools: move the used counter over a few buckets
- MEDIUM: pools: move the needed_avg counter over a few buckets
- MINOR: pools: move the failed allocation counter over a few buckets
- MAJOR: pools: move the shared pool's free_list over multiple buckets
- MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
- BUILD: pools: fix build error on clang with inline vs forceinline
Willy Tarreau [Sat, 12 Aug 2023 17:58:17 +0000 (19:58 +0200)]
BUILD: pools: fix build error on clang with inline vs forceinline
clang is more picky than gcc regarding duplicate "inline". The functions
declared with "forceinline" don't need to have "inline" since it's already
in the macro.
MAJOR: pools: move the shared pool's free_list over multiple buckets
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.
It would be nice to improve the situation to make sure that releases to
the shared pools doesn't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.
With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
MINOR: pools: move the failed allocation counter over a few buckets
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
MEDIUM: pools: move the needed_avg counter over a few buckets
That's the same principle as for ->allocated and ->used. Here we return
the summ of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better observe first).
A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.
With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
MEDIUM: pools: move the used counter over a few buckets
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.
A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.
With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
MEDIUM: pools: spread the allocated counter over a few buckets
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.
A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so everywhere it's
possible to avoid it by keeping a local counter, it's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
Willy Tarreau [Sat, 12 Aug 2023 09:22:27 +0000 (11:22 +0200)]
MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).
Instead we'll just hash the pointer, it provides an index that does the
job and that is consistent for the object. When having just a few entries
(16 here as it showed almost identical performance between global and
non-global pools) even iterations should be short enough during
measurements to not be a problem.
A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.
For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
Willy Tarreau [Sat, 12 Aug 2023 09:07:48 +0000 (11:07 +0200)]
OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
The pool's allocation counter doesn't strictly require to be updated
from these functions, it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
OPTIM: pools: use exponential back-off on shared pool allocation/release
Running a stick-table stress with -dMglobal under 56 threads shows
extreme contention on the pool's free_list because it has to be
processed in two phases and only used to implement a cpu_relax() on
the retry path.
Let's at least implement exponential back-off here to limit the neighbor's
noise and reduce the time needed to successfully acquire the pointer. Just
doing so shows there's still contention but almost doubled the performance,
from 1.1 to 2.1M req/s.
Willy Tarreau [Sat, 12 Aug 2023 15:28:13 +0000 (17:28 +0200)]
OPTIM: tools: improve hash distribution using a better prime seed
During tests it was noticed that the current hash is not that good
on 4- and 5- bit hashes. About 7.5% of all the 32-bit primes were tested
as candidates for the hash function, by submitting them 128 arrangements
of N pointers among 40k extracted from haproxy's pools, and the average
fill rates for 1- to 12- bit hashes were measured and compared. It was
clear that some values do not provide great hashes and other ones are
way more resistant.
The current value is not bad at all but delivers 42.6% unique 2-bit
outputs, 41.6% 3-bit, 38.0% 4-bit, 38.2% 5-bit and 37.1% 10-bit. Some
values did perform significantly better, among which 0xacd1be85 which
does 43.2% 2-bit, 42.5% 3-bit, 42.2% 4-bit, 39.2% 5-bit and 37.3% 10-bit.
The reverse value used in the ptr2_hash() was really underperforming and
was replaced with 0x9d28e4e9 which does 49.6%, 40.4%, 42.6%, 39.1%, and
37.2% respectvely.
This should slightly improve the accuracy of the task and memory
profiling, and will be useful for pools.
Willy Tarreau [Fri, 11 Aug 2023 18:11:30 +0000 (20:11 +0200)]
MINOR: tools: improve ptr hash distribution on 64 bits
When testing the pointer hash on 64-bit real pointers (map entries),
it appeared that the shift by 33 bits that hoped to compensate for the
3 nul LSB degrades the hash, and the centering is more optimal on
31-(bits+1)/2. This makes sense since the topmost bit of the
multiplicator is 31, so for an input of 1 bit and 1 bit of output we
would always get zero. With the formula adjusted this way, we can get
up to ~15% more unique entries at 10 bits and ~24% more at 11 bits.
MINOR: tools: make ptr_hash() support 0-bit outputs
When dealing with macro-based size definitions, it is useful to be able
to hash pointers on zero bits so that the macro automatically returns a
constant 0. For now it only supports 1-32. Let's just add this special
case. It's automatically optimized out by the compiler since the function
is inlined.
Willy Tarreau [Sat, 12 Aug 2023 17:01:10 +0000 (19:01 +0200)]
BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.
LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
Willy Tarreau [Mon, 7 Aug 2023 19:32:36 +0000 (21:32 +0200)]
CLEANUP: stick-table: slightly reorder the stktable struct
By moving the config-time stuff after the updt_lock, we can plug some
holes without interfering with it. This allows us to get back to the
768-bytes struct. The performance was not affected at all.
Willy Tarreau [Mon, 7 Aug 2023 19:15:40 +0000 (21:15 +0200)]
MINOR: stick-table: move the update lock into its own cache line
The read-lock contention observed on the update lock while turning it
into an upgradable lock were due to false sharing with the nearby
updates. Simply moving the lock alone into its own cache line is
sufficient to almost double the performance again, raising from 2355
to 4480k RPS with very low contention:
Unfortunately, trying to move the line anywhere else didn't work,
despite the remaining holes, because this structure is not quite
clean. This adds 64 bytes to a struct that was already 768 long,
so it's now 832. It's possible to repack it a little bit and regain
these bytes by removing the THREAD_ALIGN before "keys" because we
rarely use the config stuff, but that's a bit unsafe.
Willy Tarreau [Mon, 7 Aug 2023 18:17:50 +0000 (20:17 +0200)]
MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
The function drops the lock very early, and the only operations that
are performed on the entry code are updating the current peer's
last_local_table, which doesn't need to be protected. Thus it's
easier to drop the lock before entering the function and it further
limits its scope.
This has raised the peak RPS from 2050 to 2355k/s with a peers section on
the 80-core machine.
Willy Tarreau [Mon, 7 Aug 2023 19:03:24 +0000 (21:03 +0200)]
MEDIUM: stick-table: touch updates under an upgradable read lock
Instead of taking the update's write lock in stktable_touch_with_exp(),
while most of the time under high load there is nothing to update because
the entry is touched before having been synchronized present, let's do
the check under a read lock and upgrade it to perform the update if
needed. These updates are rare and the contention is not expected to be
very high, so at the first failure to upgrade we retry directly with a
write lock.
By doing so the performance has almost doubled again, from 1140 to 2050k
with a peers section enabled. The contention is now on taking the read
lock itself, so there's little to be gained beyond this in this function.
Willy Tarreau [Sat, 27 May 2023 17:55:15 +0000 (17:55 +0000)]
MEDIUM: stick-table: use a distinct lock for the updates tree
Updating an entry in the updates tree is currently performed under the
table's write lock, which causes huge contention with other accesses
such as lookups and free. Aside the updates tree, the update,
localupdate and commitupdate variables, nothing is manipulated, so
let's create a distinct lock (updt_lock) to protect these together
to remove this contention. It required to add an extra lock in the
few places where we delete the update (though only if we're really
going to delete it) to protect the tree. This is very convenient
because now peer_send_teachmsgs() only needs to take this read lock,
and there is very little contention left on the stick-table.
With this alone, the performance jumped from 614k to 1140k/s on a
80-thread machine with a peers section! Stick-table updates with
no peers however now has to stand two locks and slightly regressed
from 4.0-4.1M/s to 3.9-4.0. This is fairly minimal compared to the
significant unlocking of the peers updates and considered totally
acceptable.
Willy Tarreau [Mon, 7 Aug 2023 17:39:27 +0000 (19:39 +0200)]
MEDIUM: peers: only read-lock peer_send_teachmsgs()
This function doesn't need to be write-locked. It performs a lookup
of the next update at its index, atomically updates the ref_cnt on
the stksess, updates some shared_table fields on the local thread,
and updates the table's commitupdate. Now that this update is atomic
we don't need to keep the write lock during that period. In addition
this function's callers do not rely on the write lock to be held
either since it was droped during peer_send_updatemsg() anyway.
Now, when the function is entered with a write lock, it's downgraded
to a read lock, otherwise a read lock is grabbed. Updates are looked
up under the read lock and the message is sent without the lock. The
commitupdate is still performed under the read lock (so as not to
break the code too much), and the write lock is re-acquired when
leaving if needed. This allows multiple peers to look up updates in
parallel and to avoid stalling stick-table lookups.
MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
This function maintains the write lock for a while. In practice it does
not need to hold it that long, and some parts could be performed under a
read lock. This patch first drops then re-acquires the write lock at the
function's entry. The purpose is simply to break the end-to-end atomicity
to prove that it has no impact in case something needs to be bisected
later. In fact the write lock is already dropped while calling
peer_send_updatemsg().
MEDIUM: peers: update ->commitupdate out of the lock using a CAS
The ->commitupdate index doesn't need to be kept consistent with other
operations, it only needs to be correct and to reflect the last known
value. Right now it's updated under the stick-table lock, which is
expensive and maintains this lock longer than needed. Let's move it
outside of the lock, and update it using a CAS. This patch simply
replaces the assignment with a CAS and makes sure all reads are atomic.
On failed CAS we use a simple cpu_relax(), no need for more as there
should not be that much contention here (updates are not that fast).
Willy Tarreau [Sat, 27 May 2023 18:07:41 +0000 (18:07 +0000)]
MINOR: stick-table: better organize the struct stktable
The structure currently mixes R/O and R/W fields, let's organize them
by access type, focusing mainly on splitting the updates from the rest
so that peers activity does not affect the rest. For now it doesn't
bring any benefit but it paves the way for splitting the lock.
Willy Tarreau [Sat, 27 May 2023 16:55:48 +0000 (16:55 +0000)]
MEDIUM: stick-table: change the ref_cnt atomically
Due to the ts->ref_cnt being manipulated and checked inside wrlocks,
we continue to have it updated under plenty of read locks, which have
an important cost on many-thread machines.
This patch turns them all to atomic ops and carefully moves them outside
of locks every time this is possible:
- the ref_cnt is incremented before write-unlocking on creation otherwise
the element could vanish before we can do it
- the ref_cnt is decremented after write-locking on release
- for all other cases it's updated out of locks since it's guaranteed by
the sequence that it cannot vanish
- checks are done before locking every time it's used to decide
whether we're going to release the element (saves several write locks)
- expiration tests are just done using atomic loads, since there's no
particular ordering constraint there, we just want consistent values.
For Lua, the loop that is used to dump stick-tables could switch to read
locks only, but this was not done.
For peers, the loop that builds updates in peer_send_teachmsgs is extremely
expensive in write locks and it doesn't seem this is really needed since
the only updated variables are last_pushed and commitupdate, the first
one being on the shared table (thus not used by other threads) and the
commitupdate could likely be changed using a CAS. Thus all of this could
theoretically move under a read lock, but that was not done here.
On a 80-thread machine with a peers section enabled, the request rate
increased from 415 to 520k rps.
Willy Tarreau [Sat, 27 May 2023 18:35:15 +0000 (18:35 +0000)]
MINOR: stick-table: move the task_wakeup() call outside of the lock
The write lock in stktable_touch_with_exp() is quite expensive and should
be shortened as much as possible. There's no need for it when calling
task_wakeup() so let's move it out.
On a 80-thread machine with a peers section, the request rate increased
from 397k to 415k rps.
Willy Tarreau [Sat, 27 May 2023 18:32:05 +0000 (18:32 +0000)]
MINOR: stick-table: move the task_queue() call outside of the lock
The write lock in stktable_requeue_exp() is quite expensive and should
be shortened as much as possible. There's no need for it when calling
task_queue() so let's move it out.
On a 80-thread machine with a peers section, the request rate increased
from 368k to 397k rps.
Willy Tarreau [Mon, 7 Aug 2023 17:16:33 +0000 (19:16 +0200)]
MAJOR: threads/plock: update the embedded library again
This updates the local copy of the plock library to benefit from finer
memory ordering, EBO on more operations such as when take_w() and stow()
wait for readers to leave and refined EBO, especially on common operation
such as attempts to upgade R to S, and avoids a counter-productive prior
read in rtos() and take_r().
These changes have shown a 5% increase on regular operations on ARM,
a 33% performance increase on ARM on stick-tables and 2% on x86, and
a 14% and 4% improvements on peers updates respectively on ARM and x86.
The availability of relaxed operations will probably be useful for stats
counters which are still extremely expensive to update.
The following plock commits were included in this update:
9db830b plock: support inlining exponential backoff code 008d3c2 plock: make the rtos upgrade faster 2f76dde atomic: clean up the generic xchg() 3c6919b atomic: make sure that the no-return macros do not return a value 97c2bb7 atomic: make the fallback bts use the pointed type for the shift f4c1880 atomic: also implement the missing pl_btr() 8329b82 atomic: guard all generic definitions to make it easier to provide specific ones 7c5cb62 atomic: use C11 atomics when available 96afaf9 atomic: prefer the C11 definitions in general f3ec7a6 atomic: implement load/store/atomic barriers 8bdbd1e atomic: add atomic load/stores 0f604c0 atomic: add more _noret operations 3fe35db atomic: remove the (void) cast from the C11 operations 3b08a7c atomic: allow to define the fallback _noret variants 28deb22 atomic: make x86 arithmetic operations the _noret variants 8061fe2 atomic: handle modern compilers that support returning flags b8b91b7 atomic: add the fetch-and-<op> operations (pl_ld<op>) 59817ca atomic: add memory order variants for most operations a40774f plock: explicitly make use of the pl_*_noret operations 6f1861b plock: switch to pl_sub_noret_lax() for cancellation c013980 plock: use pl_ldadd{_lax,_acq,} instead of pl_xadd() 382eea3 plock: use a release ordering when dropping the lock 60d750d plock: use EBO when waiting for readers to leave in take_w() and stow() fc01c4f plock: improve EBO a little bit 1ef6390 plock: switch to CAS + XADD for pl_take_r()
BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread
Michel Mayen reported that mixing lua actions loaded from 'lua-load'
and 'lua-load-per-thread' directives within a single http/tcp session
yields unexpected results.
When executing action defined in another running context from the one of
the previously executed action (from lua-load, then from
lua-load-per-thread or the opposite, order doesn't matter), it would yield
this kind of error:
"Lua function 'name': [state-id x] runtime error: attempt to call a nil value from ."
He also noted that when loading all actions using the same loading
directive, the issue is gone.
This is due to the fact that for lua actions, fetches and converters, lua
code is being executed from the stream lua context. However, the stream
lua context, which is created on the fly when first executing some lua
code related to the stream, is reused between multiple lua executions.
But the thing is, despite successive executions referring to the same
parent "stream" (which is also assigned to a given thread id), they don't
necessarily depend on the same running context from lua point of view.
Indeed, since the function which is about to be executed could have been
loaded from either 'lua-load' or 'lua-load-per-thread', the function
declaration and related dependencies are defined in a specific stack ID
which is known by calling fcn_ref_to_stack_id() on the given function.
Thus, in order to make streams capable of chaining lua actions, fetches and
converters loaded in different lua stacks, we add a new detection logic
in hlua_stream_ctx_prepare() to be able to recreate the lua context in the
proper stack space when the existing one conflicts with the expected stack
id.
This must be backported in every stable versions.
It depends on:
- "MINOR: hlua: add hlua_stream_prepare helper function"
[for < 2.5, skip the filter part since they didn't exist]
[wt: warning, wait a little bit before backporting too far, we
need to be certain the added BUG_ON() will never trigger]
BUG/MINOR: hlua: fix invalid use of lua_pop on error paths
Multiple error paths made invalid use of lua_pop():
When the stack is emptied using lua_settop(0), lua_pop() (which is
implemented as a lua_settop() macro) should not be used right after,
because it could lead to invalid reads since the stack is already empty.
Unfortunately, some remnants from initial lua stack implementation kept
doing so, resulting in haproxy crashs on some lua runtime errors paths
from time to time (ie: ERRRUN, ERRMEM).
Moreover, the extra lua_pop() instruction, even if it was safe, is totally
pointless in such case.
Removing such unsafe lua_pop() statements when we know that the stack is
already empty.
Amaury Denoyelle [Fri, 11 Aug 2023 14:10:34 +0000 (16:10 +0200)]
BUG/MEDIUM: quic: fix tasklet_wakeup loop on connection closing
It is possible to trigger a loop of tasklets calls if a QUIC connection
is interrupted abruptly by the client. This is caused by the following
interaction :
* FD iocb is woken up for read. This causes a wakeup on quic_conn
tasklet.
* quic_conn_io_cb is run and try to read but fails as the connection
socket is closed (typically with a ECONNREFUSED). FD read is
subscribed to the poller via qc_rcv_buf() which will cause the loop.
The looping will stop automatically once the idle-timeout is expired and
the connection instance is finally released.
To fix this, ensure FD read is subscribed only for transient error cases
(EAGAIN or similar). All other cases are considered as fatal and thus
all future read operations will fail. Note that for the moment, nothing
is reported on the quic_conn which may not skip future reception. This
should be improved in a future commit to accelerate connection closing.
This bug can be reproduced on a frequent occurence by interrupting the
following command. Quic traces should be activated on haproxy side to
detect the loop :
$ ngtcp2-client --tp-file=/tmp/ngtcp2-tp.txt \
--session-file=/tmp/ngtcp2-session.txt \
-r 0.3 -t 0.3 --exit-on-all-streams-close 127.0.0.1 20443 \
"http://127.0.0.1:20443/?s=1024"
The tasklet responsible of handling the remaining QUIC connection object
and its traffic was not released, leading to a memory leak. Furthermore its
callback, quic_cc_conn_io_cb(), should return NULL after this tasklet is
released.
BUG/MINOR: quic: Possible crash when issuing "show fd/sess" CLI commands
->xprt_ctx (struct ssl_sock_ctx) and ->conn (struct connection) must be kept
by the remaining QUIC connection object (struct quic_cc_conn) after having
release the previous one (struct quic_conn) to allow "show fd/sess" commands
to be functional without causing haproxy crashes.
BUG/MINOR: quic: Possible crash in quic_cc_conn_io_cb() traces.
Reset the local cc_qc and qc after having released cc_qc. Note that
cc_qc == qc. This is required to prevent haproxy from crashing
when TRACE_LEAVE() is called.
Willy Tarreau [Wed, 9 Aug 2023 14:52:28 +0000 (16:52 +0200)]
DEV: makefile: add a new "range" target to iteratively build all commits
This will iterate over all commits in the range passed in RANGE, or all
those from master to RANGE if no ".." exists in RANGE, and run "make all"
with the exact same variables. This aims to ease the verification that
no build failure exists inside a series. In case of error, it prints the
faulty commit and stops there with the tree checked out. Example:
$ make-disctcc range RANGE=HEAD
Found 14 commit(s) in range master..HEAD.
Current branch is 20230809-plock+tbl+peers-4
Starting to building now...
[ 1/14 ] 392922bc5 #############################
(...)
Done! 14 commit(s) built successfully for RANGE master..HEAD
Maybe in the future it will automatically use HEAD as a default for RANGE
depending on the feedback.
It's not listed in the help target so as not to encourage users to try it
as it can very quickly become confusing due to the checkouts.
Willy Tarreau [Wed, 9 Aug 2023 14:02:44 +0000 (16:02 +0200)]
BUILD: mux-h1: shut a build warning on clang from previous commit
Commit 5201b4abd ("BUG/MEDIUM: mux-h1: do not forget EOH even when no
header is sent") introduced a build warning on clang due to the remaining
two parenthesis in the expression. Let's fix this. No backport needed.
Willy Tarreau [Wed, 9 Aug 2023 09:58:15 +0000 (11:58 +0200)]
BUG/MEDIUM: mux-h1: do not forget EOH even when no header is sent
Since commit 723c73f8a ("MEDIUM: mux-h1: Split h1_process_mux() to make
code more readable"), outgoing H1 requests with no header at all (i.e.
essentially HTTP/1.0 requests) get delayed by 200ms. Christopher found
that it's due to the fact that we end processing too early and we don't
have the opportunity to send the EOH in this case.
This fix addresses it by verifying if it's required to emit EOH when
retruning from h1_make_headers(). But maybe that block could be moved
after the while loop in fact, or the stop condition in the loop be
revisited not to stop of !htx_is_empty(). The current solution gets the
job done at least.
Willy Tarreau [Wed, 9 Aug 2023 09:51:58 +0000 (11:51 +0200)]
BUG/MEDIUM: mux-h1: fix incorrect state checking in h1_process_mux()
That's a regression introduced in 2.9-dev by commit 723c73f8a ("MEDIUM:
mux-h1: Split h1_process_mux() to make code more readable") and found
by Christopher. The consequence is uncertain but the test definitely was
not right in that it would catch most existing states (H1_MSG_DONE=30).
At least it would emit too many "H1 request fully xferred".
Willy Tarreau [Wed, 9 Aug 2023 09:02:34 +0000 (11:02 +0200)]
BUG/MINOR: http: skip leading zeroes in content-length values
Ben Kallus also noticed that we preserve leading zeroes on content-length
values. While this is totally valid, it would be safer to at least trim
them before passing the value, because a bogus server written to parse
using "strtol(value, NULL, 0)" could inadvertently take a leading zero
as a prefix for an octal value. While there is not much that can be done
to protect such servers in general (e.g. lack of check for overflows etc),
at least it's quite cheap to make sure the transmitted value is normalized
and not taken for an octal one.
This is not really a bug, rather a missed opportunity to sanitize the
input, but is marked as a bug so that we don't forget to backport it to
stable branches.
A combined regtest was added to h1or2_to_h1c which already validates
end-to-end syntax consistency on aggregate headers.
Willy Tarreau [Wed, 9 Aug 2023 06:32:48 +0000 (08:32 +0200)]
BUG/MAJOR: http: reject any empty content-length header value
The content-length header parser has its dedicated function, in order
to take extreme care about invalid, unparsable, or conflicting values.
But there's a corner case in it, by which it stops comparing values
when reaching the end of the header. This has for a side effect that
an empty value or a value that ends with a comma does not deserve
further analysis, and it acts as if the header was absent.
While this is not necessarily a problem for the value ending with a
comma as it will be cause a header folding and will disappear, it is a
problem for the first isolated empty header because this one will not
be recontructed when next ones are seen, and will be passed as-is to the
backend server. A vulnerable HTTP/1 server hosted behind haproxy that
would just use this first value as "0" and ignore the valid one would
then not be protected by haproxy and could be attacked this way, taking
the payload for an extra request.
In field the risk depends on the server. Most commonly used servers
already have safe content-length parsers, but users relying on haproxy
to protect a known-vulnerable server might be at risk (and the risk of
a bug even in a reputable server should never be dismissed).
A configuration-based work-around consists in adding the following rule
in the frontend, to explicitly reject requests featuring an empty
content-length header that would have not be folded into an existing
one:
http-request deny if { hdr_len(content-length) 0 }
The real fix consists in adjusting the parser so that it always expects a
value at the beginning of the header or after a comma. It will now reject
requests and responses having empty values anywhere in the C-L header.
This needs to be backported to all supported versions. Note that the
modification was made to functions h1_parse_cont_len_header() and
http_parse_cont_len_header(). Prior to 2.8 the latter was in
h2_parse_cont_len_header(). One day the two should be refused but the
former is also used by Lua.
The HTTP messaging reg-tests were completed to test these cases.
Thanks to Ben Kallus of Dartmouth College and Narf Industries for
reporting this! (this is in GH #2237).
Willy Tarreau [Tue, 8 Aug 2023 17:35:25 +0000 (19:35 +0200)]
DOC: clarify the handling of URL fragments in requests
We indicate in path/pathq/url that they may contain '#' if the frontend
is configured with "option accept-invalid-http-request", and that option
mentions the fragment as well.