BUG/MINOR: sink: retry attempt for sft server may never occur
Since 9561b9fb6 ("BUG/MINOR: sink: add tempo between 2 connection
attempts for sft servers"), the tempo used to schedule the task expiry
may end up being TICK_ETERNITY, because ticks were added to tempo with
a plain addition that doesn't account for potential wrapping.

When this happens (relatively rare, since now_ms only wraps every 49.7
days, but a forced wrap occurs 20 seconds after haproxy is started, so
it is more likely to happen there), the process_sink_forward() task
expiry is set to TICK_ETERNITY and the task may never be called again;
this is especially true if the ring section only contains a single
server.

To fix the issue, use the tick_add() helper function to set the tempo
value, which guarantees that the result will never be TICK_ETERNITY.

It must be backported everywhere 9561b9fb6 was backported (up to 2.6
it seems).
BUG/MEDIUM: connections: Only avoid creating a mux if we have one
In connect_server(), only avoid creating a mux when we're reusing a
connection that already has one. We can reuse a connection with no mux
if we made a first attempt at connecting to the server and it failed
before we could create the mux (or during the mux creation). The
connection will then be reused when trying again.
This fixes a bug where a stream could stall if the first connection
attempt failed before the mux creation. It is easy to reproduce by
creating random memory allocation failure with -dmFail.
This was introduced by commit 4aaf0bfbced22d706af08725f977dcce9845d340,
and thus does not need any backport as long as that commit is not
backported.
Released version 3.3-dev9 with the following main changes :
- BUG/MINOR: acl: Fix error message about several '-m' parameters
- MINOR: server: Parse sni and pool-conn-name expressions in a dedicated function
- BUG/MEDIUM: server: Use sni as pool connection name for SSL server only
- BUG/MINOR: server: Update healthcheck when server settings are changed via CLI
- OPTIM: backend: Don't set SNI for non-ssl connections
- OPTIM: proto_rhttp: Don't set SNI for non-ssl connections
- OPTIM: tcpcheck: Don't set SNI and ALPN for non-ssl connections
- BUG/MINOR: tcpcheck: Don't use sni as pool-conn-name for non-SSL connections
- MEDIUM: server/ssl: Base the SNI value on the HTTP host header by default
- MEDIUM: httpcheck/ssl: Base the SNI value on the HTTP host header by default
- OPTIM: tcpcheck: Reorder tcpcheck_connect structure fields to fill holes
- REGTESTS: ssl: Add a script to test the automatic SNI selection
- MINOR: quic: add useful trace about padding params values
- BUG/MINOR: quic: too short PADDING frame for too short packets
- BUG/MINOR: cpu_topo: work around a small bug in musl's CPU_ISSET()
- BUG/MEDIUM: ssl: Properly initialize msg_controllen.
- MINOR: quic: SSL session reuse for QUIC
- BUG/MEDIUM: proxy: fix crash with stop_proxy() called during init
- MINOR: stats-file: use explicit unsigned integer bitshift for user slots
- CLEANUP: quic: fix typo in quic_tx trace
- TESTS: quic: add unit-tests for QUIC TX part
- MINOR: quic: restore QUIC_HP_SAMPLE_LEN constant
- REGTESTS: ssl: Fix the script about automatic SNI selection
- BUG/MINOR: pools: Fix the dump of pools info to deal with buffers limitations
- MINOR: pools: Don't dump anymore info about pools when purge is forced
- BUG/MINOR: quic: properly support GSO on backend side
- BUG/MEDIUM: mux-h2: Reset MUX blocking flags when a send error is caught
- BUG/MEDIUM: mux-h2: Don't block receives in H2_CS_ERROR and H2_CS_ERROR2 states
- BUG/MEDIUM: mux-h2: Restart reading when mbuf ring is no longer full
- BUG/MINOR: mux-h2: Remove H2_CF_DEM_DFULL flags when the demux buffer is reset
- BUG/MEDIUM: mux-h2: Report RST/error to app-layer stream during 0-copy fwding
- BUG/MEDIUM: mux-h2: Reinforce conditions to report an error to app-layer stream
- BUG/MINOR: hq-interop: adjust parsing/encoding on backend side
- OPTIM: check: do not delay MUX for ALPN if SSL not active
- BUG/MEDIUM: checks: fix ALPN inheritance from server
- BUG/MINOR: check: ensure checks are compatible with QUIC servers
- MINOR: check: reject invalid check config on a QUIC server
- MINOR: debug: report the process id in warnings and panics
- DEBUG: stream: count the number of passes in the connect loop
- MINOR: debug: report the number of loops and ctxsw for each thread
- MINOR: debug: report the time since last wakeup and call
- DEBUG: peers: export functions that use locks
- MINOR: stick-table: permit stksess_new() to temporarily allocate more entries
- MEDIUM: stick-tables: relax stktable_trash_oldest() to only purge what is needed
- MEDIUM: stick-tables: give up on lock contention in process_table_expire()
- MEDIUM: stick-tables: don't wait indefinitely in stktable_add_pend_updates()
- MEDIUM: peers: don't even try to process updates under contention
- BUG/MEDIUM: h1: Allow reception if we have early data
- BUG/MEDIUM: ssl: create the mux immediately on early data
- MINOR: ssl: Add a flag to let it be known we have an ALPN negotiated
- MINOR: ssl: Use the new flag to know when the ALPN has been set.
- MEDIUM: server: Introduce the concept of path parameters
- CLEANUP: backend: clarify the role of the init_mux variable in connect_server()
- CLEANUP: backend: invert the condition to start the mux in connect_server()
- CLEANUP: backend: simplify the complex ifdef related to 0RTT in connect_server()
- CLEANUP: backend: clarify the cases where we want to use early data
- MEDIUM: server: Make use of the ALPN stored in the server
- BUILD: ssl: address a recent build warning when QUIC is enabled
- BUG/MINOR: activity: fix reporting of task latency
- MINOR: activity: indicate the number of calls on "show tasks"
- MINOR: tools: don't emit "+0" for symbol names which exactly match known ones
- BUG/MEDIUM: stick-tables: don't loop on non-expirable entries
- DEBUG: stick-tables: export stktable_add_pend_updates() for better reporting
- BUG/MEDIUM: ssl: Fix a crash when using QUIC
- BUG/MEDIUM: ssl: Fix a crash if we failed to create the mux
- MEDIUM: dns: bind the nameserver sockets to the initiating thread
- MEDIUM: resolvers: make the process_resolvers() task single-threaded
- BUG/MINOR: stick-table: make sure never to miss a process_table_expire update
- MEDIUM: stick-table: move process_table_expire() to a single thread
- MEDIUM: peers: move process_peer_sync() to a single thread
- BUG/MAJOR: stream: Force channel analysis on successful synchronous send
- MINOR: quic: get rid of ->target quic_conn struct member
- MINOR: quic-be: make SSL/QUIC objects use their own indexes (ssl_qc_app_data_index)
- MINOR: quic: display build warning for compat layer on recent OpenSSL
- DOC: quic: clarifies limited-quic support
- BUG/MINOR: acme: null pointer dereference upon allocation failure
- BUG/MEDIUM: jws: return size_t in JWS functions
- BUG/MINOR: ssl: Potential NULL deref in trace macro
- BUG/MINOR: ssl: Fix potential NULL deref in trace callback
- BUG/MINOR: ocsp: prototype inconsistency
- MINOR: ocsp: put internal functions as static ones
- MINOR: ssl: set functions as static when no prototypes in the .h
- BUILD: ssl: functions defined but not used
- BUG/MEDIUM: resolvers: Properly cache do-resolv resolution
- BUG/MINOR: resolvers: Restore round-robin selection on records in DNS answers
- MINOR: activity: don't report the lat_tot column for show profiling tasks
- MINOR: activity: add a new lkw_avg column to show profiling stats
- MINOR: activity: collect time spent waiting on a lock for each task
- MINOR: thread: add a lock level information in the thread_ctx
- MINOR: activity: add a new lkd_avg column to show profiling stats
- MINOR: activity: collect time spent with a lock held for each task
- MINOR: activity: add a new mem_avg column to show profiling stats
- MINOR: activity: collect CPU time spent on memory allocations for each task
- MINOR: activity/memory: count allocations performed under a lock
- DOC: proxy-protocol: Add TLS group and sig scheme TLVs
- BUG/MEDIUM: resolvers: Test for empty tree when getting a record from DNS answer
- BUG/MEDIUM: resolvers: Make resolution owns its hostname_dn value
- BUG/MEDIUM: resolvers: Accept to create resolution without hostname
- BUG/MEDIUM: resolvers: Wake resolver task up when unlinking a stream requester
- BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
- Revert "BUG/MINOR: ocsp: Crash when updating CA during ocsp updates"
- BUG/MEDIUM: http_ana: fix potential NULL deref in http_process_req_common()
- MEDIUM: log/proxy: store log-steps selection using a bitmask, not an eb tree
- BUG/MINOR: ocsp: Crash when updating CA during ocsp updates
- BUG/MINOR: resolvers: always normalize FQDN from response
- BUILD: makefile: implement support for running a command in range
- IMPORT: cebtree: import version 0.5.0 to support duplicates
- MEDIUM: migrate the patterns reference to cebs_tree
- MEDIUM: guid: switch guid to more compact cebuis_tree
- MEDIUM: server: switch addr_node to cebis_tree
- MEDIUM: server: switch conf.name to cebis_tree
- MEDIUM: server: switch the host_dn member to cebis_tree
- MEDIUM: proxy: switch conf.name to cebis_tree
- MEDIUM: stktable: index table names using compact trees
- MINOR: proxy: add proxy_get_next_id() to find next free proxy ID
- MINOR: listener: add listener_get_next_id() to find next free listener ID
- MINOR: server: add server_get_next_id() to find next free server ID
- CLEANUP: server: use server_find_by_id() when looking for already used IDs
- MINOR: server: add server_index_id() to index a server by its ID
- MINOR: listener: add listener_index_id() to index a listener by its ID
- MINOR: proxy: add proxy_index_id() to index a proxy by its ID
- MEDIUM: proxy: index proxy ID using compact trees
- MEDIUM: listener: index listener ID using compact trees
- MEDIUM: server: index server ID using compact trees
- CLEANUP: server: slightly reorder fields in the struct to plug holes
- CLEANUP: proxy: slightly reorganize fields to plug some holes
- CLEANUP: backend: factor the connection lookup loop
- CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
- MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
- CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
- MINOR: connection: pass the thread number to conn_delete_from_tree()
- MEDIUM: connection: move idle connection trees to ceb64
- MEDIUM: connection: reintegrate conn_hash_node into connection
- CLEANUP: tools: use the item API for the file names tree
- CLEANUP: vars: use the item API for the variables trees
- BUG/MEDIUM: pattern: fix possible infinite loops on deletion
- CI: scripts: add support for git in openssl builds
- CI: github: add an OpenSSL + ECH job
- CI: scripts: mkdir BUILDSSL_TMPDIR
- Revert "BUG/MEDIUM: pattern: fix possible infinite loops on deletion"
- BUG/MEDIUM: pattern: fix possible infinite loops on deletion (try 2)
- CLEANUP: log: remove deadcode in px_parse_log_steps()
- MINOR: counters: document that tg shared counters are tied to shm-stats-file mapping
- DOC: internals: document the shm-stats-file format/mapping
- IMPORT: ebtree: delete unusable ebpttree.c
- IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
- IMPORT: eb32/eb64: use a more parallelizable check for lack of common bits
- IMPORT: eb32: drop the now useless node_bit variable
- IMPORT: eb32/eb64: place an unlikely() on the leaf test
- IMPORT: ebmb: optimize the lookup for modern CPUs
- IMPORT: eb32/64: optimize insert for modern CPUs
- IMPORT: ebtree: only use __builtin_prefetch() when supported
- IMPORT: ebst: use prefetching in lookup() and insert()
- IMPORT: ebtree: Fix UB from clz(0)
- IMPORT: ebtree: add a definition of offsetof()
- IMPORT: ebtree: replace hand-rolled offsetof to avoid UB
- MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller
- MINOR: server: add the "cc" keyword to set the TCP congestion controller
- BUG/MEDIUM: ring: invert the length check to avoid an int overflow
- MINOR: trace: don't call strlen() on the thread-id numeric encoding
- MINOR: trace: don't call strlen() on the function's name
- OPTIM: sink: reduce contention on sink_announce_dropped()
- OPTIM: sink: don't waste time calling sink_announce_dropped() if busy
- CLEANUP: ring: rearrange the wait loop in ring_write()
- OPTIM: ring: always relax in the ring lock and leader wait loop
- OPTIM: ring: check the queue's owner using a CAS on x86
- OPTIM: ring: avoid reloading the tail_ofs value before the CAS in ring_write()
- BUG/MEDIUM: sink: fix unexpected double postinit of sink backend
- MEDIUM: stats: consider that shared stats pointers may be NULL
- BUG/MEDIUM: http-client: Fix the test on the response start-line
- MINOR: acme: acme-vars allow to pass data to the dpapi sink
- MINOR: acme: check acme-vars allocation during escaping
- BUG/MINOR: acme/cli: wrong description for "acme challenge_ready"
- CI: move VTest preparation & friends to dedicated composite action
- BUG/MEDIUM: stick-tables: Don't let table_process_entry() handle refcnt
- BUG/MINOR: compression: Test payload size only if content-length is specified
- BUG/MINOR: pattern: Properly flag virtual maps as using samples
- BUG/MINOR: acme: possible overflow on scheduling computation
- BUG/MINOR: acme: possible overflow in acme_will_expire()
- CLEANUP: acme: acme_will_expire() uses acme_schedule_date()
- BUG/MINOR: pattern: Fix pattern lookup for map with opt@ prefix
- CI: scripts: build curl with ECH support
- CI: github: add curl+ech build into openssl-ech job
- BUG/MEDIUM: ssl: ca-file directory mode must read every certificates of a file
- MINOR: acme: provider-name for dpapi sink
- BUILD: acme: fix false positive null pointer dereference
- MINOR: backend: srv_queue helper
- MINOR: backend: srv_is_up converter
- BUILD: halog: misleading indentation in halog.c
- CI: github: build halog on the vtest job
- BUG/MINOR: acme: don't unlink from acme_ctx_destroy()
- BUG/MEDIUM: acme: cfg_postsection_acme() don't init correctly acme sections
- MINOR: acme: implement "reuse-key" option
- ADMIN: haproxy-dump-certs: implement a certificate dumper
- ADMIN: dump-certs: don't update the file if it's up to date
- ADMIN: dump-certs: create files in a tmpdir
- ADMIN: dump-certs: fix lack of / in -p
- ADMIN: dump-certs: use same error format as haproxy
- ADMIN: reload: add a synchronous reload helper
- BUG/MEDIUM: acme: free() of i2d_X509_REQ() with AWS-LC
- ADMIN: reload: introduce verbose and silent mode
- ADMIN: reload: introduce -vv mode
- MINOR: mt_list: Implement MT_LIST_POP_LOCKED()
- BUG/MEDIUM: stick-tables: Make sure not to free a pending entry
- MINOR: sched: let's permit to share the local ctx between threads
- MINOR: sched: pass the thread number to is_sched_alive()
- BUG/MEDIUM: wdt: improve stuck task detection accuracy
- MINOR: ssl: add the ssl_bc_sni sample fetch function to retrieve backend SNI
- MINOR: rawsock: introduce CO_RFL_TRY_HARDER to detect closures on complete reads
- MEDIUM: ssl: don't always process pending handshakes on closed connections
- MEDIUM: servers: Schedule the server requeue target on creation
- MEDIUM: fwlc: Make it so fwlc_srv_reposition works with unqueued srv
- BUG/MEDIUM: fwlc: Handle memory allocation failures.
- DOC: config: clarify some known limitations of the json_query() converter
- BUG/CRITICAL: mjson: fix possible DoS when parsing numbers
- BUG/MINOR: h2: forbid 'Z' as well in header field names checks
- BUG/MINOR: h3: forbid 'Z' as well in header field names checks
- BUG/MEDIUM: resolvers: break an infinite loop in resolv_get_ip_from_response()
Willy Tarreau [Fri, 3 Oct 2025 07:00:13 +0000 (09:00 +0200)]
BUG/MEDIUM: resolvers: break an infinite loop in resolv_get_ip_from_response()
The fix in 3023e98199 ("BUG/MINOR: resolvers: Restore round-robin
selection on records in DNS answers") still contained an issue not
addressed by f6dfbbe870 ("BUG/MEDIUM: resolvers: Test for empty tree
when getting a record from DNS answer"). Indeed, if the next element
is the same as the first one, we can end up with an endless loop,
because the test at the end compares the next pointer (possibly NULL)
with the end one (first).

Let's move the null->first transition to the end. This must be
backported wherever the patches above were backported (3.2 for now).
BUG/MINOR: h3: forbid 'Z' as well in header field names checks
The current tests in _h3_handle_hdr() and h3_trailers_to_htx() check
for an interval between 'A' and 'Z' for letters in header field names
that should be forbidden, but mistakenly leave the 'Z' out of the
forbidden range, resulting in it being implicitly valid.
This has no real consequences but should be fixed for the sake of
protocol validity checking.
BUG/MINOR: h2: forbid 'Z' as well in header field names checks
The current tests in h2_make_htx_request(), h2_make_htx_response()
and h2_make_htx_trailers() check for an interval between 'A' and 'Z'
for letters in header field names that should be forbidden, but
mistakenly leave the 'Z' out of the forbidden range, resulting in it
being implicitly valid.
This has no real consequences but should be fixed for the sake of
protocol validity checking.
BUG/CRITICAL: mjson: fix possible DoS when parsing numbers
Mjson comes with its own strtod() implementation for portability
reasons and probably also because many generic strtod() versions as
provided by operating systems do not focus on resource preservation
and may call malloc(), which is not welcome in a parser.
The strtod() implementation used here apparently originally comes from
https://gist.github.com/mattn/1890186 and seems to have purposely
omitted a few parts that were considered as not needed in this context
(e.g. skipping white spaces, or setting errno). But when subject to the
relevant test cases of the designated file above, the current function
provides the same results.
The aforementioned implementation uses pow() to calculate exponents,
but mjson authors visibly preferred not to introduce a libm dependency
and replaced it with an iterative loop in O(exp) time. The problem is
that the exponent is not bounded and that this loop can take a huge
amount of time. There's even an issue already opened on mjson about
this: https://github.com/cesanta/mjson/issues/59. In the case of
haproxy, fortunately, the watchdog will quickly stop a runaway process
but this remains a possible denial of service.
A first approach would consist in reintroducing pow() like in the
original implementation, but if haproxy is built without Lua nor
51Degrees, -lm is not used so this will not work everywhere.
Anyway here we're dealing with integer exponents, so an easy alternate
approach consists in simply using shifts and squares, to compute the
exponent in O(log(exp)) time. Not only does it avoid introducing any
new dependency, but it turns out to be even faster than the generic pow()
(85k req/s per core vs 83.5k on the same machine).
This must be backported as far as 2.4, where mjson was introduced.
Many thanks to Oula Kivalo for reporting this issue.
Willy Tarreau [Thu, 2 Oct 2025 02:52:33 +0000 (04:52 +0200)]
DOC: config: clarify some known limitations of the json_query() converter
Oula Kivalo reported that different JSON libraries may process duplicate
keys differently and that most JSON libraries usually decode the stream
before extracting keys, while the current mjson implementation decodes the
contents during extraction instead. Let's document this point so that
users are aware of the limitations, do not rely on the current behavior,
and do not use it for what it's not made for (e.g. content sanitization).
This is also the case for jwt_header_query(), jwt_payload_query() and
jwt_verify(), which already refer to this converter for specificities.
BUG/MEDIUM: fwlc: Handle memory allocation failures.
Properly handle memory allocation failures, by checking the return value
for pool_alloc(), and if it fails, make sure that the caller will take
it into account.
The only use of pool_alloc() in fwlc is to allocate the tree elements
used to queue the server into the ebtree, so if that allocation fails,
just schedule the requeue tasklet, which will try again until it
hopefully eventually succeeds.
This should be backported to 3.2.
This should fix github issue #3143.
MEDIUM: fwlc: Make it so fwlc_srv_reposition works with unqueued srv
Modify fwlc_srv_reposition() so that it no longer assumes the server
is already queued, and thus works even if s->tree_elt is NULL.

While the server will usually be queued, there is an unlikely
possibility that, when the server attempted to get queued as it came
up, it failed due to a memory allocation failure and now expects the
server_requeue tasklet to run to take care of that later.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
MEDIUM: servers: Schedule the server requeue target on creation
On creation, schedule the server requeue tasklet once the server has
been created. It is possible that when the server went up, it tried to
queue itself into the lb-specific code, failed to do so, and expects
the tasklet to run to take care of that.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
MEDIUM: ssl: don't always process pending handshakes on closed connections
If a client aborts a pending SSL connection for whatever reason (timeout
etc) and the listen queue is large, it may inflict a severe load to a
frontend which will spend the CPU creating new sessions then killing the
connection. This is similar to HTTP requests aborted just after being
sent, except that asymmetric crypto is way more expensive.
Unfortunately "option abortonclose" has no effect on this, because it
only applies at a higher level.
This patch ensures that handshakes being received on a frontend having
"option abortonclose" set will be checked for a pending close, and if
this is the case, then the connection will be aborted before the heavy
calculations. The principle is to use recv(MSG_PEEK) to detect the end,
and to destroy the pending handshake data before returning to the SSL
library so that it cannot start computing, notices the error and stops.
We don't do it without abortonclose though, because this can be used for
health checks from other haproxy nodes or even other components which
just want to see a handshake succeed.
MINOR: rawsock: introduce CO_RFL_TRY_HARDER to detect closures on complete reads
Normally, when reading a full buffer, or exactly the requested size, it
is not really possible to know if the peer had closed immediately after,
and usually we don't care. There's a problematic case, though, which is
with SSL: the SSL layer reads in small chunks of a few bytes, and can
consume a client_hello this way, then start computation without knowing
yet that the client has aborted. In order to permit knowing more, we now
introduce a new read flag, CO_RFL_TRY_HARDER, which says that if we've
read up to the permitted limit and the flag is set, then we attempt one
extra byte using MSG_PEEK to detect whether the connection was closed
immediately after that content or not. The first use case will obviously
be related to SSL and client_hello, but it might possibly also make sense
on HTTP responses to detect a pending FIN at the end of a response (e.g.
if a close was already advertised).
MINOR: ssl: add the ssl_bc_sni sample fetch function to retrieve backend SNI
Sometimes in order to debug certain difficult situations it can be useful
to know what SNI was configured on a connection going to a server, for
example to match it against what the server saw or to detect cases where
a server would route on SNI instead of Host. This sample fetch function
simply retrieves the SNI configured on the backend connection, if any.
BUG/MEDIUM: wdt: improve stuck task detection accuracy
The fact that the watchdog timer measures the execution time from the
last return from the poller tends to amplify the impact of multiple
bad tasks, and may explain some of the panics reported by Felipe and
Ricardo in GH issues #3084, #3092 and #3101. The problem is that we
check the time if we see that the scheduler appears not to be moving
anymore, but one situation may still arise and catch a bad task:
- one slow task takes so long a time that it triggers the watchdog
twice, emitting a warning the second time (~200ms). The scheduler
is rightfully marked as stuck.
- then it completes and the scheduler is no longer stuck. Many other
tasks run in turn, they all take quite some time but not enough to
trigger a warning. But collectively their cost adds up.
- then a task takes more than the warning time (100ms), and causes
the total execution time to cross one second. The watchdog is
called, sees that we've spent more than 1 second since we left the
poller, and marks the thread as stuck.
- the task is not finished, the watchdog is called again, sees more
than one second with a stuck thread and panics 100ms later.
The total time away from the poller is indeed more than one second,
which is very bad, but no single task caused this individually, and
while the warnings are OK, the watchdog should not panic in this case.
This patch revisits the approach to store the moment the scheduler was
marked as stuck in the wdt context. The idea is that this date will be
used to detect warnings and panics. And by doing so and exploiting the
new is_sched_alive(thr), we can greatly simplify the mechanism so that
the signal handling thread does the strict minimum (mark the scheduler
as possibly stuck and update the stuck_start date), and only bounces to
the reporting thread if the scheduler made no progress since last call.
This means that without even doing computations in the handling thread,
we can continue to avoid all bounces unless a warning is required. Then
when the reporting thread is signaled, it will check the dates from the
last moment the scheduler was marked, and will decide to warn or panic.
The panic decision continues to pass via a TH_FL_STUCK flag to probe the
code so that exceptionally slow code (e.g. live cert generation etc) can
still find a way to avoid the panic if absolutely certain that things
are still moving.
This means that now we have the guarantee that panics will only happen
if a given task spends more than one full second not moving, and that
warnings will be issued for other calls crossing the warn delay boundary.
This was tested using artificially slow operations, and all combinations
which individually took less than a second only resulted in floods of
warnings even if the total reported time in the warning was much higher,
while those above one second provoked the panic.
One improvement could consist in reporting the time since last stuck
in the thread dumps to differentiate the individual task from the whole
set.
This needs to be backported to 3.2 along with the two previous patches:
MINOR: sched: let's permit to share the local ctx between threads
MINOR: sched: pass the thread number to is_sched_alive()
Willy Tarreau [Wed, 1 Oct 2025 05:58:01 +0000 (07:58 +0200)]
MINOR: sched: pass the thread number to is_sched_alive()
Now it will be possible to query any thread's scheduler state, not
only the current one. This aims at simplifying the watchdog checks
for reported threads. The operation is now a simple atomic xchg.
MINOR: sched: let's permit to share the local ctx between threads
The watchdog timer has to go through complex operations due to not being
able to check if another thread's scheduler is still ticking. This is
simply because the scheduler status is marked as thread-local while it
could in fact also be an array. Let's do that (and align the array to
avoid false sharing) so that it's now possible to check any scheduler's
status.
BUG/MEDIUM: stick-tables: Make sure not to free a pending entry
There is a race condition, an entry can be free'd by stksess_kill()
between the time stktable_add_pend_updates() gets the entry from the
mt_list, and the time it adds it to the ebtree.
To prevent this, use the newly implemented MT_LIST_POP_LOCKED() to keep
the stksess locked until it is added to the tree. That way,
__stksess_kill() will wait until we're done with it.
BUG/MEDIUM: acme: free() of i2d_X509_REQ() with AWS-LC
When using AWS-LC, the free() of the data ptr resulting from
i2d_X509_REQ() might crash, because it uses the free() of the libc
instead of OPENSSL_free().
It does not seem to be a problem on openssl builds.
ADMIN: dump-certs: use same error format as haproxy
Replace error/notice by [ALERT]/[WARNING]/[NOTICE] like it's done in
haproxy.
ALERT means a failure, and the program will exit with status 1 just
after it. WARNING and NOTICE let the program continue its execution.
ADMIN: haproxy-dump-certs: implement a certificate dumper
haproxy-dump-certs is a bash script that connects to your master socket
or your stats socket in order to dump certificates from haproxy memory
to the corresponding files.
BUG/MEDIUM: acme: cfg_postsection_acme() don't init correctly acme sections
The cfg_postsection_acme() function redefines its own cur_acme
variable, pointing to the first acme section created. This means the
first section would be initialized multiple times, and the following
sections would never be initialized.

It could result in crashes at the first use of any section that is not
the first one.
BUG/MINOR: acme: don't unlink from acme_ctx_destroy()
Unlinking the acme_ctx element from acme_ctx_destroy() requires the
element to be unlocked, because MT_LIST_DELETE() locks it.
acme_ctx_destroy() frees the data from acme_ctx with the ctx still
linked and unlocked, then locks it to unlink. So there's a small risk
of accessing acme_ctx from somewhere else; the only way to do that
would be to use the `acme challenge_ready` CLI command at the same time.
Fix the issue by doing a mt_list_unlock_link() and a
mt_list_unlock_self() to unlink the element under the lock, then destroy
the element.
BUILD: halog: misleading indentation in halog.c
admin/halog/halog.c: In function 'filter_count_url':
admin/halog/halog.c:1685:9: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
1685 | if (unlikely(!ustat))
| ^~
admin/halog/halog.c:1687:17: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
1687 | if (unlikely(!ustat)) {
| ^~
This patch fixes the indentation.
Must be backported where fbd0fb20a22 ("BUG/MINOR: halog: Add OOM checks
for calloc() in filter_count_srv_status() and filter_count_url()") was
backported.
Chris Staite [Wed, 24 Sep 2025 21:21:43 +0000 (22:21 +0100)]
MINOR: backend: srv_is_up converter
There is currently an srv_queue converter which is capable of taking the
output of a dynamic name and determining the queue length for a given
server. In addition there is a sample fetcher for whether a server is
currently up. This simply combines the two such that srv_is_up can be
used as a converter too.
Future work might extend this to other sample fetchers for servers, but
this is probably the most useful for acl routing.
BUILD: acme: fix false positive null pointer dereference
src/acme.c: In function ‘cfg_parse_acme_vars_provider’:
src/acme.c:471:9: error: potential null pointer dereference [-Werror=null-dereference]
471 | free(*dst);
| ^~~~~~~~~~
gcc 13 on ubuntu 24.04 reports a false positive when building 3e72a9f
("MINOR: acme: provider-name for dpapi sink").

Indeed, dst can't be NULL. Clarify the code so gcc doesn't complain
anymore.
BUG/MEDIUM: ssl: ca-file directory mode must read every certificates of a file
The httpclient is configured with @system-ca by default, which uses the
directory returned by X509_get_default_cert_dir().
On debian/ubuntu systems, this directory contains multiple certificate
files that are loaded successfully. However, it seems that on other
systems the files in this directory are the direct result of
ca-certificates instead of its source, meaning that you would only
have a bundle file with every certificate in it.

The loading was not done correctly in the directory case, and only the
first certificate of each file was loaded.
This patch fixes the issue by using X509_STORE_load_locations() on each
file from the scandir instead of trying to load it manually with BIO.
Note that we can't use X509_STORE_load_locations with the `dir` argument,
which would be simpler, because it uses X509_LOOKUP_hash_dir() which
requires a directory in hash form. That wouldn't be suited for this use
case.
Add a script to build curl with ECH support. To specify the path of the
openssl+ECH library, set the SSL_LIB variable to the prefix of the
library.
BUG/MINOR: pattern: Fix pattern lookup for map with opt@ prefix
When we look for a map file reference, the file@ prefix is removed because
it may be omitted. The same is true for the opt@ prefix. However this case
was not properly handled in pat_ref_lookup(). Let's do so.
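The idea can be sketched as follows (a hedged illustration with made-up names, not HAProxy's actual pat_ref_lookup() code): both optional prefixes must be stripped before the reference name is compared.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: strip the optional "file@" or "opt@" prefix
 * from a map reference before comparing names, as the lookup must
 * now do for both prefixes.
 */
static const char *skip_ref_prefix(const char *ref)
{
    if (strncmp(ref, "file@", 5) == 0)
        return ref + 5;
    if (strncmp(ref, "opt@", 4) == 0)
        return ref + 4;
    return ref;
}
```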
The date computation in acme_will_expire() and acme_schedule_date() is
the same. Call acme_schedule_date() from acme_will_expire() and make both
functions static. The patch also moves the functions in the right
order.
BUG/MINOR: acme: possible overflow in acme_will_expire()
acme_will_expire() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.
This is unlikely to happen and would mean an incorrect certificate.
This patch fixes the issue by checking that notAfter > notBefore.
It also replaces the int type by a time_t to avoid an overflow on 64-bit
architectures, which is also unlikely to happen with certificates.
`(date.tv_sec + diff > notAfter)` was also replaced by `if (notAfter -
diff <= date.tv_sec)` to avoid an overflow.
BUG/MINOR: acme: possible overflow on scheduling computation
acme_schedule_date() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.
This is unlikely to happen and would mean an incorrect certificate.
This patch fixes the issue by checking that notAfter > notBefore.
It also replaces the int type by a time_t to avoid an overflow on 64-bit
architectures, which is also unlikely to happen with certificates.
BUG/MINOR: pattern: Properly flag virtual maps as using samples
When a map file is loaded, the pattern reference is internally flagged as
based on a sample. However this is not done for virtual maps. This flag
is only used during startup to check the map compatibility when it is used
at different places. At runtime this does not change anything. But errors
can be triggered during configuration parsing. For instance, the following
valid config will trigger an error:
http-request set-map(virt@test) foo bar if !{ str(foo),map(virt@test) -m found }
http-request set-var(txn.foo) str(foo),map(virt@test)
The fix is quite obvious: the PAT_REF_SMP flag must be set for virtual maps
as for any other map.
A workaround is to use an optional map (opt@...), after making sure the map
id does not reference an existing file.
BUG/MINOR: compression: Test payload size only if content-length is specified
When a minimum size is defined to perform the compression, the message
payload size is tested. To do so, information from the HTX message is used
to determine the message length. However this is performed regardless of
whether the payload length is fully known or not. Concretely, the test must
only be performed when a content-length value was specified or when the
message was fully received (EOM flag set). Otherwise, we are unable to
determine the real payload length.
Because of this bug, compression may be skipped for a large chunked message
because the first chunks received are too small. But this does not mean the
whole message is small.
BUG/MEDIUM: stick-tables: Don't let table_process_entry() handle refcnt
Instead of having table_process_entry() decrement the session's ref
counter, do it outside, from the caller. Some were missed, such as when
an action was invalid, which would lead to the ref counter not being
decremented, and the session not being destroyable.
It makes more sense to do that from the caller, who just obtained the
ref counter, anyway.
This should be backported up to 2.8.
MINOR: acme: check acme-vars allocation during escaping
Handle allocations properly during acme-vars parsing.
Check for an allocation failure in both the malloc and the
realloc and emit an error if that's the case.
MINOR: acme: acme-vars allow to pass data to the dpapi sink
In the case of the dns-01 challenge, the agent that handles the
challenge might need some extra information which depends on the DNS
provider.
This patch introduces the "acme-vars" option in the acme section, which
allows passing this data to the dpapi sink. The double quotes are
escaped when printed in the sink.
Example:
global
setenv VAR1 'foobar"toto"'
acme LE
directory https://acme-staging-v02.api.letsencrypt.org/directory
challenge DNS-01
acme-vars "var1=${VAR1},var2=var2"
BUG/MEDIUM: http-client: Fix the test on the response start-line
The commit 88aa7a780 ("MINOR: http-client: Trigger an error if first
response block isn't a start-line") introduced a bug. From an endpoint, an
applet or a mux, the <first> index must never be used. It is reserved to the
HTTP analyzers. From endpoint, this value may be undefined or just point on
any other block that the first one. Instead we must always get the head
block.
In taht case, to be sure the first HTX block in a response is a start-line,
we must use htx_get_head_type() function instead of htx_get_first_type().
Otherwise, we can trigger an error while the response is in fact properly
formatted.
MEDIUM: stats: consider that shared stats pointers may be NULL
This patch looks huge, but it has a very simple goal: protect all
accesses to shared stats pointers (either reads or writes), because
we now consider that these pointers may be NULL.
The reason behind this is that, despite all precautions taken to ensure
the pointers aren't NULL when not expected, there are still corner
cases (ie: frontend stats used on a backend with no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3, when shared stats were implemented).
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
BUG/MEDIUM: sink: fix unexpected double postinit of sink backend
Willy experienced an unexpected behavior with the config below:
global
stats socket :1514
ring buf1
server srv1 127.0.0.1:1514
Indeed, haproxy would connect to the ring server twice since commit 23e5f18b
("MEDIUM: sink: change the sink mode type to PR_MODE_SYSLOG"), and one of the
connection would report errors.
The reason behind this is that, despite the above commit saying no change
of behavior was expected, with the sink forward_px proxy now being set with
PR_MODE_SYSLOG, postcheck_log_backend() was automatically executed in
addition to the manual cfg_post_parse_ring() function for each "ring"
section. The consequence is that sink_finalize() was called twice for a
given "ring" section, which means the connection init was triggered twice,
which in turn resulted in the behavior described above, plus possible
unexpected side effects.
To fix the issue, when we create the forward_px proxy, we now set the
PR_CAP_INT capability on it to tell haproxy not to automatically manage the
proxy (ie: to skip the automatic log backend postinit), because we are about
to manually manage the proxy from the sink API.
OPTIM: ring: avoid reloading the tail_ofs value before the CAS in ring_write()
The load followed by the CAS seems to cause two bus cycles, one to
retrieve the cache line in shared state and a second one to get
exclusive ownership of it. Tests show that on x86 it's much better
to just rely on the previous value and preset it to zero before
entering the loop. We just mask the ring lock in case of failure
so as to challenge it on next iteration and that's done.
This little change brings 2.3% extra performance (11.34M msg/s) on
a 64-core AMD.
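The pattern can be sketched as follows (a hedged illustration: the lock-bit name and layout are made up here, not taken from ring.c):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative lock bit in the tail offset word */
#define TAIL_LOCK (1u << 31)

/* Instead of re-loading tail_ofs before each CAS (which costs an extra
 * bus cycle to fetch the line in shared state), start from a presumed
 * value of 0 and let the failed CAS refresh 'old'; only the lock bit is
 * masked before challenging it again on the next iteration.
 */
static unsigned lock_tail(_Atomic unsigned *tail_ofs)
{
    unsigned old = 0;  /* presume unlocked, offset 0: no initial load */

    while (!atomic_compare_exchange_weak(tail_ofs, &old, old | TAIL_LOCK))
        old &= ~TAIL_LOCK;  /* mask the lock, challenge it next round */
    return old;
}
```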
OPTIM: ring: check the queue's owner using a CAS on x86
In the loop where the queue's leader tries to get the tail lock,
we also need to check if another thread took ownership of the queue
the current thread is currently working for. This is currently done
using an atomic load.
Tests show that on x86, using a CAS for this is much more efficient
because it allows to keep the cache line in exclusive state for a
few more cycles that permit the queue release call after the loop
to be done without having to wait again. The measured gain is +5%
for 128 threads on a 64-core AMD system (11.08M msg/s vs 10.56M).
However, ARM loses about 1% on this, and we cannot afford that on
machines without a fast CAS anyway, so the load is performed using
a CAS only on x86_64. It might not be as efficient on low-end models
but we don't care since they are not the ones dealing with high
contention.
OPTIM: ring: always relax in the ring lock and leader wait loop
Tests have shown that AMD systems really need to use a cpu_relax()
in these two loops. The performance improves from 10.03 to 10.56M
messages per second (+5%) on a 128-thread system, without affecting
intel nor ARM, so let's do this.
CLEANUP: ring: rearrange the wait loop in ring_write()
The loop is constructed in a complicated way with a single break
statement in the middle and many continue statements everywhere,
making it hard to better factor between variants. Let's first
reorganize it so as to make it easier to escape when the ring
tail lock is obtained. The sequence of instructions remains the
same, it's only better organized.
OPTIM: sink: don't waste time calling sink_announce_dropped() if busy
If we see that another thread is already busy trying to announce the
dropped counter, there's no point going there, so let's just skip all
that operation from sink_write() and avoid disturbing the other thread.
This results in a boost from 244 to 262k req/s.
OPTIM: sink: reduce contention on sink_announce_dropped()
perf top shows that sink_announce_dropped() consumes most of the CPU
on a 128-thread x86 system. Digging further reveals that the atomic
fetch_or() on the dropped field used to detect the presence of another
thread is entirely responsible for this. Indeed, the compiler implements
it using a CAS that loops without relaxing and makes all threads wait
until they can synchronize on this one, only to discover later that
another thread is there and they need to give up.
Let's just replace this with a hand-crafted CAS loop that will detect
*before* attempting the CAS if another thread is there. Doing so
achieves the same goal without forcing threads to agree. With this
simple change, the sustained request rate on h1 with all traces on
bumped from 110k/s to 244k/s!
This should be backported to stable releases where it's often needed
to help debugging.
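The hand-crafted loop can be sketched like this (a hedged illustration: the flag name and function shape are assumptions, not the actual sink.c code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative "another thread is announcing" flag in the dropped word */
#define DROPPED_BUSY (1u << 31)

/* Check for the busy flag *before* attempting the CAS, so threads that
 * cannot win give up without fighting for exclusive cache-line
 * ownership. Returns 1 if we took ownership, 0 if another thread has it.
 */
static int try_own_dropped(_Atomic unsigned *dropped)
{
    unsigned curr = atomic_load_explicit(dropped, memory_order_relaxed);

    do {
        if (curr & DROPPED_BUSY)
            return 0;  /* another thread is already announcing: give up */
    } while (!atomic_compare_exchange_weak(dropped, &curr, curr | DROPPED_BUSY));
    return 1;
}
```

Unlike a fetch_or(), a thread that sees the flag already set never issues a write attempt at all, so it doesn't force the other threads to re-synchronize on the line.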
MINOR: trace: don't call strlen() on the function's name
Currently there's a small mistake in the way the trace function and
macros work. The calling function name is known as a constant down to the
macro and passed as-is to the __trace() function. That one needs to
know its length and will call ist() on it, resulting in a real call
to strlen() while that length was known before the call. Let's use
an ist instead of a const char* for __trace() and __trace_enabled()
so that we can now completely avoid calling strlen() during this
operation. This has significantly reduced the importance of
__trace_enabled() in perf top.
MINOR: trace: don't call strlen() on the thread-id numeric encoding
In __trace(), we're making an integer for the thread id but this one
is passed through strlen() in the call to ist() because it's not a
constant. We do know that it's exactly 3 chars long so we can manage
this using ist2() and pass it the length instead in order to reduce
the number of calls to strlen().
Also let's note that the thread number will no longer be numeric for
thread numbers above 100.
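The difference between the two constructors can be sketched with a minimal re-implementation of the ist API (hedged: HAProxy's real ist.h is much richer, this only illustrates the strlen() cost):

```c
#include <assert.h>
#include <string.h>

struct ist {
    const char *ptr;
    size_t len;
};

/* ist() must call strlen() when the argument isn't a literal */
static inline struct ist ist(const char *str)
{
    return (struct ist){ .ptr = str, .len = strlen(str) };
}

/* ist2() takes the already-known length and skips strlen() entirely */
static inline struct ist ist2(const void *ptr, size_t len)
{
    return (struct ist){ .ptr = ptr, .len = len };
}
```

Since the thread id is always rendered on exactly 3 characters, __trace() can build it with ist2(buf, 3) instead of ist(buf).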
BUG/MEDIUM: ring: invert the length check to avoid an int overflow
Vincent Gramer reported in GH issue #3125 a case of crash on a BUG_ON()
condition in the rings. What happens is that a message that is one byte
shorter than the maximum ring size is emitted, and it passes all the checks,
but once inflated by the extra +1 for the refcount, it no longer fits. But
the check was made based on message size compared to space left, except
that this space left can now be negative, which becomes a huge positive for
size_t, so the check remained valid and triggered a BUG_ON() later.
Let's compute the size the other way around instead (i.e. current +
needed) since we can't have rings as large as half of the memory space
anyway, thus we have no risk of overflow on this one.
This needs to be backported to all versions supporting multi-threaded
rings (3.0 and above).
Thanks to Vincent for the easy and working reproducer.
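The inverted check can be illustrated as follows (hedged: names and shape are made up, the real check lives in the ring code):

```c
#include <assert.h>
#include <stddef.h>

/* The "space left" form (needed <= size - used) underflows size_t when
 * 'used' already exceeds 'size', turning a negative value into a huge
 * positive one that wrongly passes the check. The additive form cannot
 * overflow for any realistic ring size (far below half the address
 * space), as the commit message explains.
 */
static int msg_fits(size_t used, size_t needed, size_t size)
{
    return used + needed <= size;
}
```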
MINOR: server: add the "cc" keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with incoming connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Permission issues
might be reported (as warnings).
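On Linux this presumably boils down to a TCP_CONGESTION setsockopt (hedged sketch: the commit messages don't show the implementation, and the silent-ignore policy below mirrors the described behavior for unknown or forbidden algorithms):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: request a congestion control algorithm on a TCP
 * socket; on failure (unknown or non-permitted algorithm) the system
 * default simply remains in use.
 */
static void set_cc(int fd, const char *algo)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        ; /* unknown/forbidden: keep the system default */
}
```

The list of permitted algorithms is controlled by net.ipv4.tcp_allowed_congestion_control on Linux, which is why permission issues can show up for non-privileged processes.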
Ben Kallus [Sat, 13 Sep 2025 12:26:39 +0000 (14:26 +0200)]
IMPORT: ebtree: replace hand-rolled offsetof to avoid UB
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
Note that there's no clear statement about this point in the spec,
only several points which together converge to this:
- From N3220, 6.5.3.4:
A postfix expression followed by the -> operator and an identifier
designates a member of a structure or union object. The value is
that of the named member of the object to which the first expression
points, and is an lvalue.
- From N3220, 6.3.2.1:
An lvalue is an expression (with an object type other than void) that
potentially designates an object; if an lvalue does not designate an
object when it is evaluated, the behavior is undefined.
- From N3220, 6.5.4.4 p3:
The unary & operator yields the address of its operand. If the
operand has type "type", the result has type "pointer to type". If
the operand is the result of a unary * operator, neither that operator
nor the & operator is evaluated and the result is as if both were
omitted, except that the constraints on the operators still apply and
the result is not an lvalue. Similarly, if the operand is the result
of a [] operator, neither the & operator nor the unary * that is
implied by the [] is evaluated and the result is as if the & operator
were removed and the [] operator were changed to a + operator.
=> In short, this is saying that C guarantees these identities:
1. &(*p) is equivalent to p
2. &(p[n]) is equivalent to p + n
As a consequence, &(*p) doesn't result in the evaluation of *p, only
the evaluation of p (and similar for []). There is no corresponding
special carve-out for ->.
See also: https://pvs-studio.com/en/blog/posts/cpp/0306/
After this patch, HAProxy can run without crashing after building w/
clang-19 -fsanitize=undefined -fno-sanitize=function,alignment
We'll use this to improve the definition of container_of(). Let's define
it if it does not exist. We can rely on __builtin_offsetof() on recent
enough compilers.
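The fallback can be sketched like this (hedged: the macro names are illustrative, ebtree's actual ones may differ):

```c
#include <assert.h>
#include <stddef.h>

/* Prefer the compiler builtin; keep the technically-UB NULL-deref idiom
 * only for ancient compilers that lack it.
 */
#if defined(__GNUC__) && (__GNUC__ >= 4)
# define eb_offsetof(type, field) __builtin_offsetof(type, field)
#else
# define eb_offsetof(type, field) ((size_t)&(((type *)0)->field))
#endif

/* container_of() then derives naturally from it */
#define eb_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - eb_offsetof(type, member)))

struct item {
    char tag;
    long node;
};
```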
IMPORT: ebst: use prefetching in lookup() and insert()
While the previous optimizations couldn't be preserved due to the
possibility of out-of-bounds accesses, at least the prefetch is useful.
A test on treebench shows that for 64k short strings, the lookup time
falls from 276 to 199ns per lookup (28% savings), and the insert falls
from 311 to 296ns (4.9% savings), which are pretty respectable, so
let's do this.
IMPORT: ebtree: only use __builtin_prefetch() when supported
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.
Willy Tarreau [Thu, 12 Jun 2025 22:13:06 +0000 (00:13 +0200)]
IMPORT: eb32/64: optimize insert for modern CPUs
Similar to previous patches, let's improve the insert() descent loop to
avoid discovering mandatory data too late. The change here is even
simpler than previous ones: a prefetch was installed and troot is
calculated before the last instruction in a speculative way. This was enough
to gain +50% insertion rate on random data.
Willy Tarreau [Sun, 8 Jun 2025 09:50:59 +0000 (11:50 +0200)]
IMPORT: ebmb: optimize the lookup for modern CPUs
This is the same principles as for the latest improvements made on
integer trees. Applying the same recipes made the ebmb_lookup()
function jump from 10.07 to 12.25 million lookups per second on a
10k random values tree (+21.6%).
It's likely that the ebmb_lookup_longest() code could also benefit
from this, though this was neither explored nor tested.
Willy Tarreau [Sun, 8 Jun 2025 17:51:49 +0000 (19:51 +0200)]
IMPORT: eb32/eb64: place an unlikely() on the leaf test
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
Willy Tarreau [Sun, 8 Jun 2025 17:47:02 +0000 (19:47 +0200)]
IMPORT: eb32: drop the now useless node_bit variable
This one was previously used to preload from the node and keep a copy
in a register on i386 machines with few registers. With the new more
optimal code it's totally useless, so let's get rid of it. By the way
the 64 bit code didn't use that at all already.
Willy Tarreau [Sat, 7 Jun 2025 11:12:40 +0000 (13:12 +0200)]
IMPORT: eb32/eb64: use a more parallelizable check for lack of common bits
Instead of shifting the XOR value right and comparing it to 1, which
roughly requires 2 sequential instructions, better test if the XOR has
any bit above the current bit, which means any bit set among those
strictly higher, or in other words that XOR & (-bit << 1) is non-zero.
This is one less instruction in the fast path and gives another nice
performance gain on random keys (in million lookups/s):
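The identity behind the new check can be verified numerically (hedged sketch with made-up function names; here bit is the precomputed power of two 1 << b):

```c
#include <assert.h>
#include <stdint.h>

/* old form: shift then compare, two dependent instructions */
static int above_serial(uint32_t x, unsigned b)
{
    return (x >> b) > 1;
}

/* new form: with bit == 1u << b, (-bit << 1) has exactly the bits
 * strictly above b set, and the mask can be computed independently
 * of x, so the CPU can parallelize it.
 */
static int above_parallel(uint32_t x, uint32_t bit)
{
    return (x & (-bit << 1)) != 0;
}
```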
Willy Tarreau [Sat, 7 Jun 2025 12:36:16 +0000 (14:36 +0200)]
IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
The current code calculates the next troot based on a calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.
Instead we should help the CPU by preloading data early from the node
and calculating troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.
What this patch does is relatively simple:
- we try to prefetch the next two branches as soon as the
node is known, which will help dereference the selected node in
the next iteration; it was shown that it only works with the next
changes though, otherwise it can reduce the performance instead.
In practice the prefetching will start a bit later once the node
is really in the cache, but since there's no dependency between
these instructions and any other one, we let the CPU optimize as
it wants.
- we preload all important data from the node (next two branches,
key and node.bit) very early even if not immediately needed.
This is cheap, it doesn't cause any pipeline stall and speeds
up later operations.
- we pre-calculate 1<<bit that we assign into a register, so as
to avoid serializing instructions when deciding which branch to
take.
- we assign the troot based on a ternary operation (or if/else) so
that the CPU knows upfront the two possible next addresses without
waiting for the end of a calculation and can prefetch their contents
every time the branch prediction unit guesses right.
Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):
The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).
Another point, the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.
It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).
Willy Tarreau [Wed, 11 Jun 2025 20:00:18 +0000 (22:00 +0200)]
IMPORT: ebtree: delete unusable ebpttree.c
Since commit 21fd162 ("[MEDIUM] make ebpttree rely solely on eb32/eb64
trees") it was no longer used and no longer builds. The commit message
mentions that the file is no longer needed, probably that a rebase failed
and left the file there.
MINOR: counters: document that tg shared counters are tied to shm-stats-file mapping
Let's explicitly mention that fe_counters_shared_tg and
be_counters_shared_tg structs are embedded in shm_stats_file_object
struct so any change in those structs will result in shm stats file
incompatibility between processes, thus extra precaution must be
taken when making changes to them.
Note that the provisioning made in shm_stats_file_object struct could
be used to add members to {fe,be}_counters_shared_tg without changing
shm_stats_file_object struct size if needed in order to preserve
shm stats file version.
CLEANUP: log: remove deadcode in px_parse_log_steps()
When logsteps proxy storage was migrated from eb nodes to bitmasks in
6a92b14 ("MEDIUM: log/proxy: store log-steps selection using a bitmask,
not an eb tree"), some unused eb node related code was left over in
px_parse_log_steps().
Not only is this code unused, it also resulted in wasted memory since
an eb node was allocated for nothing.
BUG/MEDIUM: pattern: fix possible infinite loops on deletion (try 2)
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
While this bug is 3.3 only, it turns out that 3.2 is also affected by
the incorrect loop construct in pat_ref_set_from_node(), where it's
possible to run an infinite loop since commit 010c34b8c7 ("MEDIUM:
pattern: consider gen_id in pat_ref_set_from_node()") due to the
"continue" statement being placed before the ebmb_next_dup() call.
As such the relevant part of this fix (pat_ref_set_from_elt) will
need to be backported to 3.2.
Revert "BUG/MEDIUM: pattern: fix possible infinite loops on deletion"
This reverts commit 359a829ccb8693e0b29808acc0fa7975735c0353.
The fix is neither sufficient nor correct (it triggers ASAN). Better
redo it cleanly rather than accumulate invalid fixes.
CI: scripts: add support for git in openssl builds
Add support for git releases downloaded from github in openssl builds:
- GIT_TYPE variable allows you to choose between "branch" or "commit"
- OPENSSL_VERSION variable supports a "git-" prefix
- "git-${commit_id}" is stored in .openssl_version instead of the branch
name for version comparison.
BUG/MEDIUM: pattern: fix possible infinite loops on deletion
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.
CLEANUP: vars: use the item API for the variables trees
The variables trees use the immediate cebtree API, better use the
item one which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
MEDIUM: connection: reintegrate conn_hash_node into connection
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
we're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
MEDIUM: connection: move idle connection trees to ceb64
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
struct            before   after   delta
conn_hash_node      56       32     -24
srv_per_thread      96       72     -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
MINOR: connection: pass the thread number to conn_delete_from_tree()
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
CLEANUP: backend: factor the connection lookup loop
The connection lookup loop is made of two identical blocks, one looking
in the idle or safe lists and the other one looking into the safe list
only. The second one is skipped if a connection was found or if the request
looks for a safe one (since already done). Also the two are slightly
different due to leftovers from earlier versions in that the second one
checks for safe connections and not the first one, and the second one
sets is_safe which is not used later.
Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
Willy Tarreau [Sun, 24 Aug 2025 10:38:18 +0000 (12:38 +0200)]
CLEANUP: proxy: slightly reorganize fields to plug some holes
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.
At the moment, compared to before this series, we're seeing these
sizes: