git.ipfire.org Git - thirdparty/haproxy.git/log

William Lallemand [Mon, 10 Feb 2025 14:07:05 +0000 (15:07 +0100)]

MEDIUM: initcall: allow to register multiple post_section_parser per section

Before this patch, REGISTER_CONFIG_SECTION() allowed registering one and
only one callback (<post>), called after the parsing of a section.

This was limiting because a post callback could not be registered from
anywhere else in the code.

This patch introduces the new REGISTER_CONFIG_SECTION_POST() macro, which
allows registering an additional post callback for a section keyword from
anywhere.

The feature is implemented by allowing `struct cfg_section` entries that
do not have a `section_parser`, and then iterating over all cfg_section
entries that have a post_section_parser for a given keyword.
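
The iteration described above can be pictured with a small standalone
model. This is only a sketch under stated assumptions: the names
`struct cfg_section_sketch`, `register_section()`,
`run_post_section_parsers()` and the demo callbacks are hypothetical and
do not reproduce HAProxy's actual structures or macros.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for HAProxy's struct cfg_section. */
struct cfg_section_sketch {
    const char *name;                         /* section keyword, e.g. "resolvers" */
    int (*section_parser)(const char *line);  /* may be NULL for post-only entries */
    int (*post_section_parser)(void);         /* may be NULL */
    struct cfg_section_sketch *next;
};

static struct cfg_section_sketch *sections;   /* head of the registration list */

/* Register an entry; several entries may share the same keyword. */
static void register_section(struct cfg_section_sketch *cs)
{
    cs->next = sections;
    sections = cs;
}

/* Call every post_section_parser registered for <name>.
 * Returns the number of callbacks invoked.
 */
static int run_post_section_parsers(const char *name)
{
    struct cfg_section_sketch *cs;
    int called = 0;

    for (cs = sections; cs; cs = cs->next) {
        if (cs->post_section_parser && strcmp(cs->name, name) == 0) {
            cs->post_section_parser();
            called++;
        }
    }
    return called;
}

/* Demo callbacks, used only to exercise the sketch. */
static int post_a_calls, post_b_calls;
static int demo_parser(const char *line) { (void)line; return 0; }
static int demo_post_a(void) { post_a_calls++; return 0; }
static int demo_post_b(void) { post_b_calls++; return 0; }
```

In this model, an entry with a NULL section_parser plays the role of a
callback added from elsewhere in the code for an existing keyword.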

William Lallemand [Wed, 12 Feb 2025 11:29:48 +0000 (12:29 +0100)]

CLEANUP: mworker: "program" section does not have a post_section_parser anymore

The "program" section does not have a post_section_parser anymore so no
need to make an exception for it.

William Lallemand [Wed, 12 Feb 2025 11:31:11 +0000 (12:31 +0100)]

BUG/MINOR: mworker: post_section_parser for the last section in discovery

Previous patch 2c270a05f ("BUG/MINOR: mworker: section ignored in
discovery after a post_section_parser") needs an adjustment for the last
section of the file.

Indeed the post_section_parser of the last section must not be called in
discovery mode.

Must be backported to 3.1.

William Lallemand [Wed, 12 Feb 2025 11:09:05 +0000 (12:09 +0100)]

BUG/MINOR: mworker: section ignored in discovery after a post_section_parser

When a new section is discovered, the post_section_parser of the
previous section is called. However, in the new master-worker mode, the
discovery mode skips the post_section_parser. But instead of then
trying to parse the current section keyword, it would completely skip
the current line.

This is a minor bug since there aren't many sections with a
post_section_parser, and not many sections to parse in discovery
mode.

But this could be reproduced like this:

global
        expose-deprecated-directives

resolvers res
        parse-resolv-conf

program foo
        command sleep 10

program bar
        command sleep 10

The 'resolvers' section has a post_section_parser which will be ignored
in discovery mode with the consequence of ignoring the first program
section.

This must be backported to 3.1.

Amaury Denoyelle [Wed, 12 Feb 2025 09:55:51 +0000 (10:55 +0100)]

MINOR: quic: simplify length calculation for STREAM/CRYPTO frames

STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.

It is complex to determine the optimal Length value before copying the
payload data in the remaining buffer space. As such, helper functions
were implemented to calculate this. However, the CRYPTO and STREAM frame
encoding implementations were not completely aligned, which made the
code harder to follow.

The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
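
To illustrate why the Length value is awkward to compute, here is a
minimal sketch of the kind of calculation such a helper has to perform.
It is not the actual quic_int_cap_length() implementation: the names
`varint_size()` and `cap_length()` are hypothetical, and only the QUIC
variable-length integer sizes (RFC 9000, section 16) are taken as given.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of a QUIC variable-length integer (RFC 9000, section 16).
 * Values are assumed to be below 2^62.
 */
static size_t varint_size(uint64_t v)
{
    if (v < (1ULL << 6))
        return 1;
    if (v < (1ULL << 14))
        return 2;
    if (v < (1ULL << 30))
        return 4;
    return 8;
}

/* Hypothetical helper: largest payload length <len> such that a varint
 * Length field encoding <len>, followed by <len> bytes, fits in <room>.
 */
static size_t cap_length(size_t room)
{
    size_t len;

    if (room == 0)
        return 0;              /* not even the Length field fits */

    len = room - 1;            /* optimistic: 1-byte Length field */
    while (len > 0 && varint_size(len) + len > room)
        len--;                 /* only a few steps, near a varint size boundary */
    return len;
}
```

With 65 bytes of room, a 64-byte payload would need a 2-byte Length
field, so only 63 bytes fit; with 66 bytes of room, 64 bytes fit.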

Amaury Denoyelle [Tue, 11 Feb 2025 13:35:52 +0000 (14:35 +0100)]

BUG/MINOR: quic: fix CRYPTO payload size calculation for encoding

Function max_stream_data_size() is used to determine the payload length
of a CRYPTO frame. It takes into account that the CRYPTO length field is
a variable length integer.

The implemented calculation was incorrect as it reserved too much space
for the frame header. This error is mostly because max_stream_data_size()
reuses max_available_room(), which also reserves space for a variable
length integer. This results in CRYPTO frames 1 to 2 bytes shorter
than the maximum achievable value, which in the end produces datagrams
shorter than the MTU.

Fix max_stream_data_size() implementation. It is now merely a wrapper on
max_available_room(). This ensures that CRYPTO frame encoding is now
properly optimized to use the MTU available.

This should be backported up to 2.6.

Amaury Denoyelle [Tue, 11 Feb 2025 13:34:57 +0000 (14:34 +0100)]

BUG/MINOR: quic: reserve length field for long header encoding

Long header packets have a mandatory Length field, which contains the
size of Packet number and payload, encoded as a variable-length integer.
Its value can thus only be determined after the payload size is known,
which depends on the remaining buffer space after this variable-length
field.

Packet payloads are encoded in two steps. First, a list of input frames
is processed until the packet buffer is full. CRYPTO and STREAM frame
payloads can be split if needed to fill the buffer. Real encoding is
then performed as a second stage operation, first with the Length field,
then with the selected frames themselves.

Before this patch, no space was reserved in the buffer for the Length
field when attaching the frames to the packet. This could result in an
error as the packet payload would be too large for the remaining space.

In practice, this issue was rarely encountered, mostly as a side-effect
of another issue linked to CRYPTO frame encoding. Indeed, a wrong
calculation is performed on CRYPTO splitting, which results in a frame
payload shorter by a few bytes than expected. This however ensured there
would always be enough room for the Length field and payload during
encoding. As CRYPTO frames are the only big enough content emitted with
a Long header packet, this makes the current issue mostly non
reproducible.

Fix the original issue by reserving some space for Length field prior to
frame payload calculation, using a maximum value based on the remaining
room space. Packet length is then reduced if needed when encoding is
performed, which ensures there is always enough room for the selected
frames.
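
The reservation principle can be sketched as follows. This is a
simplified model, not the code of the patch: it ignores the packet
number, and the names `varint_size()`, `frames_room()` and
`encoded_size()` are hypothetical. The worst-case Length size is derived
from the remaining room, since the Length value can never exceed it.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of a QUIC variable-length integer (RFC 9000, section 16). */
static size_t varint_size(uint64_t v)
{
    if (v < (1ULL << 6))
        return 1;
    if (v < (1ULL << 14))
        return 2;
    if (v < (1ULL << 30))
        return 4;
    return 8;
}

/* Step 1: room left for frames once a worst-case Length field is reserved. */
static size_t frames_room(size_t room)
{
    size_t reserved = varint_size(room);

    return room > reserved ? room - reserved : 0;
}

/* Step 2: bytes really used once the final payload size is known. The
 * Length field may then turn out smaller than what was reserved.
 */
static size_t encoded_size(size_t payload)
{
    return varint_size(payload) + payload;
}
```

Because the reservation is a maximum, the encoded result always fits,
at the cost of possibly leaving a byte or so unused near a varint size
boundary.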

Note that the other issue impacting CRYPTO frame encoding is not yet
fixed. This could result in datagrams with Long header packets not
completely extended to the full MTU. The issue will be addressed in
another patch.

This should be backported up to 2.6.

Willy Tarreau [Tue, 11 Feb 2025 16:24:19 +0000 (17:24 +0100)]

MAJOR: leastconn: postpone the server's repositioning under contention

When leastconn is used under many threads, there can be a lot of
contention on leastconn, because the same node has to be moved around
all the time (when picking it and when releasing it). In GH issue #2861
it was noticed that 46 threads out of 64 were waiting on the same lock
in fwlc_srv_reposition().

In such a case, the accuracy of the server's key becomes quite irrelevant
because nobody cares if the same server is picked twice in a row and the
next one twice again.

While other approaches in the past considered using a floating key to
avoid moving the server each time (which was not compatible with the
round-robin rule for equal keys), here a more drastic solution is needed.
What we're doing instead is that we turn this lock into a trylock. If we
can grab it, we do the job. If we can't, then we just wake up a server's
tasklet dedicated to this. That tasklet will then try again slightly
later, knowing that during this short time frame, the server's position
in the queue is slightly inaccurate. Note that any thread touching the
same server will also reposition it and save that work for next time.
Also if multiple threads wake the tasklet up, then that's fine, their
calls will be merged and a single lock will be taken in the end.
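
The trylock-or-defer pattern described above can be sketched with C11
atomics. This is an illustration under stated assumptions, not HAProxy
code: `struct server_sketch`, `try_reposition()` and
`deferred_requeue()` are hypothetical names, an atomic_flag stands in
for the tree lock, and a boolean stands in for waking up the tasklet.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct server_sketch {
    atomic_flag tree_lock;        /* stands in for the contended LB tree lock */
    atomic_bool requeue_pending;  /* stands in for "the tasklet was woken up" */
    atomic_int  served;           /* live connection count (the server's key) */
    int         position;         /* key currently reflected in the tree */
};

static void server_sketch_init(struct server_sketch *s)
{
    atomic_flag_clear(&s->tree_lock);
    atomic_init(&s->requeue_pending, false);
    atomic_init(&s->served, 0);
    s->position = 0;
}

/* Try to reposition the server. On contention, do not wait: just record
 * that a requeue is needed. Several callers setting the flag are merged.
 */
static bool try_reposition(struct server_sketch *s)
{
    if (atomic_flag_test_and_set_explicit(&s->tree_lock, memory_order_acquire)) {
        atomic_store(&s->requeue_pending, true);
        return false;
    }
    s->position = atomic_load(&s->served);
    atomic_flag_clear_explicit(&s->tree_lock, memory_order_release);
    return true;
}

/* What the deferred job would do slightly later: retry once if needed. */
static void deferred_requeue(struct server_sketch *s)
{
    if (atomic_exchange(&s->requeue_pending, false))
        try_reposition(s);
}
```

Between the failed attempt and the deferred retry, `position` is stale,
which mirrors the short window of inaccuracy accepted by the patch.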

Testing this on a 24-core EPYC 74F3 showed a significant performance
boost from 382krps to 610krps. The overhead of fwlc_srv_reposition()
reported by perf top dropped from 43% to 2.5%:

Before:
  Overhead  Shared Object             Symbol
    43.46%  haproxy-master-inlineebo  [.] fwlc_srv_reposition
    21.20%  haproxy-master-inlineebo  [.] fwlc_get_next_server
     0.91%  haproxy-master-inlineebo  [.] process_stream
     0.75%  [kernel]                  [k] ice_napi_poll
     0.51%  [kernel]                  [k] tcp_recvmsg
     0.50%  [kernel]                  [k] ice_start_xmit
     0.50%  [kernel]                  [k] tcp_ack

After:
  Overhead  Shared Object             Symbol
    30.37%  haproxy                   [.] fwlc_get_next_server
     2.51%  haproxy                   [.] fwlc_srv_reposition
     1.91%  haproxy                   [.] process_stream
     1.46%  [kernel]                  [k] ice_napi_poll
     1.36%  [kernel]                  [k] tcp_recvmsg
     1.04%  [kernel]                  [k] tcp_ack
     1.00%  [kernel]                  [k] skb_release_data
     0.96%  [kernel]                  [k] ice_start_xmit
     0.91%  haproxy                   [.] conn_backend_get
     0.82%  haproxy                   [.] connect_server
     0.82%  haproxy                   [.] run_tasks_from_lists

Tested on an Ampere Altra with 64 aarch64 cores dedicated to haproxy,
the gain is even more visible (3.6x):

  Before: 311-323k rps, 3.16-3.25ms, 6400% CPU
  Overhead  Shared Object     Symbol
    55.69%  haproxy-master    [.] fwlc_srv_reposition
    33.30%  haproxy-master    [.] fwlc_get_next_server
     0.89%  haproxy-master    [.] process_stream
     0.45%  haproxy-master    [.] h1_snd_buf
     0.34%  haproxy-master    [.] run_tasks_from_lists
     0.32%  haproxy-master    [.] connect_server
     0.31%  haproxy-master    [.] conn_backend_get
     0.31%  haproxy-master    [.] h1_headers_to_hdr_list
     0.24%  haproxy-master    [.] srv_add_to_idle_list
     0.23%  haproxy-master    [.] http_request_forward_body
     0.22%  haproxy-master    [.] __pool_alloc
     0.21%  haproxy-master    [.] http_wait_for_response
     0.21%  haproxy-master    [.] h1_send

  After: 1.21M rps, 0.842ms, 6400% CPU
  Overhead  Shared Object     Symbol
    17.44%  haproxy           [.] fwlc_get_next_server
     6.33%  haproxy           [.] process_stream
     4.40%  haproxy           [.] fwlc_srv_reposition
     3.64%  haproxy           [.] conn_backend_get
     2.75%  haproxy           [.] connect_server
     2.71%  haproxy           [.] h1_snd_buf
     2.66%  haproxy           [.] srv_add_to_idle_list
     2.33%  haproxy           [.] run_tasks_from_lists
     2.14%  haproxy           [.] h1_headers_to_hdr_list
     1.56%  haproxy           [.] stream_set_backend
     1.37%  haproxy           [.] http_request_forward_body
     1.35%  haproxy           [.] http_wait_for_response
     1.34%  haproxy           [.] h1_send

And at similar loads, the CPU usage considerably drops (3.55x), as
well as the response time (10x):

  After: 320k rps, 0.322ms, 1800% CPU
  Overhead  Shared Object     Symbol
     7.62%  haproxy           [.] process_stream
     4.64%  haproxy           [.] h1_headers_to_hdr_list
     3.09%  haproxy           [.] h1_snd_buf
     3.08%  haproxy           [.] h1_process_demux
     2.22%  haproxy           [.] __pool_alloc
     2.14%  haproxy           [.] connect_server
     1.87%  haproxy           [.] h1_send
   > 1.84%  haproxy           [.] fwlc_srv_reposition
     1.84%  haproxy           [.] run_tasks_from_lists
     1.77%  haproxy           [.] sock_conn_iocb
     1.75%  haproxy           [.] srv_add_to_idle_list
     1.66%  haproxy           [.] http_request_forward_body
     1.65%  haproxy           [.] wake_expired_tasks
     1.59%  haproxy           [.] h1_parse_msg_hdrs
     1.51%  haproxy           [.] http_wait_for_response
   > 1.50%  haproxy           [.] fwlc_get_next_server

The cost of fwlc_get_next_server() naturally increases as the server
count increases, but now has no visible effect on updates. The load
distribution remains unchanged compared to the previous approach,
the weight still being respected.

For further improvements to the fwlc algo, please consult github
issue #881 which centralizes everything related to this algorithm.

Willy Tarreau [Tue, 11 Feb 2025 16:18:36 +0000 (17:18 +0100)]

MEDIUM: server: allocate a tasklet for asynchronous requeuing

This creates a tasklet that only expects to be called when the LB
algorithm is under contention when trying to reposition the server
in its tree. Indeed, that's one of the operations that usually
requires to take a write lock on a highly contended area, often
for very little benefits under contention; indeed, under load, if
a server keeps its previous position for a few extra microseconds,
usually there's no harm. Thus this new tasklet can be woken up by
the LB algo to ask the server to later call lbprm.server_requeue().
It does nothing else.

Willy Tarreau [Tue, 11 Feb 2025 16:16:14 +0000 (17:16 +0100)]

MINOR: lbprm: add a new callback ->server_requeue to the lbprm

This callback will be used to reposition a server to its expected
position regardless of the fact that it was taken or dropped. It
will only be used by supporting LB algos. For now, only fwlc defines
it and assigns it to fwlc_srv_reposition(). At the moment it's not
used yet.

Willy Tarreau [Mon, 10 Feb 2025 10:15:44 +0000 (11:15 +0100)]

DEBUG: thread: reduce the struct lock_stat to store only 30 buckets

Storing only 30 buckets means we only keep 256 bytes per label. This
further simplifies address calculation and reduces the memory used
without complicating the locking code. It means we won't measure wait
times larger than a second but we're not supposed to face this as it
would trigger the watchdog anyway. It may become a little bit tight if
measuring using rdtsc() instead of now_mono_time() though (typically
the limit would be around 350ms for a 3 GHz CPU).

Willy Tarreau [Mon, 10 Feb 2025 10:08:45 +0000 (11:08 +0100)]

DEBUG: thread: make lock_stat per operation instead of for all operations

It's more convenient (and more readable) to have the lock stats arranged
by operation type (read, seek, write). It will also allow to later simplify
the structure format and the bucket address calculation. Now lock_stat[]
got split into lock_stats_rd[], lock_stats_sk[], lock_stats_wr[].

Willy Tarreau [Mon, 10 Feb 2025 09:56:44 +0000 (10:56 +0100)]

DEBUG: thread: don't keep the redundant _locked counter

Now that we have our sums by bucket, the _locked counter is redundant
since it's always equal to the sum of all entries. Let's just get rid
of it and replace its consumption with a loop over all buckets, this
will reduce the overhead of taking each lock at the expense of a tiny
extra effort when dumping all locks, which we don't care about.

Willy Tarreau [Mon, 10 Feb 2025 09:41:35 +0000 (10:41 +0100)]

DEBUG: thread: report the wait time buckets for lock classes

In addition to the total/average wait time, we now also store the
wait time in 2^N buckets. There are 32 buckets for each type (read,
seek, write), allowing to store wait times from 1-2ns to 2.1-4.3s,
which is quite sufficient, even if we'd want to switch from NS to
CPU cycles in the future. The counters are only reported for non-
zero buckets so as not to visually pollute the output.

This significantly inflates the lock_stat struct, which is now
aligned to 256 bytes and rounded up to 1kB. But that's not really
a problem, given that there's only one per lock label.
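
As an illustration of 2^N bucketing, the bucket for a wait time can be
derived from the position of its highest set bit. This is a minimal
sketch, not the code of the patch: `bucket_index()` is a hypothetical
name, and the handling of a zero wait time and of values beyond the last
bucket are assumptions.

```c
#include <stdint.h>

#define NB_BUCKETS 32

/* Return the 2^N bucket for a wait time in nanoseconds: bucket 0 covers
 * 1-2ns, bucket 1 covers 2-4ns, and so on up to bucket 31 (about 2.1-4.3s).
 * Assumptions: 0ns is counted in bucket 0, and larger values are clamped
 * to the last bucket.
 */
static unsigned int bucket_index(uint64_t ns)
{
    unsigned int b = 0;

    while (ns >>= 1)
        b++;                   /* b ends up as floor(log2(ns)) */

    if (b >= NB_BUCKETS)
        b = NB_BUCKETS - 1;
    return b;
}
```

For example, a 1000ns wait lands in bucket 9, which covers 512ns to
1024ns.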

Willy Tarreau [Mon, 10 Feb 2025 08:26:04 +0000 (09:26 +0100)]

DEBUG: thread: make lock time computation more consistent

The lock time computation was a bit inconsistent between functions,
particularly those using a try_lock. Some of them would count the lock
as taken without counting the time, others would simply not count it.
This is essentially due to the way the time is retrieved, as it was
done inside the atomic increment.

Let's instead always use start_time to carry the elapsed time, by
presetting it to the negative time before the event and adding the
positive time after, so that it finally contains the duration. Then
depending on the try lock's success, we add the result or not. This
was generalized to all lock functions for consistency, and because
this will be handy for future changes.
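
The "negative then positive" accounting can be sketched as below. The
names `struct lock_stats_sketch`, `timed_trylock()` and the fake clock
and lock helpers are hypothetical and exist only to make the example
deterministic; the real code uses HAProxy's own clock and lock
primitives.

```c
#include <stdbool.h>
#include <stdint.h>

struct lock_stats_sketch {
    uint64_t wait_ns;   /* cumulated wait time of successful attempts */
    uint64_t calls;     /* number of successful attempts */
};

typedef uint64_t (*clock_fn)(void);
typedef bool (*trylock_fn)(void);

/* Measure a try-lock and account for it only when it succeeds. */
static bool timed_trylock(struct lock_stats_sketch *st, clock_fn now, trylock_fn trylock)
{
    uint64_t start_time = 0 - now();  /* preset to the negative time (wraps, well defined) */
    bool ok = trylock();

    start_time += now();              /* now contains the elapsed duration */
    if (ok) {
        st->wait_ns += start_time;
        st->calls++;
    }
    return ok;
}

/* Deterministic helpers for the demonstration: each clock read advances
 * the fake time by 10ns.
 */
static uint64_t fake_time;
static uint64_t fake_clock(void) { fake_time += 10; return fake_time; }
static bool always_ok(void) { return true; }
static bool always_fail(void) { return false; }
```

Carrying the duration in a single variable keeps the measurement
separate from the decision to count it, which is what makes the
try-lock case uniform with the other lock functions.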

Willy Tarreau [Mon, 10 Feb 2025 07:02:42 +0000 (08:02 +0100)]

DEBUG: thread: report the spin lock counters as seek locks

Technically speaking, spin locks use a seek lock, not a write lock,
so better count them appropriately for consistency (lock time, or
function calls count).

Willy Tarreau [Mon, 10 Feb 2025 16:59:40 +0000 (17:59 +0100)]

BUG/MEDIUM: debug: close a possible race between thread dump and panic()

The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicking while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.

This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.

In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.

This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.

Willy Tarreau [Sat, 8 Feb 2025 04:53:40 +0000 (05:53 +0100)]

[RELEASE] Released version 3.2-dev5

Released version 3.2-dev5 with the following main changes :
    - BUG/MINOR: ssl: put ssl_sock_load_ca under SSL_NO_GENERATE_CERTIFICATES
    - CLEANUP: ssl: rename ssl_sock_load_ca to ssl_sock_gencert_load_ca
    - CLEANUP: ssl: move ssl_sock_gencert_load_ca declaration in ssl_gencert.h
    - CLEANUP: tree-wide: define and use acl_match_cond() helper
    - MINOR: epoll: permit to mask certain specific events
    - MINOR: proxies: Add a per-thread group field to struct proxy.
    - MINOR: Add fields to the per-thread group field in struct server.
    - MINOR: proxies/servers: Calculate queueslength and use it.
    - MEDIUM: servers/proxies: Switch to using per-tgroup queues.
    - BUG/MINOR: stream: Properly handle "on-marked-up shutdown-backup-sessions"
    - MEDIUM: stream: Map task wake up reasons to dedicated stream events
    - MEDIUM: stream: No longer use TASK_F_UEVT* to shut a stream down
    - BUILD: tools: fix build on BSD by dropping the ETIME check
    - MINOR: queues: use __ha_cpu_relax() on failed CAS.
    - BUILD: queues: Use unsigned int when needed
    - BUILD: ssl: allow to build without the renegotiation API of WolfSSL
    - BUILD: ssl: more cleaner approach to WolfSSL without renegotiation
    - BUG/MEDIUM: chunk: make sure to flush the trash pool before resizing
    - MINOR: quic: remove references to burst in quic-cc-algo parsing
    - MINOR: quic: allow BBR testing without pacing
    - MINOR: quic: transform pacing settings into a global option
    - MAJOR: quic: mark pacing as stable and enable it by default
    - MINOR: quic: mark BBR as stable
    - MINOR: quic: define quic_tune
    - BUILD: quic: fix overflow in global tune
    - DEBUG: fd: add a counter of takeovers of an FD since it was last opened
    - MINOR: fd: add a generation number to file descriptors
    - DEBUG: epoll: store and compare the FD's generation count with reported event
    - MEDIUM: epoll: skip reports of stale file descriptors
    - MINOR: mux-h1: Add masks to group H1S DEMUX and MUX errors
    - BUG/MINOR: mux-h1: Only report a SE error on demux error
    - MINOR: tevt: Add the termination events log's foundations
    - MINOR: tevt/stconn: Add a termination events log in the SE descriptor
    - MINOR: tevt/mux-h1: Report termination events for the H1C and H1S
    - MINOR: tevt/mux-h2: Report termination events for the H2C
    - MINOR: tevt/stream/stconn: Report termination events for stream and sc
    - MINOR: tevt/conn: Report intercepted event for L4 rules
    - MINOR: tevt/mux-h1/mux-h2: Add termination events log when dumping mux info
    - MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs
    - MINOR: tevt/mux-pt: Add support for termination event logs
    - MINOR: tevt/connection: Add dedicated termination events for lower locations
    - MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations
    - MINOR: tevt/stconn: Be more accurate to report shutw events
    - MEDIUM: tevt/stconn/stream: Add dedicated termination events for stream location
    - MINOR: tevt: Don't duplicate termination event during reporting
    - MINOR: tevt/applet:  Add limited support for termination event logs for applets
    - MINOR: tevt: Add a sample to get termination events for all locations
    - MINOR: tevt: Improve function to convert a termination events log to string
    - REORG: tevt/connection: Move enums at the end of the header file
    - MINOR: tevt/dev: Add term_events tool
    - MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events
    - MINOR: tevt/dev: Parse tuple of termination events
    - BUG/MEDIUM: htx: wrong count computation in htx_xfer_blks()
    - DOC: htx: clarify <mark> parameter for htx_xfer_blks()
    - BUILD: quic: remove GCC undefined error in qc_release_lost_pkts()
    - MEDIUM: htx: prevent <mark> to copy incomplete headers in htx_xfer_blks()
    - BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records
    - BUG/MINOR: tevt/http-ana: Remove badly placed event reports
    - DEBUG: http-ana: Remove debug counters from HTTP analyzers
    - DEBUG: mux-h1: Remove some debug counters
    - BUG/MINOR: tcp-rules: Don't forward close during tcp-response content rules eval
    - MEDIUM: stream: interrupt costly rulesets after too many evaluations
    - BUG/MINOR: http-check: Don't pretend a C-L header is set before adding it
    - BUILD: ssl: remove a boringssl definition defined by recent boringssl libs
    - BUG/MINOR: tevt/mux-h2: Set truncated receive/eos events at SE level on error
    - BUG/MEDIUM: flt-spoe: Set/test applet flags instead of SE flags from I/O handler
    - BUG/MEDIUM: applet: Don't pretend to have more data to handle EOI/EOS/ERROR
    - BUG/MEDIUM: flt-spoe: Properly handle end of stream from the SPOE applet
    - MINOR: flt-spoe: Report end of input immediately after applet init
    - MINOR: mux-spop: Report EOI on the SE when a ACK is received for a stream
    - MINOR: mux-spop: Set SPOP_CF_ERROR flag on connection error only
    - MINOR: tevt/mux-spop:  Report termination events for the SPOP connect/stream
    - CLEANUP: mux-spop: Remove useless comments
    - MINOR: mux-spop: Dump info about connections and streams in dedicated functions
    - MINOR: mux-spop: Implement .show_sd callback function
    - MEDIUM: mux-fcgi: Add a function to propagate termination flags from fstrm to SE
    - BUG/MEDIUM: mux-fcgi: Propagate flags to SE in fcgi_strm_wake_one_stream
    - MINOR: tevt/mux-fcgi:  Report termination events for the FCGI connect/stream
    - MINOR: mux-fcgi: Dump info about connections and streams in dedicated functions
    - MINOR: mux-spop/mux-fcgi: Add support of the debug string for logs
    - BUG/MINOR: cli: Don't set SE flags from the cli applet
    - BUG/MINOR: cli: Fix memory leak on error for _getsocks command
    - BUG/MINOR: cli: Fix a possible infinite loop in _getsocks()
    - BUG/MINOR: config/userlist: Support one 'users' option for 'group' directive
    - BUG/MINOR: auth: Fix a leak on error path when parsing user's groups
    - BUG/MINOR: flt-trace: Support only one name option
    - MINOR: filters: Improve errors formatting during filters parsing
    - BUG/MINOR: stats-json: Define JSON_INT_MAX as a signed integer
    - DOC: option redispatch should mention persist options
    - BUG/MINOR: debug: make "debug dev sched" accept a negative TID
    - BUG/MINOR: debug: make sure the "debug dev sched" tasks don't block stopping
    - IMPORT: plock: export the uninlined version of the lock wait function
    - IMPORT: plock: give higher precedence to W than S
    - IMPORT: plock: lower the slope of the exponential back-off
    - IMPORT: plock: use cpu_relax() for a shorter time in EBO
    - Revert "IMPORT: plock: export the uninlined version of the lock wait function"
    - BUG/MEDIUM: ssl: choosing correct certificate using RSA-PSS with TLSv1.3

William Lallemand [Fri, 7 Feb 2025 19:28:39 +0000 (20:28 +0100)]

BUG/MEDIUM: ssl: choosing correct certificate using RSA-PSS with TLSv1.3

The clienthello callback was written when TLSv1.3 was not yet out, and
signature algorithms have changed since then.

With TLSv1.2, the least significant byte was used to determine the
SignatureAlgorithm, which could be rsa(1), dsa(2), ecdsa(3).
https://datatracker.ietf.org/doc/html/rfc5246#section-7.4.1.4.1

This was used to choose which type of certificate to push to the client.

But TLSv1.3 changed that, and introduced new RSA-PSS algorithms that
do not have the least significant byte set to 1.
https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3

This would result in choosing the wrong certificate when both an RSA and
an ECDSA certificate are in the configuration for the same SNI or default
entry.

This patch fixes the issue by parsing both the hash and signature fields
to check the RSA-PSS signature scheme.

This must fix issue #2852.

This must be backported to every stable version. The code was moved
from ssl_sock.c to ssl_clienthello in recent versions.
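
The difference between the two checks can be sketched as follows. This
is an illustration only, not the code of the patch: the enum and both
function names are hypothetical, and treating the rsa_pss_pss_* schemes
as belonging to the RSA family is an assumption of this sketch. The
code points are those of the RFC 8446 SignatureScheme registry.

```c
#include <stdint.h>

enum sig_family { SIG_OTHER = 0, SIG_RSA, SIG_ECDSA };

/* Old logic as described above: only the least significant byte
 * (the TLSv1.2 SignatureAlgorithm) is inspected.
 */
static enum sig_family classify_tls12_only(uint8_t hash, uint8_t sign)
{
    (void)hash;
    if (sign == 1)
        return SIG_RSA;     /* rsa(1) */
    if (sign == 3)
        return SIG_ECDSA;   /* ecdsa(3) */
    return SIG_OTHER;
}

/* Sketch of a check that looks at both bytes of the signature scheme. */
static enum sig_family classify_with_pss(uint8_t hash, uint8_t sign)
{
    if (hash == 0x08) {
        /* 0x0804..0x0806: rsa_pss_rsae_sha256/384/512
         * 0x0809..0x080b: rsa_pss_pss_sha256/384/512
         * 0x0807, 0x0808: ed25519, ed448 (neither RSA nor ECDSA)
         */
        if ((sign >= 0x04 && sign <= 0x06) || (sign >= 0x09 && sign <= 0x0b))
            return SIG_RSA;
        return SIG_OTHER;
    }
    return classify_tls12_only(hash, sign);
}
```

With rsa_pss_rsae_sha256 (0x0804), the first function does not report
RSA because the low byte is 4, while the second one does.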

Willy Tarreau [Fri, 7 Feb 2025 18:51:15 +0000 (19:51 +0100)]

Revert "IMPORT: plock: export the uninlined version of the lock wait function"

This reverts commit 5496d06b2b1ea276ffb6aec78ffca177b88d89cd.

It breaks the build on Windows which apparently doesn't support the weak
attribute well on functions. It's not a big deal anyway: playing with
build options while debugging still works, though it's less easy to use.

Willy Tarreau [Fri, 7 Feb 2025 16:33:49 +0000 (17:33 +0100)]

IMPORT: plock: use cpu_relax() for a shorter time in EBO

Tests have shown that on modern CPUs it's interesting to wait a bit less
in cpu_relax(). Till now we were looping down to 60 iterations and then
switching to just barriers. Increasing the threshold to 90 iterations
left before getting out of the loop improved the average and max time
to grab a write lock by a few percent (e.g. 10% at 1us, 20% at 256ns
or lower). Higher values tend to progressively lose that gain so let's
stick to this one. This was measured on an EPYC 74F3 like previous
measurements that initially led to this value, and the value might
possibly depend on the mask applied to the loop counter.

This is plock commit 74ca0a7307fa6aec3139f27d3b7e534e1bdb748e.

Willy Tarreau [Fri, 7 Feb 2025 16:20:48 +0000 (17:20 +0100)]

IMPORT: plock: lower the slope of the exponential back-off

Across many tests involving both haproxy's scheduler and forwarded
traffic, various exponents and algorithms were attempted for the EBO
and their effects were measured. It was found that a growth in 1.25^N
limited to 128k cycles consistently gives a better latency than 1.5^N
limited to 256k cycles, without degrading general performance. The
measures of the time to grab a write lock on a 48-thread EPYC show
that the number of occurrences of low times was roughly multiplied by
2-3 while the number of occurrences of times above 64us was reduced
by similar factors, even reaching a factor of 300 at 64us, and the
maximum time was reduced by a factor of 4.

The other variants that were experimented with are:

  m = ((m + (m >> 1)) + 2) & 0x3ffff;            // original
  m = ((m + (m >> 1) + (m >> 3)) + 2) & 0x3ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x3ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x1ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 1) & 0x1ffff;
  m = ((m + (m >> 2) + (m >> 4)) + 1) & 0x1ffff; // lowest CPU on pl_wr test + good perf
  m = ((m + (m >> 2)) + 1) & 0x1ffff;            // even lower cpu usage, lowest max
  m = ((m + (m >> 1) + (m >> 2)) + 1) & 0x1ffff; // correct but slightly higher maxes
  m = ((m + (m >> 1) + (m >> 3)) + 1) & 0x1ffff; // less good than m+m>>2
  m = ((m + (m >> 2) + (m >> 3)) + 1) & 0x1ffff; // better but not as good as m+m>>2
  m = ((m + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good, lower rates on small counts.
  m = ((m + (m >> 2) + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good as well
  m = ((m & 0x7fff) + (m >> 1) + (m >> 4)) + 2;
  m = ((m & 0xffff) + (m >> 1) + (m >> 4)) + 2;

This is plock commit dddd9ee01c522da33c353e2e4d4fd743d8336ec3.
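
To compare slopes, two of the formulas listed above can be wrapped in
functions and iterated. This is only a sketch: the function names are
hypothetical, and the second formula is picked because it matches the
1.25^N growth and 128k limit described above; that it is exactly the
adopted one is an assumption.

```c
#include <stdint.h>

/* Formula marked "original" in the list above: about 1.5x per step,
 * masked to 18 bits (256k).
 */
static uint32_t ebo_step_original(uint32_t m)
{
    return ((m + (m >> 1)) + 2) & 0x3ffff;
}

/* One of the listed variants: about 1.25x per step, masked to 17 bits (128k). */
static uint32_t ebo_step_slow(uint32_t m)
{
    return ((m + (m >> 2)) + 1) & 0x1ffff;
}

/* Count how many steps are taken from 0 before the mask makes the value
 * stop growing.
 */
static unsigned int steps_until_wrap(uint32_t (*step)(uint32_t))
{
    uint32_t m = 0, next;
    unsigned int n = 0;

    while ((next = step(m)) > m) {
        m = next;
        n++;
    }
    return n;
}
```

The lower slope takes more steps to reach its (smaller) ceiling, so the
wait between two attempts grows more gradually.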

Willy Tarreau [Fri, 7 Feb 2025 15:57:28 +0000 (16:57 +0100)]

IMPORT: plock: give higher precedence to W than S

It was noticed in haproxy that in certain extreme cases, a write lock
subject to EBO may fail for a very long time in front of a large set
of readers constantly trying to upgrade to the S state. The reason is
that among many readers, one will succeed in its upgrade, and this
situation can last for a very long time with many readers upgrading
in turn, while the writer waits longer and longer before trying again.

Here we're taking a reasonable approach which is that the write lock
should have a higher precedence in its attempt to grab the lock. What
is done is that instead of fully rolling back in case of conflict with
a pure S lock, the writer will only release its read part in order to
let the S upgrade to W if needed, and finish its operations. This
guarantees no other seek/read/write can enter. Once the conflict is
resolved, the writer grabs the read part again and waits for readers
to be gone (in practice it could even return without waiting since we
know that any possible wanderers would leave or even not be there at
all, but it avoids a complicated loop code that wouldn't improve the
practical situation but inflate the code).

Thanks to this change, the maximum write lock latency on a 48-thread
AMD with a heavily loaded scheduler went down from 256 to 64 ms, and the
number of occurrences of 32ms or more was divided by 300, while all
occurrences of 1ms or less were multiplied by up to 3 (3 for the 4-16ns
cases).

This is plock commit b6a28366d156812f59c91346edc2eab6374a5ebd.

Willy Tarreau [Fri, 7 Feb 2025 15:45:21 +0000 (16:45 +0100)]

IMPORT: plock: export the uninlined version of the lock wait function

The inlining of the lock waiting function was made more easily
configurable with commit 7505c2e ("plock: always expose the inline
version of the lock wait function"). However, the standard one remained
static, but in order to resolve the symbols in "perf top", it's much
better to export it, so let's move "static" with "inline" and leave it
exported when PLOCK_INLINE_EBO is not set.

This is plock commit 3bea7812ec705b9339bbb0ed482a2cd8aa6c185c.

Willy Tarreau [Fri, 7 Feb 2025 17:01:32 +0000 (18:01 +0100)]

BUG/MINOR: debug: make sure the "debug dev sched" tasks don't block stopping

When "debug dev sched" is used to pop up background tasks, these tasks
are never stopped, so we must be careful to stop them when the stopping
flag is set, otherwise they can prevent the process from stopping when
sufficiently numerous (tests went as far as 100 million tasks, leading
to the run queue never being completely purged in one poll round).

No backport is needed since this is only used when debugging and tuning
the scheduler.

Willy Tarreau [Fri, 7 Feb 2025 16:59:11 +0000 (17:59 +0100)]

BUG/MINOR: debug: make "debug dev sched" accept a negative TID

The TID passed to "debug dev sched" is used to pin the task to a given
thread. A negative value normally means the task is unpinned and goes
to the shared wait queue and run queue. However due to the type of the
variable, negative values were mapped as highly positive values and were
set to the current thread. Let's add the proper cast to fix this.

No backport is needed since this is only used to experiment with the
scheduler and measure its performance.
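
A minimal sketch of this class of bug, not the actual haproxy code: the
function names and the thread-count parameter below are hypothetical. It
shows how a negative TID stored in an unsigned variable is seen as a huge
positive value, and how casting it back to a signed type restores the
"unpinned" meaning.

```c
#include <assert.h>

/* Hypothetical helper: returns the thread a task would be bound to, or -1
 * when it must stay unpinned. Buggy variant: the range check is done on the
 * unsigned value, so (unsigned int)-1 is never seen as negative.
 */
int pick_thread_buggy(unsigned int tid, unsigned int nb_threads, int current)
{
    assert(nb_threads > 0);
    if (tid >= nb_threads)      /* "-1" ends up here */
        return current;
    return (int)tid;
}

/* Fixed variant: the cast restores the sign before the range check. */
int pick_thread_fixed(unsigned int tid, unsigned int nb_threads, int current)
{
    assert(nb_threads > 0);
    if ((int)tid < 0)
        return -1;              /* unpinned: shared wait queue and run queue */
    if (tid >= nb_threads)
        return current;
    return (int)tid;
}
```

With 4 threads and the current thread being 2, the buggy variant maps a TID
of -1 to thread 2 while the fixed one keeps the task unpinned.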

Lukas Tribus [Wed, 5 Feb 2025 07:42:15 +0000 (07:42 +0000)]

DOC: option redispatch should mention persist options

"option redispatch" remains vague in which cases a session would persist;
let's mention "option persist" and "force-persist" as examples so folks
don't draw the conclusion that this may be the default.

Should be backported to stable branches.

Christopher Faulet [Thu, 6 Feb 2025 16:13:50 +0000 (17:13 +0100)]

BUG/MINOR: stats-json: Define JSON_INT_MAX as a signed integer

A JSON integer is defined in the range [-(2**53)+1, (2**53)-1]. Macros are used
to define the minimum and the maximum values. The minimum one is defined using
the maximum one. So JSON_INT_MAX must be defined as a signed integer value to
avoid a wrong cast of JSON_INT_MIN.

It was reported by Coverity in #2841: CID 1587769.

This patch could be backported to all stable versions.
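
The effect can be illustrated with a small sketch. The macro and function
names below are made up for the example and are not the definitions used in
haproxy. Negating an unsigned maximum wraps around to a huge positive value,
so a range check built on it rejects every input.

```c
#include <assert.h>

/* Illustrative names only. */
#define MAX_UNSIGNED ((1ULL << 53) - 1)  /* unsigned: negation wraps around */
#define MIN_UNSIGNED (-MAX_UNSIGNED)     /* 2**64 - (2**53 - 1), not negative */
#define MAX_SIGNED   ((1LL << 53) - 1)   /* signed: negation is negative */
#define MIN_SIGNED   (-MAX_SIGNED)

/* Returns 1 when <v> fits in the JSON integer range, using signed bounds. */
int json_int_in_range(long long v)
{
    return v >= MIN_SIGNED && v <= MAX_SIGNED;
}

/* Same check against the unsigned-derived bounds. The usual arithmetic
 * conversions would turn <v> into an unsigned long long; the casts only
 * make that conversion visible.
 */
int json_int_in_range_buggy(long long v)
{
    return (unsigned long long)v >= MIN_UNSIGNED &&
           (unsigned long long)v <= MAX_UNSIGNED;
}
```

In this sketch, json_int_in_range(0) is true while json_int_in_range_buggy(0)
is false, since no value can be both above the wrapped minimum and below the
maximum.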

Christopher Faulet [Thu, 6 Feb 2025 16:03:35 +0000 (17:03 +0100)]

MINOR: filters: Improve errors formating during filters parsing

The error messages reported by a filter during parsing were displayed between
quotes. It is not really user friendly. So let's remove the quotes here.

Christopher Faulet [Thu, 6 Feb 2025 16:01:08 +0000 (17:01 +0100)]

BUG/MINOR: flt-trace: Support only one name option

When a trace filter is defined, only one 'name' option is expected. But it
was not tested. Thus it was possible to set several names leading to a
memory leak.

It is now tested, and it is not allowed to redefine the trace filter name.

It was reported by Coverity in #2841: CID 1587768.

This patch could be backported to all stable versions.

Christopher Faulet [Thu, 6 Feb 2025 15:52:17 +0000 (16:52 +0100)]

BUG/MINOR: auth: Fix a leak on error path when parsing user's groups

In a userlist section, when a user is parsed, if a specified group is not
found, an error is reported. In this case we must take care to release the
already built groups list.

It was reported by Coverity in #2841: CID 1587770.

This patch could be backported to all stable versions.

Christopher Faulet [Thu, 6 Feb 2025 15:21:20 +0000 (16:21 +0100)]

BUG/MINOR: config/userlist: Support one 'users' option for 'group' directive

When a group is defined in a userlist section, only one 'users' option is
expected. But it was not tested. Thus it was possible to set several options
leading to a memory leak.

It is now tested, and it is not allowed to redefine the users option.

It was reported by Coverity in #2841: CID 1587771.

This patch could be backported to all stable versions.

Christopher Faulet [Thu, 6 Feb 2025 14:37:52 +0000 (15:37 +0100)]

BUG/MINOR: cli: Fix a possible infinite loop in _getsocks()

In the _getsocks() function, when we failed to set the unix socket in
non-blocking mode, a goto to the "out" label led to an infinite loop. To fix
the issue, we must only let the function exit.

This patch should be backported to all stable versions.

Christopher Faulet [Thu, 6 Feb 2025 14:30:30 +0000 (15:30 +0100)]

BUG/MINOR: cli: Fix memory leak on error for _getsocks command

Some errors in parse function of _getsocks commands were not properly handled
and immediately returned, leading to a memory leak on cmsgbuf and tmpbuf
buffers.

To fix the issue, instead of immediately returning -1, we jump to the "out"
label. Returning 1 instead of -1 in that case is valid.

This was reported by Coverity in #2841: CIDs 1587773 and 1587772.

This patch should be backported as far as 2.4.
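
The single exit label pattern described above can be sketched as follows.
This is not the real _getsocks parser: the function, its <fail_step>
parameter and the <freed> counter are hypothetical and only exist to make
the cleanup observable. Only the buffer names come from the message above.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates two buffers and releases both through a single "out" label on
 * every path. <fail_step> simulates an error after the first (1) or second
 * (2) allocation, 0 meaning no error. <freed> counts the released buffers.
 */
int parse_with_cleanup(int fail_step, int *freed)
{
    char *cmsgbuf = NULL;
    char *tmpbuf = NULL;

    assert(freed != NULL);

    cmsgbuf = malloc(64);
    if (!cmsgbuf || fail_step == 1)
        goto out;               /* an early "return -1" here would leak */

    tmpbuf = malloc(64);
    if (!tmpbuf || fail_step == 2)
        goto out;

    /* ... normal processing would happen here ... */
 out:
    if (cmsgbuf) {
        free(cmsgbuf);
        (*freed)++;
    }
    if (tmpbuf) {
        free(tmpbuf);
        (*freed)++;
    }
    return 1;                   /* 1 is returned on error paths too */
}
```

Whatever the failing step, every buffer allocated so far is released before
the function returns.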

Christopher Faulet [Thu, 6 Feb 2025 14:15:27 +0000 (15:15 +0100)]

BUG/MINOR: cli: Don't set SE flags from the cli applet

Since the CLI was updated to use the new applet API, it should no longer
directly set the SE flags. Instead, the corresponding applet flags must be set,
using the applet API (applet_set_*). It is true for the CLI I/O handler but also
for the commands parse function and I/O callback function.

This patch should be backported as far as 3.0.

Christopher Faulet [Wed, 5 Feb 2025 15:04:26 +0000 (16:04 +0100)]

MINOR: mux-spop/mux-fcgi: Add support of the debug string for logs

Now it is possible to have debug info about FCGI and SPOP multiplexers. To do
so, the support for the MUX_SCTL_DBG_STR command was implemented for these
muxes.

To have this log message, the log-format must be set to:

log-format "$HAPROXY_HTTP_LOG_FMT bs=<%[bs.debug_str]>"

Christopher Faulet [Wed, 5 Feb 2025 14:56:36 +0000 (15:56 +0100)]

MINOR: mux-fcgi: Dump info about connections and streams in dedicated functions

fcgi_show_fd() function was split to dump the info about the FCGI
connections and the FCGI streams in dedicated functions, duplicating this
way what is performed in other muxes.

In addition, the FCGI multiplexer now implements the .show_sd callback
function called by "show sess" CLI command.

Christopher Faulet [Wed, 5 Feb 2025 14:45:39 +0000 (15:45 +0100)]

MINOR: tevt/mux-fcgi: Report termination events for the FCGI connect/stream

Termination events are now reported for the FCGI connections and the FCGI
streams. In addition, all available termination events logs are reported in
the "show-fd" callback function. The .ctl and .sctl callback functions were
also updated to support, respectively, MUX_CTL_TEVTS and MUX_SCTL_TEVTS
commands.

Christopher Faulet [Wed, 5 Feb 2025 14:06:45 +0000 (15:06 +0100)]

BUG/MEDIUM: mux-fcgi: Propagate flags to SE in fcgi_strm_wake_one_stream

The commit is flagged as a bug because the same fix on the H2 multiplexer was
reported as a bug. But no issue was reported.

When a stream is explicitly woken up by the FCGI connection, if an error
condition is detected, the corresponding error flag is set on the SE. So
SE_FL_ERROR or SE_FL_ERR_PENDING, depending on whether the end of stream was
reported or not.

However, there is no attempt to propagate other termination flags. We must
be sure to properly set SE_FL_EOI and SE_FL_EOS when appropriate to be able
to switch a pending error to a fatal error.

Because of this bug, the SE could remain with a pending error and no end of
stream, preventing the applicative stream from truly aborting it. It means that
in some abort scenarios, it seems to be possible to block a stream infinitely.

This patch depends on:

* MEDIUM: mux-fcgi: Add a function to propagate termination flags from fstrm to SE
* BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records

This patch could be backported at least as far as 2.8 after a period of
observation. However no bug was reported, so there is no rush.

Christopher Faulet [Wed, 5 Feb 2025 13:28:47 +0000 (14:28 +0100)]

MEDIUM: mux-fcgi: Add a function to propagate termination flags from fstrm to SE

The function fcgi_strm_propagate_term_flags() was added to check the FSTRM
state and evaluate when EOI/EOS/ERR_PENDING/ERROR flags must be set on the
SE. It is not the only place where those flags are set. But it centralizes
the synchro between the FCGI stream and the SC.

For now, this function is only used at the end of fcgi_rcv_buf(). But it
will be used to fix a potential bug.

Christopher Faulet [Tue, 4 Feb 2025 10:54:05 +0000 (11:54 +0100)]

MINOR: mux-spop: Implement .show_sd callback function

The SPOP multiplexer now implements the .show_sd callback function called by
"show sess" CLI command.

Christopher Faulet [Tue, 4 Feb 2025 10:50:17 +0000 (11:50 +0100)]

MINOR: mux-spop: Dump info about connections and streams in dedicated functions

spop_show_fd() function was split to dump the info about the SPOP
connections and the SPOP streams in dedicated functions, duplicating this
way what is performed in other muxes.

Christopher Faulet [Tue, 4 Feb 2025 10:21:23 +0000 (11:21 +0100)]

CLEANUP: mux-spop: Remove useless comments

Just a small cleanup to remove some comments added during the development of
the mux.

Christopher Faulet [Tue, 4 Feb 2025 10:03:31 +0000 (11:03 +0100)]

MINOR: tevt/mux-spop: Report termination events for the SPOP connect/stream

Termination events are now reported for the SPOP connections and the SPOP
streams. In addition, all available termination events logs are reported in
the "show-fd" callback function. The .ctl and .sctl callback functions were
also updated to support, respectively, MUX_CTL_TEVTS and MUX_SCTL_TEVTS
commands.

Christopher Faulet [Tue, 4 Feb 2025 09:58:35 +0000 (10:58 +0100)]

MINOR: mux-spop: Set SPOP_CF_ERROR flag on connection error only

The SPOP_CF_ERROR flag is now set on connection error only. It was also set
on some demux failures. But it is not mandatory because the connection is
closed anyway. And it is handy to have a flag dedicated to tcp connection
error. It was the original purpose of this flag.

This patch could be backported to 3.1 to ease future backports.

Christopher Faulet [Tue, 4 Feb 2025 09:53:20 +0000 (10:53 +0100)]

MINOR: mux-spop: Report EOI on the SE when a ACK is received for a stream

The spop stream now reports the end of input when the ACK is transferred to
the SPOE applet. To do so, the flag SPOP_SF_ACK_RCVD was added. It is set on
the SPOP stream when its ACK is received by the SPOP connection.

In addition when SPOP stream flags are propagated to the SE, the error is
now reported if end of input was not reached instead of testing the
connection error code. It is more accurate.

This patch should be backported to 3.1.

Christopher Faulet [Tue, 4 Feb 2025 09:46:28 +0000 (10:46 +0100)]

MINOR: flt-spoe: Report end of input immediately after applet init

The SPOE applet forwards the message that must be sent to agent during its
init stage. So just after it is created. When it is performed, the end of
input must be reported because no more data will be forwarded. However, it
was performed after receiving the ACK response. It is harmless, but there is
no reason to delay the EOI. It is now fixed.

This patch must be backported to 3.1.

Christopher Faulet [Tue, 4 Feb 2025 17:05:33 +0000 (18:05 +0100)]

BUG/MEDIUM: flt-spoe: Properly handle end of stream from the SPOE applet

The previous fix ("BUG/MEDIUM: applet: Don't pretend to have more data to
handle EOI/EOS/ERROR") revealed an issue with the way the SPOE applet was
reporting the end of stream, leading to the applet never being shut down.

In fact, there are two bugs in one. The first one is about the applet
shutdown. Since the fix above, the applet is no longer closed. Before, it
was closed because it was reported in error. But now, it is just delayed
because the applet and the SPOP stream are declared to support half close
connections. So the applet is only closed when the SPOP connection is
closed. To fix this bug, both sides are now stating that half close
connections are not supported.

The second bug is about the way the end of stream is reported. It is
reported when the ACK response is received. But it is too early, because the
parent stream must process the response first. So now, we take care to have
processed the ACK from the parent applet before reporting an end of stream.

This patch must be backported with the commit above to 3.1.

Christopher Faulet [Tue, 4 Feb 2025 08:20:36 +0000 (09:20 +0100)]

BUG/MEDIUM: applet: Don't pretend to have more data to handle EOI/EOS/ERROR

The way appctx EOI/EOS/ERROR flags were reported for applets using the new
API was to state the applet had more data to deliver. But it was not
correct and for APPCTX_FL_EOS, this led to report an error on the SE because
it is not expected. More data to deliver and an end of stream is an
impossible situation.

This was added as a fix by commit b8ca114031 ("BUG/MEDIUM: applet: State
appctx have more data if its EOI/EOS/ERROR flag is set"), mainly to make the
SPOE applet work.

When an applet sets one of these flags, it really means it has no more data
to deliver. So we must not try to trigger a new receive to handle these
flags. Instead we must handle them directly in task_process_applet()
function and only if the corresponding SE flags were not already set.

This patch must be backported to 3.1.

Christopher Faulet [Tue, 4 Feb 2025 09:42:19 +0000 (10:42 +0100)]

BUG/MEDIUM: flt-spoe: Set/test applet flags instead of SE flags from I/O handler

The SPOE applet is using the new applet API. Thus end of input, end of
stream and errors must be reported using the applet flags, not the SE
flags. This was not the case. So let's fix it.

It seems this bug is harmless for now.

This patch must be backported to 3.1.

Christopher Faulet [Tue, 4 Feb 2025 07:21:06 +0000 (08:21 +0100)]

BUG/MINOR: tevt/mux-h2: Set truncated receive/eos events at SE level on error

When receive or EOS termination events are reported at the SE level, a
truncation was erroneously reported when no error was detected. Of course, it
must be the opposite.

No backport needed.

Frederic Lecaille [Thu, 6 Feb 2025 09:48:25 +0000 (10:48 +0100)]

BUILD: ssl: remove a boringssl definition defined by recent boringssl libs

This is the case for AWS-LC which derives from boringssl, where
X509_OBJECT_get0_X509_CRL() is already defined. There is definitely
no more need to define this function to build haproxy against TLS libs derived
from boringssl.

Christopher Faulet [Mon, 3 Feb 2025 17:36:17 +0000 (18:36 +0100)]

BUG/MINOR: http-check: Don't pretend a C-L heeader is set before adding it

When a GET/HEAD/OPTIONS/DELETE healthcheck request was formatted, we claimed
there was a "content-length" header set even when there was no payload,
leading to actually send a "content-length: 0" header to the server. It was
unexpected and could be rejected by servers.

When a healthcheck request is sent we must take care to state there is a
"content-length" header only when it is explicitly added.

This patch should fix the issue #2851. It must be backported as far as 2.9.

Aurelien DARRAGON [Thu, 30 Jan 2025 12:26:42 +0000 (13:26 +0100)]

MEDIUM: stream: interrupt costly rulesets after too many evaluations

It is not rare to see configurations with a large number of "tcp-request
content" or "http-request" rules for instance. A large number of rules
combined with cpu-demanding actions (e.g.: actions that work on content)
may create thread contention as all the rules from a given ruleset are
evaluated under the same polling loop if the evaluation is not interrupted.

Thus, in this patch we add extra logic around "tcp-request content",
"tcp-response content", "http-request" and "http-response" rulesets, so
that when a certain number of rules are evaluated under the single polling
loop, we force the evaluating function to yield. As such, the rule which
was about to be evaluated is saved, and the function starts evaluating
rules from the saved pointer when it returns (in the next polling loop).

We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so
that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG
is mandatory here because process_stream() expects TASK_WOKEN_MSG for
explicit analyzers re-evaluation.

rules_bcount stream's attribute was added to count how many rules were
evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag
was added to know that the s->current_rule was assigned due to forced
yield and not regular yield.

By default haproxy will enforce a yield every 50 rules, this behavior
can be configured using the "tune.max-rules-at-once" global keyword.

There is a limitation though: for now, if the ACT_OPT_FINAL flag is set
on act_opts, we consider it is not safe to yield (as it is already the
case for automatic yield). In this case instead of yielding and taking
the risk of not being called back, we skip the yield and hope it will
not create contention. This is something we should ideally try to
improve in order to yield in all conditions.
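
The forced yield mechanism can be sketched as below. It is a simplified
model under stated assumptions, not the code from this patch: the structures
and the eval_rules() function are hypothetical, the saved rule is an index
rather than a pointer, and the task wakeup is left to the caller. The field
names only echo rules_bcount and SF_RULE_FYIELD, and the limit mirrors the
default of 50 mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RULES_AT_ONCE 50

struct rule { int id; };

struct eval_ctx {
    size_t current_rule;  /* where to resume, like a saved rule pointer */
    int rules_bcount;     /* rules evaluated since the last yield */
    int forced_yield;     /* set when the interruption is a forced yield */
};

/* Evaluates rules starting at ctx->current_rule. Returns 1 when the whole
 * ruleset was evaluated, or 0 when the caller must reschedule itself and
 * call the function again to resume.
 */
int eval_rules(const struct rule *rules, size_t nb, struct eval_ctx *ctx)
{
    assert(rules != NULL && ctx != NULL);
    ctx->forced_yield = 0;
    while (ctx->current_rule < nb) {
        if (ctx->rules_bcount >= MAX_RULES_AT_ONCE) {
            /* budget exhausted: keep current_rule and give the CPU back */
            ctx->rules_bcount = 0;
            ctx->forced_yield = 1;
            return 0;
        }
        (void)rules[ctx->current_rule].id;  /* the rule's action would run here */
        ctx->current_rule++;
        ctx->rules_bcount++;
    }
    ctx->current_rule = 0;
    ctx->rules_bcount = 0;
    return 1;
}
```

With 120 rules, this sketch needs three calls to complete: two of them stop
after 50 rules and the last one evaluates the remaining 20.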

Christopher Faulet [Mon, 3 Feb 2025 14:31:57 +0000 (15:31 +0100)]

BUG/MINOR: tcp-rules: Don't forward close during tcp-response content rules eval

When the tcp-response content ruleset evaluation is delayed because of an
ACL condition, the close forwarding on the client side is not explicitly
blocked. So it is possible to close the client side before the end of the
response evaluation.

To fix the issue, this is now done in all cases where some data are
missing. Concretely, channel_dont_close() is called in "missing_data" goto
label.

Note it is only a theoretical bug (or pending bug). It is not possible to
trigger it for now because an ACL cannot wait for more data when a close was
received. But the code remains a bit weak. It is safer this way. It is
especially mandatory for the "force yield" option that should be added soon.

This patch could be backported to all stable versions.

Christopher Faulet [Mon, 3 Feb 2025 07:48:30 +0000 (08:48 +0100)]

DEBUG: mux-h1: Remove some debug counters

Several debug counters were added to debug a strange issue about early
aborts. Most of them are now useless, especially because it is now possible
to rely on the termination events logs. So, it is better to remove them.

Note that these counters are still there in 3.1.

Christopher Faulet [Mon, 3 Feb 2025 07:28:43 +0000 (08:28 +0100)]

DEBUG: http-ana: Remove debug counters from HTTP analyzers

Several debug counters were added in HTTP analyzers to help debugging a
strange issue about early aborts. But these counters are a bit overkill
now. Especially because it is now possible to rely on the termination event
log. So just remove them.

Note that these counters are still there in 3.1.

Christopher Faulet [Mon, 3 Feb 2025 07:20:40 +0000 (08:20 +0100)]

BUG/MINOR: tevt/http-ana: Remove badly placed event reports

When specific events for the stream location were added, some reports about
message interception were not removed. These reports are now removed.

No need to backport.

Christopher Faulet [Mon, 27 Jan 2025 14:18:14 +0000 (15:18 +0100)]

BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records

A Read0 event could be ignored by the FCGI multiplexer if it is blocked on a
partial record. Instead of handling the event, it remained blocked, waiting
for the end of the record.

To fix the issue, the same solution as for the H2 multiplexer is used. Two
flags are introduced. The first one, FCGI_CF_END_REACHED, is used to
acknowledge a read0. This flag is set when a read0 was received AND the FCGI
multiplexer must handle it. The second one, FCGI_CF_DEM_SHORT_READ, is set
when the demux is interrupted on a partial record. A short read and a read0
lead to setting the FCGI_CF_END_REACHED flag.

With these changes, the FCGI mux should be able to properly handle read0 on
partial records.

This patch should be backported to all stable versions after a period of
observation.
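
The interaction between the two flags can be sketched as follows. Only the
flag names come from this patch. The flag values, the ack_read0() helper and
the rule that an empty demux buffer also allows the acknowledgement are
assumptions made for the illustration.

```c
#include <assert.h>

/* Flag names from the commit message; the values are arbitrary here. */
#define FCGI_CF_DEM_SHORT_READ 0x1u
#define FCGI_CF_END_REACHED    0x2u

/* Hypothetical helper deciding whether a received read0 can be acknowledged.
 * <buffered> is the amount of unparsed input left in the demux buffer.
 */
unsigned int ack_read0(unsigned int flags, int read0, unsigned int buffered)
{
    if (!read0)
        return flags;
    /* nothing left to parse, or the demux is stuck on a partial record */
    if (!buffered || (flags & FCGI_CF_DEM_SHORT_READ))
        flags |= FCGI_CF_END_REACHED;
    return flags;
}
```

In this model, a read0 received while complete records are still pending is
not acknowledged yet, whereas a read0 combined with a short read is.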

William Lallemand [Fri, 31 Jan 2025 14:31:00 +0000 (15:31 +0100)]

MEDIUM: htx: prevent <mark> to copy incomplete headers in htx_xfer_blks()

Prevent a partial copy of trailers or headers when using the <mark>
parameter.

When using htx_xfer_blks(), transferring partial headers or trailers is
prevented when restricted by the <count> parameter. However using the
<mark> parameter still allows it.

This patch changes the behavior by checking the <mark> type only after
checking the headers/trailers type, so we can still rollback on partial
transfer.

No impact on the current code, which does not try to do that yet.

Amaury Denoyelle [Thu, 30 Jan 2025 15:24:10 +0000 (16:24 +0100)]

BUILD: quic: remove GCC undefined error in qc_release_lost_pkts()

Every once in a while, GCC reports issues with qc_release_lost_pkts()
function. It seems that its static analysis is foiled by the code
structuring. The latest warning reports the following issue :

  CC      src/quic_loss.o
src/quic_loss.c: In function ‘qc_release_lost_pkts’:
src/quic_loss.c:313:58: error: potential null pointer dereference [-Werror=null-dereference]
  313 |                         unsigned int period = newest_lost->time_sent_ms - oldest_lost->time_sent_ms;
      |                                               ~~~~~~~~~~~^~~~~~~~~~~~~~

To fix this definitively, slightly change the code. <oldest_lost> and
<newest_lost> are now initialized on the first list entry outside of the
loop. This is enough to guarantee to GCC that they cannot be NULL for
the remainder of the function.

William Lallemand [Fri, 31 Jan 2025 14:23:47 +0000 (15:23 +0100)]

DOC: htx: clarify <mark> parameter for htx_xfer_blks()

Clarify the fact that the first <mark> block is transferred before
stopping when using htx_xfer_blks().

William Lallemand [Fri, 31 Jan 2025 13:41:28 +0000 (14:41 +0100)]

BUG/MEDIUM: htx: wrong count computation in htx_xfer_blks()

When transferring blocks from an src to another dst htx representation,
htx_xfer_blks() subtracts the size of each transferred block from the <count>
value passed in parameter, so it can't transfer more than <count>. The
size must also contain the metadata, represented by a simple
sizeof(struct htx_blk).

However, the code was doing a sizeof(dstblk) instead of a
sizeof(*dstblk), which has the consequence of subtracting only the size of
a pointer from count. Fortunately the htx_blk size is 64 bits, so that does
not cause any problem on 64-bit architectures. But on 32-bit architectures,
the count value is not decreased correctly and the function could try to
transfer more blocks than allowed by the count parameter.

Must be backported in every stable release.
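
The difference between the two sizeof forms can be shown with a small
sketch. The structure below is a stand-in with the same 8-byte size, not the
real struct htx_blk, and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an 8-byte block descriptor. */
struct blk {
    uint32_t addr;
    uint32_t info;
};

/* Cost charged against <count> for one block: payload plus metadata.
 * Buggy variant: sizeof applied to the pointer, so the metadata part is
 * 4 bytes on a 32-bit target and 8 bytes on a 64-bit one.
 */
size_t blk_cost_buggy(const struct blk *dstblk, size_t payload)
{
    assert(dstblk != NULL);
    return payload + sizeof(dstblk);
}

/* Fixed variant: sizeof applied to the pointed structure, always 8 bytes. */
size_t blk_cost_fixed(const struct blk *dstblk, size_t payload)
{
    assert(dstblk != NULL);
    return payload + sizeof(*dstblk);
}
```

On a 64-bit build both functions return the same value, which is why the
bug went unnoticed there.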

Christopher Faulet [Thu, 30 Jan 2025 15:19:48 +0000 (16:19 +0100)]

MINOR: tevt/dev: Parse tuple of termination events

term_events tool is now able to parse tuples of termination events, as returned
by the "term_events" sample fetch function.

Christopher Faulet [Thu, 30 Jan 2025 10:31:59 +0000 (11:31 +0100)]

MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events

Connection errors can be detected via the connect/recv/send syscalls, but also
because they were reported by the poller. So dedicated events, at the FD level,
are introduced to tell them apart.

term_events tool was updated accordingly.

Christopher Faulet [Tue, 21 Jan 2025 17:25:25 +0000 (18:25 +0100)]

MINOR: tevt/dev: Add term_events tool

This development tool can be used to convert a string representing a
termination events log to its human readable representation. Several strings
may be converted at a time. To do so, several arguments can be specified on
the command line or they can be provided on STDIN, using the "-" argument.

Here is an example:

  > term_events f2x2f4x4 m2m4m1 e2e1 s2s1S1 E1 M1 F1
  ### f2x2f4x4 : fd:shutr > xprt:shutr > fd:snd_err > xprt:snd_err
  ### m2m4m1   : muxc:shutr > muxc:snd_err > muxc:shutw
  ### e2e1     : se:eos > se:shutw
  ### s2s1S1   : strm:eos > strm:shutw > STRM:shutw
  ### E1       : SE:shutw
  ### M1       : MUXC:shutw
  ### F1       : FD:shutw

The make target "dev/term_events/term_events" must be used to compile it.

Christopher Faulet [Tue, 21 Jan 2025 17:19:50 +0000 (18:19 +0100)]

REORG: tevt/connection: Move enums at the end of the header file

Enums used to report events were placed in the connection header for
convenience. But they are not specifically related to connections. So, they are
moved at the end of the file to have a better isolation.

Christopher Faulet [Tue, 21 Jan 2025 17:16:27 +0000 (18:16 +0100)]

MINOR: tevt: Improve function to convert a termination events log to string

The function is now responsible for handling an empty log, when no event was
reported. In that case, an empty string is returned. It is also responsible for
handling the case where the termination events log is not supported for a given
entity (for instance the quic mux for now). In that case, a dash ("-") is returned.

Christopher Faulet [Tue, 21 Jan 2025 06:46:17 +0000 (07:46 +0100)]

MINOR: tevt: Add a sample to get termination events for all locations

"term_events" is a sample fetch function that can be used to get
termination events for all locations in one call. The format is equivalent to:

{fc_term_events,fc_mux_term_events,fs.term_events,txn.term_events,bs.term_events,bc_mux_term_events,bc_term_events}

If no event was reported for a location, the field is empty. If the feature
is not supported yet, a dash ('-') is printed.

Christopher Faulet [Tue, 21 Jan 2025 06:41:33 +0000 (07:41 +0100)]

MINOR: tevt/applet: Add limited support for termination event logs for applets

There is no termination events log for applets but events for the SE location
are filled when the endpoint is an applet. Most of them rely on the new
applet API. Only a few events are reported for legacy applets.

Christopher Faulet [Mon, 20 Jan 2025 14:35:47 +0000 (15:35 +0100)]

MINOR: tevt: Don't duplicate termination event during reporting

It is hard to never detect the same event several times without painful
tests. In other words, the same termination event can be reported several
times and this must be handled. To do so, the "tevt_report_event" macro is
updated to ignore an event if the last reported one is of the same type, for
the same location. Of course, if the same event is reported several times at
different moments, it will not be detected.
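
A minimal sketch of this deduplication follows. The log layout is assumed
for the example only (a 32-bit word holding up to four events, newest in the
low byte, each byte combining a 4-bit location and a 4-bit type) and
report_event() is a hypothetical function, not the tevt_report_event macro.

```c
#include <assert.h>
#include <stdint.h>

/* Appends an event to <log> unless the last reported one has the same type
 * for the same location. Returns the updated log.
 */
uint32_t report_event(uint32_t log, unsigned int loc, unsigned int type)
{
    uint8_t evt = (uint8_t)(((loc & 0xf) << 4) | (type & 0xf));

    assert((type & 0xf) != 0);  /* 0 is reserved for "no event" in this sketch */
    if ((log & 0xff) == evt)    /* duplicate of the last reported event */
        return log;
    return (log << 8) | evt;    /* the oldest event falls off the top */
}
```

Reporting (1,2) twice in a row stores it once, while reporting (1,2), (1,4)
and then (1,2) again stores three events, since only the last one is
compared.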

Christopher Faulet [Mon, 20 Jan 2025 08:00:05 +0000 (09:00 +0100)]

MEDIUM: tevt/stconn/stream: Add dedicated termination events for stream location

This is the last patch to introduce dedicated termination events for each
location. In this one, events for the stream location are introduced. The old
enum is also removed because it is now unused.

Here, more accurate events are added. The "intercepted" event was split.

Christopher Faulet [Mon, 20 Jan 2025 07:57:55 +0000 (08:57 +0100)]

MINOR: tevt/stconn: Be more accurate to report shutw events

In se_shutdown() a SE termination event is reported while the shutw stream
event is reported in sc_app_shut_conn().

Christopher Faulet [Mon, 20 Jan 2025 07:52:46 +0000 (08:52 +0100)]

MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations

Termination events dedicated to mux connection and stream-endpoint
descriptors are added in this patch. Specific events to these locations are
thus added. Changes for the H1 and H2 multiplexers are reviewed to be more
accurate.

Christopher Faulet [Mon, 20 Jan 2025 07:35:36 +0000 (08:35 +0100)]

MINOR: tevt/connection: Add dedicated termination events for lower locations

To be able to add more accurate termination events for each location, the
enum will be split by location. Indeed, there are at most 16 possible
events. It would be pretty confusing to use the same termination events for
the different locations. So the best is to split them.

In this patch, the termination events for the fd, hs and xprt locations are
introduced. For now some holes are added to keep similar events aligned
across enums. But this may change in future.

Christopher Faulet [Mon, 23 Dec 2024 14:48:36 +0000 (15:48 +0100)]

MINOR: tevt/mux-pt: Add support for termination event logs

A termination events log is added to the mux-pt context and appropriate
events are reported for the muxc location. There are no SE events for this
mux.

Christopher Faulet [Mon, 23 Dec 2024 13:56:33 +0000 (14:56 +0100)]

MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs

MUX_CTL_TEVTS command is added to get the termination event logs of a mux
connection and MUX_SCTL_TEVTS command to get the termination event logs of a
mux stream.

Christopher Faulet [Mon, 23 Dec 2024 13:52:15 +0000 (14:52 +0100)]

MINOR: tevt/mux-h1/mux-h2: Add termination events log when dumping mux info

The termination events logs of the multiplexer connection and stream are now
dumped when the corresponding mux info is dumped. The termination events log of
the underlying connection is also dumped in the debug string.

Christopher Faulet [Mon, 23 Dec 2024 13:39:58 +0000 (14:39 +0100)]

MINOR: tevt/conn: Report intercepted event for L4 rules

When an L4 rule interrupts the processing, a termination event is reported
for the connection, with the "fd" location.

Christopher Faulet [Mon, 23 Dec 2024 13:30:33 +0000 (14:30 +0100)]

MINOR: tevt/stream/stconn: Report termination events for stream and sc

In this patch, events for the stream location are reported. These events are
first reported on the corresponding stream-connector. So front events on scf
and back events on scb. Then all events are merged in the stream. But
only 4 events are saved on the stream.

Several internal events are for now grouped with the type
"tevt_type_intercepted". More events will be added to have a better
resolution. But at least the place to report these events are identified.

For now, when an event is reported on a SC, it is also reported on the stream
and vice versa.

Christopher Faulet [Mon, 23 Dec 2024 13:25:05 +0000 (14:25 +0100)]

MINOR: tevt/mux-h2: Report termination events for the H2C

shutdown for reads (read0), receive errors, shutdown for writes and timeouts
are reported, but only for the H2 connection for now.

As for the H1 multiplexer, more events must be added to report protocol
errors, goaways and rst-streams. And of course, all events for the H2
streams must be reported too.

commit | commitdiff | tree

Christopher Faulet [Mon, 23 Dec 2024 10:27:25 +0000 (11:27 +0100)]

MINOR: tevt/mux-h1: Report termination events for the H1C and H1S

shutdown for reads (read0), receive errors, shutdown for writes and timeouts
are reported. It is not too hard to know where to report events generated by
HAProxy (timeouts and shutw). For detected events (shutr and receive error),
it is not so simple. These events must not be reported when they are
detected but when the mux can handle them. For instance, some unprocessed
input data may block a read0. So experience will tell us if these
events are reported at the right time and under the right conditions.

For now, no internal errors (parsing errors, protocol errors, internal
errors...) are reported because these event types have not yet been added.

commit | commitdiff | tree

Christopher Faulet [Mon, 23 Dec 2024 10:24:29 +0000 (11:24 +0100)]

MINOR: tevt/stconn: Add a termination events log in the SE descriptor

This termination events log will be used to report events from the mux
streams. The location will be "tevt_loc_se" and the muxes will be
responsible to report the corresponding events.

commit | commitdiff | tree

Christopher Faulet [Mon, 23 Dec 2024 09:43:54 +0000 (10:43 +0100)]

MINOR: tevt: Add the termination events log's foundations

Termination events logs will be used to report the events that led to the
closing of a connection. Unlike flags, which reflect a state, the idea here is
to store a log to preserve the order of the events. Most of the time, when
debugging an issue, the order of the events is crucial to understand the root
cause of the issue. The traces are truly helpful to do so. But it is not
always possible to activate them because they are pretty verbose. On heavily
loaded platforms, it is not acceptable. We hope that the termination events
logs will help us in these situations.

One termination events log will be stored at each layer (connection, mux
connection, mux stream...) as a 32-bit integer. Each event will be stored on
8 bits, 4 bits for the location and 4 bits for the type. So only the first
four events will be stored for each layer. It should be enough to understand
why a connection is closed.

In this patch, the enums defining the termination event locations and types
are added. The macro to report a new event is also added, as well as a
function to convert a termination events log to a string that could be
displayed in log messages for instance.
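
The layout described above (one 32-bit log per layer, 8 bits per event, 4 bits
of location and 4 bits of type, first four events kept) can be illustrated with
the minimal sketch below. All names and numeric values (ex_tevt_report(),
ex_tevt_to_str(), EX_LOC_*, EX_TYPE_*) are hypothetical, and the slot order
(oldest event in the low byte) is an assumption made for the example. It is not
the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical locations and types. Each one fits on 4 bits, and locations
 * are non-zero so that an empty slot (0x00) differs from a stored event.
 */
enum ex_tevt_loc  { EX_LOC_FD = 1, EX_LOC_CONN = 2, EX_LOC_MUXC = 3, EX_LOC_SE = 4, EX_LOC_STRM = 5 };
enum ex_tevt_type { EX_TYPE_SHUTW = 1, EX_TYPE_SHUTR = 2, EX_TYPE_RCV_ERR = 3, EX_TYPE_TIMEOUT = 4 };

/* Append one event to a 32-bit log and return the updated log. Slot 0 (low
 * byte) holds the oldest event. Once the four slots are used, the log is
 * returned unchanged, so only the first four events are kept.
 */
static inline uint32_t ex_tevt_report(uint32_t log, unsigned int loc, unsigned int type)
{
	uint32_t evt = ((loc & 0xfu) << 4) | (type & 0xfu);
	int slot;

	assert(evt != 0);
	for (slot = 0; slot < 4; slot++) {
		if (((log >> (8 * slot)) & 0xffu) == 0)
			return log | (evt << (8 * slot));
	}
	return log;
}

/* Render the log as "loc:type" pairs in hexadecimal, oldest first, separated
 * by spaces. <size> must be at least 16 (15 characters plus the trailing
 * zero). Returns the number of characters written.
 */
static inline int ex_tevt_to_str(uint32_t log, char *buf, size_t size)
{
	int slot, len = 0;

	assert(size >= 16);
	buf[0] = 0;
	for (slot = 0; slot < 4; slot++) {
		unsigned int evt = (log >> (8 * slot)) & 0xffu;

		if (!evt)
			break;
		len += snprintf(buf + len, size - len, "%s%x:%x",
		                slot ? " " : "", evt >> 4, evt & 0xfu);
	}
	return len;
}
```

With these hypothetical values, reporting a shutr on the mux connection and
then a timeout on the stream gives the log 0x5432, rendered as "3:2 5:4". The
order of the events is preserved, which is the point of using a log instead of
flags.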

commit | commitdiff | tree

Christopher Faulet [Mon, 23 Dec 2024 10:42:08 +0000 (11:42 +0100)]

BUG/MINOR: mux-h1: Only report a SE error on demux error

When a demux error is reported by the H1S, an error must be reported on the
SE and not an end-of-input or an end-of-stream. So SE_FL_ERROR flag must be
set and not SE_FL_EOI/SE_FL_EOS.

It seems this bug has no impact. So there is no reason to backport it.

commit | commitdiff | tree

Christopher Faulet [Mon, 23 Dec 2024 10:20:35 +0000 (11:20 +0100)]

MINOR: mux-h1: Add masks to group H1S DEMUX and MUX errors

It is just a small patch to clean up mux/demux functions. Instead of listing
the H1S errors that must be handled during demux or mux operations, masks of
flags are used. It is more readable.
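
As an illustration of the idea, here is a minimal sketch of grouping error
flags behind masks. The flag names and values (EX_H1S_F_*) and the helpers are
hypothetical and do not reproduce the real H1S flags.

```c
/* Hypothetical individual error flags */
#define EX_H1S_F_PARSING_ERROR     0x0001u
#define EX_H1S_F_NOT_IMPL_ERROR    0x0002u
#define EX_H1S_F_PROCESSING_ERROR  0x0004u
#define EX_H1S_F_INTERNAL_ERROR    0x0008u

/* Masks grouping the errors relevant to each direction (assumed grouping) */
#define EX_H1S_F_DEMUX_ERROR  (EX_H1S_F_PARSING_ERROR | EX_H1S_F_NOT_IMPL_ERROR | EX_H1S_F_INTERNAL_ERROR)
#define EX_H1S_F_MUX_ERROR    (EX_H1S_F_PROCESSING_ERROR | EX_H1S_F_INTERNAL_ERROR)

/* One test against the mask replaces a list of per-flag tests */
static inline int ex_h1s_demux_failed(unsigned int flags)
{
	return !!(flags & EX_H1S_F_DEMUX_ERROR);
}

static inline int ex_h1s_mux_failed(unsigned int flags)
{
	return !!(flags & EX_H1S_F_MUX_ERROR);
}
```

Adding a new error type then only requires updating the mask definition, not
every place where the errors are tested.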

commit | commitdiff | tree

Willy Tarreau [Thu, 30 Jan 2025 15:32:35 +0000 (16:32 +0100)]

MEDIUM: epoll: skip reports of stale file descriptors

Now that we can see that some events are reported for older instances
of a file descriptor, let's skip these ones instead of reporting
dangerous events on them. It might possibly qualify as a bug if it
helps fixing strange issues in certain environments, in which case it
can make sense to backport it along with the following recent patches:

  DEBUG: fd: add a counter of takeovers of an FD since it was last opened
  MINOR: fd: add a generation number to file descriptors
  DEBUG: epoll: store and compare the FD's generation count with reported event

commit | commitdiff | tree

Willy Tarreau [Thu, 30 Jan 2025 15:28:33 +0000 (16:28 +0100)]

DEBUG: epoll: store and compare the FD's generation count with reported event

There have been some reported cases where races between threads in epoll
were causing wrong reports of close or error events. Since the epoll_event
data is 64 bits, we can store the FD's generation counter in the upper
bits to verify if we're speaking about the same instance of the FD as the
current one or a stale one. If the generation number does not match, then
we classify these into 3 conditions and increment the relevant COUNT_IF()
counters (stale report for closed FD, stale report of harmless event on
reopened FD, stale report of HUP/ERR on reopened FD). Tests have shown that
with heavy concurrency, a very small maxconn (typically 1 per thread),
http-reuse always and a server closing connections first but randomly
(httpterm with /C=2r), such events can happen at a pace of a few per second
for the closed FDs, and a few per minute for the other ones, so there's value
in leaving this accessible for troubleshooting. E.g. after a few minutes:

  Count     Type Location function(): "condition" [comment]
  5541       CNT ev_epoll.c:296 _do_poll(): "1" [epoll report of event on a just closed fd (harmless)]
  10         CNT ev_epoll.c:294 _do_poll(): "1" [epoll report of event on a closed recycled fd (rare)]
  42         CNT ev_epoll.c:289 _do_poll(): "1" [epoll report of HUP on a stale fd reopened on the same thread (suspicious)]
  212        CNT ev_epoll.c:279 _do_poll(): "1" [epoll report of HUP/ERR on a stale fd reopened on another thread (harmless)]
  1          CNT mux_h1.c:3911 h1_send(): "b_data(&h1c->obuf)" [connection error (send) with pending output data]

This was obtained with the following setup, which abuses thread contention by
starting 64 threads on two cores:
- config:
    global
        nbthread 64
        stats socket /tmp/sock1 level admin
        stats timeout 1h
    defaults
        timeout client 5s
        timeout server 5s
        timeout connect 5s
        mode http
    listen p2
        bind :8002
        http-reuse always
        server s1 127.0.0.1:8000 maxconn 4

- haproxy forcefully started on 2C4T:

    $ taskset -c 0,1,4,5 ./haproxy -db -f epoll-dbg.cfg

- httpterm on port 8000, cpus 2,3,6,7 (2C4T)

- h1load with responses larger than a single buffer, and randomly
  closing/keeping alive:

    $ taskset -c 2,3,6,7 h1load -e -t 4 -c 256 -r 1 0:8002/?s=19k/C=2r
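
The generation check described in this commit can be sketched as follows. This
is only an illustration under assumptions: the 32/32 bit split, the
ex_fd_state structure and every ex_* name are hypothetical. The only real
element it relies on is that the data associated with an epoll event is 64
bits wide (the u64 member of epoll_data_t).

```c
#include <stdint.h>

/* Hypothetical per-FD state, loosely modelled on an fdtab entry */
struct ex_fd_state {
	uint32_t generation; /* incremented on each close() of this FD number */
	int      is_open;    /* non-zero while the FD is in use */
};

enum ex_report {
	EX_REPORT_CURRENT,        /* generations match: deliver the event */
	EX_REPORT_STALE_CLOSED,   /* stale report, FD is currently closed */
	EX_REPORT_STALE_HARMLESS, /* stale report of a harmless event on a reopened FD */
	EX_REPORT_STALE_HUP_ERR,  /* stale report of HUP/ERR on a reopened FD */
};

/* Value to register with the poller: generation in the upper 32 bits, FD
 * number in the lower 32 bits (assumed split).
 */
static inline uint64_t ex_pack(int fd, uint32_t gen)
{
	return ((uint64_t)gen << 32) | (uint32_t)fd;
}

static inline int ex_unpack_fd(uint64_t data)
{
	return (int)(uint32_t)data;
}

static inline uint32_t ex_unpack_gen(uint64_t data)
{
	return (uint32_t)(data >> 32);
}

/* Compare the generation carried by the event with the current one and
 * classify the report into one of the conditions listed in the commit.
 */
static inline enum ex_report ex_classify(const struct ex_fd_state *tab,
                                         uint64_t data, int is_hup_or_err)
{
	const struct ex_fd_state *st = &tab[ex_unpack_fd(data)];

	if (ex_unpack_gen(data) == st->generation)
		return EX_REPORT_CURRENT;
	if (!st->is_open)
		return EX_REPORT_STALE_CLOSED;
	return is_hup_or_err ? EX_REPORT_STALE_HUP_ERR : EX_REPORT_STALE_HARMLESS;
}
```

In this sketch each stale class would increment its own counter, which is what
makes the per-condition lines of the dump above distinguishable.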

commit | commitdiff | tree

Willy Tarreau [Thu, 30 Jan 2025 15:25:40 +0000 (16:25 +0100)]

MINOR: fd: add a generation number to file descriptors

This patch adds a counter of close() on file descriptors in the fdtab.
The goal is to better detect if reported events concern the current or
a previous file descriptor. For now the counter is only added, and is
shown in "show fd" as "gen". We're reusing unused space at the end of
the struct. If it's needed for something more important later, this
patch can be reverted.

commit | commitdiff | tree

Willy Tarreau [Thu, 30 Jan 2025 14:59:11 +0000 (15:59 +0100)]

DEBUG: fd: add a counter of takeovers of an FD since it was last opened

That's essentially in order to help with debugging strange cases like
the occasional epoll issues/races, by keeping a counter of how many
times an FD was taken over since last inserted. The room is available
so let's use it. If it's needed later, this patch can easily be reverted.
The counter is also reported in "show fd" as "tkov".

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 17:01:53 +0000 (18:01 +0100)]

BUILD: quic: fix overflow in global tune

A new global option was recently introduced to disable pacing. However,
the value used (1<<31) caused issues with some compilers as the options
field used for storage is declared as int. Move the pacing deactivation
flag into the newly defined quic_tune to fix this.

This should be backported up to 3.1 after a period of observation. Note
that it relies on the previous patch which defined the new quic_tune type.
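
For context, a minimal sketch of the issue and of the approach taken is given
below. The names (ex_global_tune, ex_quic_tune, EX_QUIC_TUNE_NO_PACING) are
hypothetical and only mimic the shape of the change. The underlying fact is
that with a 32-bit int, 1<<31 does not fit in the signed type, hence the
complaints of some compilers.

```c
/* Hypothetical existing storage: a signed int bit field, where bit 30 is the
 * highest flag that can be expressed as (1 << n) without overflowing a
 * 32-bit int.
 */
struct ex_global_tune {
	int options;
};
#define EX_GTUNE_LAST_SAFE_FLAG  (1 << 30)

/* Hypothetical dedicated storage for QUIC tunables: its own structure with
 * an unsigned field, so the flag values start again from bit 0.
 */
struct ex_quic_tune {
	unsigned int options;
};
#define EX_QUIC_TUNE_NO_PACING  0x00000001u

/* Pacing is enabled by default and only turned off by the flag */
static inline int ex_quic_pacing_enabled(const struct ex_quic_tune *qt)
{
	return !(qt->options & EX_QUIC_TUNE_NO_PACING);
}
```

Using a separate structure avoids both the overflow and the pressure on the
remaining bits of the generic options field.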

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 16:58:20 +0000 (17:58 +0100)]

MINOR: quic: define quic_tune

Define a new structure quic_tune. It will be useful to regroup various
configuration settings and tunable related to QUIC, instead of defining
them into the global structure.

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 13:57:27 +0000 (14:57 +0100)]

MINOR: quic: mark BBR as stable

Pacing has recently been moved out of experimental status and is
activated by default. This is a mandatory requirement for BBR.
Furthermore, BBR is now considered stable. As such, remove its
experimental status with this commit.

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 13:56:35 +0000 (14:56 +0100)]

MAJOR: quic: mark pacing as stable and enable it by default

Remove pacing experimental status, so it's not required anymore to use
expose-experimental-directives to enable it.

Along with this change, pacing is now activated by default. As such, pacing
configuration is transformed into its final form. The global on/off
setting is turned into a disable setting without argument.

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 13:50:19 +0000 (14:50 +0100)]

MINOR: quic: transform pacing settings into a global option

Pacing support was previously activated on each bind line individually,
via an optional argument of quic-cc-algo keyword. Remove this optional
argument and introduce a global setting to enable/disable pacing. Pacing
activation is still flagged as experimental.

One important change is that previously BBR usage automatically
activated pacing support. This is not the case anymore, so users should
now always explicitly activate pacing if BBR is selected. A new warning
message will be displayed if this is not the case.

Another consequence of this change is that now pacing_inter callback is
always defined for every quic_cc_algo types. As such, QUIC MUX uses
global.tune.options to determine if pacing is required.

This should be backported up to 3.1, after a period of observation.

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 10:58:27 +0000 (11:58 +0100)]

MINOR: quic: allow BBR testing without pacing

Pacing is activated per bind line via an optional boolean argument of
quic-cc-algo keyword. Contrary to the default usage, pacing is
automatically activated when BBR is chosen. This is because this
algorithm is expected to run on top of pacing, else its behavior is
undefined.

Previously, pacing argument was thus ignored when BBR was selected.
Change this to support explicit deactivation of pacing with it. This
could be useful to test BBR without pacing when debugging some issues.

This should be backported up to 3.1, after a period of observation.

commit | commitdiff | tree

Amaury Denoyelle [Thu, 30 Jan 2025 10:57:55 +0000 (11:57 +0100)]

MINOR: quic: remove references to burst in quic-cc-algo parsing

Pacing activation configuration has been recently revamped. Previously,
the pacing-related quic-cc-algo argument was used to specify a burst size.
It evolved into a boolean value as the burst size is now dynamically
calculated. As such, remove any references to the old burst value in the
config parsing code for cleaner code.

This should be backported up to 3.1, after a period of observation.

commit | commitdiff | tree

Willy Tarreau [Wed, 29 Jan 2025 16:51:12 +0000 (17:51 +0100)]

BUG/MEDIUM: chunk: make sure to flush the trash pool before resizing

Late in 3.1 we've added an integrity check to make sure we didn't keep
trash objects allocated before resizing the trash with commit 0bfd36e7b8
("MINOR: chunk: add a BUG_ON upon the next init_trash_buffer()"), but
it turns out that the counter that is being checked includes the number
of objects left in local thread caches. As such it can trigger despite
no object being allocated. This precisely happens when setting
tune.memory.hot-size to a few megabytes because some temporarily used
trash objects will remain in cache.

In order to address this, let's first flush the pool before running
the check. That was previously done by pool_destroy() but the check
had to be inserted before it. So now we first flush the trash pool,
then verify it's no longer used, and finally we can destroy it.
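
The ordering problem can be shown with a toy model. The sketch below is not
the pool code: ex_pool and its helpers are hypothetical, and the counter is
simply assumed to include the objects parked in thread-local caches, as
described above.

```c
#include <stddef.h>

/* Toy pool: <allocated> counts every object obtained from the system,
 * including the <cached> ones that sit idle in thread-local caches.
 */
struct ex_pool {
	size_t allocated;
	size_t cached;
};

/* Release the idle cached objects back to the system */
static inline void ex_pool_flush(struct ex_pool *p)
{
	p->allocated -= p->cached;
	p->cached = 0;
}

/* Check performed without flushing first: cached objects make the pool look
 * in use even though nothing is really held by a caller.
 */
static inline int ex_pool_looks_unused(const struct ex_pool *p)
{
	return p->allocated == 0;
}

/* Fixed order: flush, then verify. The caller may destroy and recreate the
 * pool with the new size only when this returns non-zero.
 */
static inline int ex_trash_can_resize(struct ex_pool *p)
{
	ex_pool_flush(p);
	return ex_pool_looks_unused(p);
}
```

With three cached objects and none in use, the check alone fails while the
flush followed by the check succeeds. An object still held by a caller keeps
the check failing, which is the case the integrity check is meant to catch.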

This needs to be backported to 3.1. Thanks to Christian Ruppert for
reporting this bug.

commit | commitdiff | tree

William Lallemand [Tue, 28 Jan 2025 19:55:20 +0000 (20:55 +0100)]

BUILD: ssl: cleaner approach to WolfSSL without renegotiation

Patch discussed in https://github.com/wolfSSL/wolfssl/issues/6834

When building WolfSSL without renegotiation options, WolfSSL still
defines the related macros, which triggers warnings during the build.

This patch completes the previous one by undefining the macros so
haproxy can build without any warning.

commit | commitdiff | tree

William Lallemand [Tue, 28 Jan 2025 17:27:31 +0000 (18:27 +0100)]

BUILD: ssl: allow to build without the renegotiation API of WolfSSL

In ticket https://github.com/wolfSSL/wolfssl/issues/6834, it was
suggested to push --enable-haproxy within --enable-distro.

WolfSSL does not want to include the renegotiation support in
--enable-distro.

To achieve this, let haproxy build without SSL_renegotiate_pending()
when WolfSSL does not define HAVE_SECURE_RENEGOTIATION or
HAVE_SERVER_RENEGOTIATION_INFO.

Mirror of https://github.com/haproxy/haproxy.git