BUG/MEDIUM: applet: State appctx have more data if its EOI/EOS/ERROR flag is set
It is a harmless bug for now because only the stats and cache applets are
using their own buffers, and it is not possible to trigger this bug with
these applets. However, it remains important to try a receive when EOI, EOS
or ERROR is reached by the applet while no data was produced. Otherwise, it
is not possible to ack these events at the SE level.
MEDIUM: cli/applet: Stop to test opposite SC in I/O handler of CLI commands
The main CLI I/O handler is responsible for interrupting the processing on
shutdown/abort. It is not the responsibility of the I/O handlers of CLI
commands to take care of it.
MEDIUM: applet: Handle applets with their own buffers in put functions
applet_putchk() and other similar functions now test the applet's type in
order to use the applet's outbuf instead of the channel's buffer. This will
ease applet conversion because most applets rely on these functions.
MEDIUM: buf: Add b_getline() and b_getdelim() functions
These functions are very similar to co_getline() and co_getdelim().
b_getdelim() retrieves the longest part of the buffer that is composed
exclusively of characters not in a delimiter set, while b_getline() stops
on LF only and thus returns a line.
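A standalone sketch of these two lookup semantics (illustration only, not
the actual buffer API, which also has to deal with wrapping):

    #include <string.h>

    /* longest prefix made only of bytes outside the delimiter set
     * (a NUL byte in the data would need special casing here) */
    static size_t getdelim_len(const char *buf, size_t len, const char *delims)
    {
        size_t i;

        for (i = 0; i < len; i++)
            if (strchr(delims, buf[i]))
                break;
        return i;
    }

    /* a full line: everything up to and including the first LF */
    static size_t getline_len(const char *buf, size_t len)
    {
        const char *lf = memchr(buf, '\n', len);

        return lf ? (size_t)(lf - buf) + 1 : len;
    }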
MEDIUM: stream: Use generic version to perform sync receives and sends
Instead of using the connection versions, we now use the generic ones. This
means we will also perform sync receives and sync sends on applets, but only
for applets using their own buffers. Old applets are not concerned.
MINOR: sc_strm: Add generic version to perform sync receives and sends
sc_sync_recv() and sc_sync_send() were added to use the connection or applet
versions, depending on the endpoint type. For now these functions are not
used, but process_stream() will use them to replace the connection versions.
BUG/MINOR: cli: Report an error to user if command or payload is too big
A command larger than a buffer was silently rejected by the CLI applet. It
was handled as an error and the connection was closed, but no error message
was reported to notify the user. Now an error is reported before closing. It
is only displayed if the chunk buffer used by the CLI applet is full and no
delimiter (\n or ;) is found to mark the end of the command. It works for a
simple command but also for a command with a huge payload.
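A standalone sketch of the detection (illustration only; the real code works
on the CLI applet's chunk buffer, assumed linear here):

    #include <stdbool.h>
    #include <string.h>

    /* a full buffer holding no end-of-command delimiter means the command
     * can never fit: report an error instead of failing silently */
    static bool cli_cmd_too_big(const char *buf, size_t len, size_t bufsize)
    {
        return len == bufsize &&
               !memchr(buf, '\n', len) &&
               !memchr(buf, ';', len);
    }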
This patch could be backported to all stable versions.
Amaury Denoyelle [Wed, 27 Mar 2024 09:50:21 +0000 (10:50 +0100)]
MINOR: server: allow cookie for dynamic servers
This commit allows the "cookie" keyword for dynamic servers. After code
review, nothing was found which could prevent a dynamic server from using
it. An extra warning is added under cli_parse_add_server() if the cookie
value is ignored due to a non-HTTP backend.
This patch is not considered a bugfix. However, it may be backported if
needed as its impact seems minimal.
Damien Claisse [Wed, 27 Mar 2024 14:34:25 +0000 (14:34 +0000)]
BUG/MINOR: server: fix persistence cookie for dynamic servers
When adding a server dynamically, we observe that when a backend has a
dynamic persistence cookie, the new server has no cookie as we receive
the following HTTP header:
set-cookie: test-cookie=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Whereas we were expecting to receive something like the following, which
is what we receive for a server added in the config file:
set-cookie: test-cookie=abcdef1234567890; path=/
After investigating the code path, it turns out srv_set_dyncookie() is never
called when adding a server through the CLI; it is only called when parsing
the config file or when using "set server bkd1/srv1 addr".
To fix this, call srv_set_dyncookie() inside cli_parse_add_server().
Amaury Denoyelle [Fri, 22 Mar 2024 16:40:16 +0000 (17:40 +0100)]
BUG/MINOR: server: reject enabled for dynamic server
Since their first implementation, dynamic servers have been created in
maintenance state. This has been done purposely to avoid immediate
activation of a newly inserted server.
However, this principle is incompatible with the "enabled" keyword used on
"add server". The newly created instance will be unreachable as the proxy
load-balancing algorithm is not informed of its presence via
srv_lb_propagate(). The new server could be unblocked by toggling its
state with the "disable server" / "enable server" commands, which trigger
a srv_lb_propagate() invocation.
To avoid this unexpected state, simply forbid the "enabled" keyword for
dynamic servers. In the long term, it could be possible to re-authorize it,
but at least this would require calling srv_lb_propagate() on dynamic
server creation.
This should fix github issue #2497.
This patch should not be backported as-is, to avoid breaking dynamic
servers API on stable versions. "enabled" should instead be ignored for
them. This will be implemented in a dedicated patch on top of 2.9.
Add tests that focus on the incompatibility checks on ocsp-update mode.
This test will only call "haproxy -c" on multiple configurations that
combine the crt-list 'ocsp-update' option and the global
'tune.ssl.ocsp-update.mode'.
MEDIUM: ssl: Add 'tune.ssl.ocsp-update.mode' global option
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows activating ocsp-update on
certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
BUG/MINOR: ssl: Detect more 'ocsp-update' incompatibilities
The inconsistencies in 'ocsp-update' parameter were only checked when
parsing a crt-list line so if a certificate was used on a bind line
after being used in a crt-list with 'ocsp-update' set to 'on', then no
error would be raised. This patch helps detect such inconsistencies.
In a crt-list such as the following:
foo.pem [ocsp-update off] foo.com
foo.pem bar.com
we would get a wrong "Incompatibilities found in OCSP update mode ..."
error message during init when the two lines are actually saying the
same thing since the default for 'ocsp-update' option is 'off'.
Willy Tarreau [Tue, 26 Mar 2024 14:36:49 +0000 (15:36 +0100)]
[RELEASE] Released version 3.0-dev6
Released version 3.0-dev6 with the following main changes :
- MINOR: mux-h2: always use h2c_report_glitch()
- MEDIUM: mux-h2: allow to set the glitches threshold to kill a connection
- MINOR: quic: simplify rescheduling for handshake
- MINOR: quic: remove qc_treat_rx_crypto_frms()
- DOC: configuration: clarify ciphersuites usage (V2)
- MINOR: tools: use public interface for FreeBSD get_exec_path()
- BUG/MINOR: ssl: fix possible ctx memory leak in sample_conv_aes_gcm()
- BUG/MINOR: ssl: do not set the aead_tag flags in sample_conv_aes_gcm()
- BUG/MINOR: server: fix first server template not being indexed
- MEDIUM: ssl: initialize the SSL stack explicitely
- MEDIUM: ssl: allow to change the OpenSSL security level from global section
- CLEANUP: ssl: remove useless #ifdef in openssl-compat.h
- CI: github: add -DDEBUG_LIST to the default builds
- BUG/MINOR: hlua: segfault when loading the same filter from different contexts
- BUG/MINOR: hlua: missing lock in hlua_filter_new()
- BUG/MINOR: hlua: fix missing lock in hlua_filter_delete()
- DEBUG: lua: precisely identify if stream is stuck inside lua or not
- MINOR: hlua: use accessors for stream hlua ctx
- BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread (2nd try)
- MINOR: debug: enable insecure fork on the command line
- CI: github: add -dI to haproxy arguments
- BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on session release
- BUG/MINOR: listener: Don't schedule frontend without task in listener_release()
- MINOR: session: rename private conns elements
- BUG/MAJOR: server: do not delete srv referenced by session
- BUG/MEDIUM: spoe: Don't rely on stream's expiration to detect processing timeout
- BUG/MINOR: spoe: Be sure to be able to quickly close IDLE applets on soft-stop
- MAJOR: spoe: Deprecate the SPOE filter
- MINOR: cfgparse: Add a global option to expose deprecated directives
- MINOR: spoe: Add SPOE filters in the exposed deprecated directives
- CLEANUP: assorted typo fixes in the code and comments
- CI: temporarily adjust kernel entropy to work with ASAN/clang
- BUG/MEDIUM: spoe: Return an invalid frame on recv if size is too small
- BUG/MINOR: session: ensure conn owner is set after insert into session
- BUG/MEDIUM: http_ana: ignore NTLM for reuse aggressive/always and no H1
- BUG/MAJOR: connection: fix server used_conns with H2 + reuse safe
- BUG/MAJOR: ocsp: Separate refcount per instance and per store
- REGTESTS: ssl: Add OCSP related tests
- BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing
- BUG/MEDIUM: ssl: Fix crash in ocsp-update log function
- MEDIUM: ssl: Change output of ocsp-update log
- MINOR: ssl: Change level of ocsp-update logs
- CLEANUP: ssl: Remove undocumented ocsp fetches
- REGTESTS: ssl: Add checks on ocsp-update log format
- MINOR: connection: implement conn_release()
- MINOR: connection: extend takeover with release option
- MEDIUM: server: close idle conn on server deletion
- MEDIUM: mux: prepare for takeover on private connections
- MEDIUM: server: close private idle connection before server deletion
- BUG/MINOR: mux-quic: close all QCS before freeing QCC tasklet
- BUG/MEDIUM: mux-fcgi: Properly handle EOM flag on end-of-trailers HTX block
- BUILD: server: fix build regression on old compilers (<= gcc-4.4)
- OPTIM: http_ext: avoid useless copy in http_7239_extract_{ipv4,ipv6}
- MINOR: debug: add "debug dev trace" to flood with traces
- MINOR: atomic: add a read-specific variant of __ha_cpu_relax()
- MINOR: applet: add new function applet_append_line()
- MINOR: log/applet: add new function syslog_applet_append_event()
- MEDIUM: ring/sink: use applet_append_line()/syslog_applet_append_event() for readers
- REORG: dns/ring: split the ring between the generic one and the DNS one
- MEDIUM: ring: move the ring reader code to ring_dispatch_messages()
- MEDIUM: sink: move the generic ring forwarder code use ring_dispatch_messages()
- MEDIUM: log/sink: make the log forwarder code use ring_dispatch_messages()
- MINOR: buf: add b_add_ofs() to add a count to an absolute position
- MINOR: buf: add b_rel_ofs() to turn an absolute offset into a relative one
- MINOR: buf: add b_putblk_ofs() to copy a block at a specific position
- MINOR: buf: add b_getblk_ofs() that works relative to area and not head
- MINOR: ring: make the ring reader use only absolute offsets
- MINOR: ring: reserve one special value for the readers count
- MINOR: vecpair: add new vector pair based data manipulation mechanisms
- MINOR: vecpair: add necessary functions to use vecpairss from/to ring APIs
- MINOR: ring: rename totlen vs msglen in ring_write()
- MINOR: ring: add ring_data() to report the amount of data in a ring
- MINOR: ring: add ring_size() to return the ring's size
- MINOR: ring: add ring_dup() to copy a ring into another one
- MINOR: ring: also add ring_area(), ring_head(), ring_tail()
- MINOR: ring: make callers use ring_data() and ring_size(), not ring->buf
- MINOR: errors: use ring_dup() to duplicate the startup_logs
- MINOR: ring: use ring_size(), ring_area(), ring_head() and ring_tail()
- MINOR: ring: add a flag to indicate a mapped file
- MAJOR: ring: insert an intermediary ring_storage level
- MINOR: ring: resize only under thread isolation
- MINOR: ring: allow to reduce a ring size
- MEDIUM: ring: replace the buffer API in ring_write() with the vec<->ring API
- MEDIUM: ring: change the ring reader to use the new vector-based API now
- MEDIUM: ring: remove the struct buffer from the ring
- MEDIUM: ring: align the head and tail fields in the ring_storage structure
- MINOR: ring: make the reader check the readers count before inc/dec
- MEDIUM: ring: lock the tail's readers counters before proceeding with the changes
- MEDIUM: ring: protect the reader's positions against writers
- MEDIUM: ring: use the topmost bit of the tail as a lock
- MEDIUM: move the ring's lock to only protect the readers list
- MEDIUM: ring: unlock the ring's tail earlier
- MINOR: ring: don't take the readers lock if there are no readers
- MEDIUM: ring/applet: turn the wait_entry list to an mt_list instead
- MEDIUM: ring: protect the initialization of the initial reader offset
- MINOR: ring: make sure ring_dispatch waits when facing a changing message
- MAJOR: ring: drop the now unneeded lock
- OPTIM: ring: don't even try to update offset when failed to read
- OPTIM: ring: have only one thread at a time wake up all readers
- MINOR: ring: keep a few frequently used pointers in the local stack
- MINOR: ring: add the definition of a ring waiting cell
- MINOR: ring: make the number of queues configurable
- MAJOR: ring: implement a waiting queue in front of the ring
- MEDIUM: ring: significant boost in the loop by checking the ring queue ptr first
- MEDIUM: ring: improve speed in the queue waiting loop on x86_64
- MINOR: ring: simplify the write loop a little bit
- CLEANUP: ring: further simplify the write loop
- MINOR: ring: it's not x86 but all non-ARMv8.1 which needs the read before OR
- MINOR: ring: avoid writes to cells during copy
- OPTIM: ring: use relaxed stores to release the threads
- CLEANUP: ring: use only curr_cell and not next_cell in the main write loop
- BUILD: ssl: fix build error on older compilers with openssl-3.2
- BUG/MINOR: server: 'source' interface ignored from 'default-server' directive
- BUG/MAJOR: ring: free the ring storage not the ring itself when using maps
Willy Tarreau [Tue, 26 Mar 2024 14:12:19 +0000 (15:12 +0100)]
BUG/MAJOR: ring: free the ring storage not the ring itself when using maps
A recent issue was uncovered by the CI, which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turns
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring; only its storage
area needs to be unmapped, and the ring must always be freed separately.
Thanks to Christopher and William for their help at trying to reproduce
it and figuring out the circumstances that trigger it.
BUG/MINOR: server: 'source' interface ignored from 'default-server' directive
Sebastien Gross reported that the 'interface' keyword ('source' subargument)
is silently ignored when used from the 'default-server' directive, despite
the documentation implicitly stating that the keyword should be supported
there.
When support for 'source' keyword was added to 'default-server' directive
in dba97077 ("MINOR: server: Make 'default-server' support 'source'
keyword."), we properly duplicated the conn iface_name from the default-
server but we forgot to copy the conn iface_len which must be set as well
since it is used as setsockopt()'s 'optlen' argument in
tcp_connect_server().
Willy Tarreau [Mon, 25 Mar 2024 20:21:47 +0000 (21:21 +0100)]
BUILD: ssl: fix build error on older compilers with openssl-3.2
OpenSSL 3.2 triggers the code part added by commit 25da217 ("MINOR: ssl:
Update ssl_fc_curve/ssl_bc_curve to use SSL_get0_group_name") which
contains a variable declaration in the for() statement and breaks on
older compilers, as reported in GH issue #2501.
Let's just declare it normally to fix the problem. This must be
backported wherever the commit above is (at least 2.9).
Willy Tarreau [Fri, 22 Mar 2024 15:47:17 +0000 (16:47 +0100)]
OPTIM: ring: use relaxed stores to release the threads
We don't care in what order the threads are released, so we can write
their sent value using relaxed atomic stores. This brings a 3-5% perf
boost on ARM with 80 cores, reaching 7.25M/s, and doesn't change
anything on x86 since it keeps using strict ordering.
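A standalone sketch of the pattern with C11 atomics (HAProxy uses its own
atomic macros): the only information a waiter needs here is the value stored
in its own cell, so the store itself may be relaxed; any payload living
outside the atomic would still need release/acquire ordering.

    #include <stdatomic.h>
    #include <stddef.h>

    struct cell { _Atomic size_t sent; };   /* 0 = not released yet */

    /* leader: release every queued writer; the order between them is
     * irrelevant, hence memory_order_relaxed */
    static void release_all(struct cell **cells, size_t n, size_t sent)
    {
        for (size_t i = 0; i < n; i++)
            atomic_store_explicit(&cells[i]->sent, sent,
                                  memory_order_relaxed);
    }

    /* waiter: spin until the leader publishes the sent count */
    static size_t wait_release(struct cell *c)
    {
        size_t v;

        while (!(v = atomic_load_explicit(&c->sent, memory_order_relaxed)))
            ;   /* a cpu_relax() would sit here in real code */
        return v;
    }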
Willy Tarreau [Fri, 15 Mar 2024 15:10:55 +0000 (16:10 +0100)]
MINOR: ring: avoid writes to cells during copy
It has been found that performing a first pass consisting in copying
all messages, and a second one to notify about releases, is more efficient
on AMD than updating all of them on the fly using a CAS, despite making
writers wait longer to be released.
Maybe it's related to the ability of the CPU to prefetch the contents
during a simple load while it wouldn't do it for an XCHG; it's unclear
at this point. This will also later permit the use of relaxed stores to
release threads.
On ARM the performance increased to 7.0M/s. If this patch is applied
before the dropping of the intermediary step, instead it drops to
3.9M/s. This shows the dependency between such changes that strive to
limit the number of writes on the fast path.
On x86_64, the EPYC at 3C6T saw a small drop from 4.57M to 4.45M, but
the 24C48T setup saw a nice 33% boost from 3.33M to 4.44M, i.e. we
get stable perf at 3 and 24 cores, despite having 8 CCX involved and
fighting with each other.
Other possibilities are:
- use of HA_ATOMIC_XCHG() instead of FETCH_OR()
=> slightly faster (4.62/7.37 vs 4.58/7.34). Problem: it requires
modifying the readers to wait much longer since the tail value
won't be valid during updates, and they will have to wait by
looping over it.
- use other conditions to release a cell
=> to be tested
Willy Tarreau [Sun, 17 Mar 2024 15:55:09 +0000 (16:55 +0100)]
MINOR: ring: it's not x86 but all non-ARMv8.1 which needs the read before OR
Archs relying on CAS benefit from a read prior to FETCH_OR, so it's
not just x86 that benefits from this. Let's just change the condition
to only exclude __ARM_FEATURE_ATOMICS which is the only one faster
without.
Willy Tarreau [Sun, 17 Mar 2024 11:19:29 +0000 (12:19 +0100)]
CLEANUP: ring: further simplify the write loop
The loop was cleaned up a little bit so that the inner loops are more
readable and that the ifdef'd parts are whole blocks and not just an
"if" condition. A few conditions were adjusted to benefit from "break"
and "continue".
Willy Tarreau [Sun, 17 Mar 2024 11:09:30 +0000 (12:09 +0100)]
MINOR: ring: simplify the write loop a little bit
This is mostly a cleanup in that it turns the two-level loop into a
single one, but it also simplifies the code a little bit and brings
some performance savings again, which are mostly noticeable on ARM,
but don't change anything for x86.
Willy Tarreau [Sun, 17 Mar 2024 09:20:56 +0000 (10:20 +0100)]
MEDIUM: ring: improve speed in the queue waiting loop on x86_64
x86_64 doesn't have a native atomic FETCH_OR(); it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held, so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits: the performance goes back from 3.33M to 3.75M at 24C48T while
nothing changes at 3C6T.
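A standalone sketch of the pattern with C11 atomics (the lock-bit choice
matches the tail-lock commit described further below; HAProxy itself uses
its own HA_ATOMIC_* macros):

    #include <stdatomic.h>
    #include <stdint.h>

    #define TAIL_LOCK (1u << 31)   /* topmost bit reserved as the lock */

    static uint32_t lock_tail(_Atomic uint32_t *tail)
    {
        uint32_t old;

        for (;;) {
            /* wait using plain loads: no write cycle on the cache line */
            old = atomic_load_explicit(tail, memory_order_relaxed);
            if (old & TAIL_LOCK)
                continue;          /* still held, keep reading */
            /* only now attempt the CAS-based FETCH_OR equivalent */
            if (atomic_compare_exchange_weak_explicit(tail, &old,
                    old | TAIL_LOCK,
                    memory_order_acquire, memory_order_relaxed))
                return old;        /* unlocked tail value, lock now held */
        }
    }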
Willy Tarreau [Sun, 17 Mar 2024 09:20:56 +0000 (10:20 +0100)]
MEDIUM: ring: significant boost in the loop by checking the ring queue ptr first
By doing that and placing the cpu_relax at the right places, the ARM
reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees a small
increase from 4.45M to 4.57M but at 24C48T it sees a drop from 3.82M
to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), that we'll address next.
Willy Tarreau [Mon, 11 Mar 2024 13:57:37 +0000 (14:57 +0100)]
MAJOR: ring: implement a waiting queue in front of the ring
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room
needs to be released.
It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.
By doing so we're cascading contention areas so that multiple parts can
work in parallel.
Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.
The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.
This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern since such low thread counts
need less bandwidth than bigger ones.
Willy Tarreau [Thu, 14 Mar 2024 07:57:02 +0000 (08:57 +0100)]
MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.
Tests were run with different numbers and it was shown that the value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close, and
anything around these values degrades quickly. The value has been
left tunable in the global section.
This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.
It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of evenly balancing any number of threads.
Willy Tarreau [Fri, 22 Mar 2024 13:46:12 +0000 (14:46 +0100)]
MINOR: ring: keep a few frequently used pointers in the local stack
Code disassembly shows that ring->storage->tail and ring->queue are
accessed a lot and reloaded a lot due to aliasing. Let's just have
variables for them in the local stack. It makes the code smaller and
slightly faster.
Willy Tarreau [Sat, 2 Mar 2024 10:09:37 +0000 (11:09 +0100)]
OPTIM: ring: have only one thread at a time wake up all readers
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.
There used to be a few watchdog reports before this patch on a 24-core
AMD EPYC platform during the list iteration; those never appeared anymore.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
Willy Tarreau [Thu, 29 Feb 2024 10:57:28 +0000 (11:57 +0100)]
OPTIM: ring: don't even try to update offset when failed to read
If there's nothing to read, it's pointless for a reader to try to update
the offset pointer, that's two atomic ops to replace a value by itself
twice. Let's just stop this.
Willy Tarreau [Wed, 28 Feb 2024 16:06:39 +0000 (17:06 +0100)]
MAJOR: ring: drop the now unneeded lock
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the ordered tail updates and the updates to the readers
count that protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5%, while large counts lost about 2% due to some contention
now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
Willy Tarreau [Thu, 29 Feb 2024 10:55:22 +0000 (11:55 +0100)]
MINOR: ring: make sure ring_dispatch waits when facing a changing message
The writer is using tags 0xFF instead of readers count at the front of
messages that are undergoing an update, while the tail has already been
updated. The reader needs to take care of this because it can face these
messages and mistakenly parse data that's still being written, leading
to corruption (especially if this happens while the size is changing).
Let's just stop reading when facing reserved codes, since they indicate
that the end of usable messages was reached.
Willy Tarreau [Wed, 28 Feb 2024 16:42:56 +0000 (17:42 +0100)]
MEDIUM: ring: protect the initialization of the initial reader offset
Since we're going to remove the lock, there's no more way to prevent the
ring from being fed while we're attaching a client to it. We need to
freeze the buffer while looking at its head so that we can attach there
and have a trustable one. We could do it by setting the lock bit on the
tail offset but quite frankly we don't need to bother with that, attaching
a client is rare enough to permit a thread_isolate().
Willy Tarreau [Wed, 28 Feb 2024 16:04:40 +0000 (17:04 +0100)]
MEDIUM: ring/applet: turn the wait_entry list to an mt_list instead
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
Willy Tarreau [Wed, 28 Feb 2024 11:07:51 +0000 (12:07 +0100)]
MINOR: ring: don't take the readers lock if there are no readers
There's no point looking for freshly attached readers if there are none,
taking this lock requires an atomic write to a shared area, something we
clearly want to avoid.
A general test with 213-byte messages on different thread counts shows
how the performance degrades across CCX and how this patch improves the
situation:
             Before         After
3C6T/1CCX:   6.39 Mmsg/s    6.35 Mmsg/s
6C12T/2CCX:  2.90 Mmsg/s    3.16 Mmsg/s
12C24T/4CCX: 2.14 Mmsg/s    2.33 Mmsg/s
24C48T/8CCX: 1.75 Mmsg/s    1.92 Mmsg/s
This tends to confirm that the queues will really be needed and that
they'll have to be per-ccx hence per thread-group. They will amortize
the number of updates on head & tail (one per multiple messages).
Willy Tarreau [Wed, 28 Feb 2024 08:57:00 +0000 (09:57 +0100)]
MEDIUM: ring: unlock the ring's tail earlier
We know we can continue to protect the message area so we can unlock the
tail as soon as we know its new value. Now we're seeing ~6.4M msg/s vs
5.4M previously on 3C6T of a 3rd gen EPYC, and 1.88M vs 1.54M for 24C48T
threads, which is a significant gain!
This requires carefully writing the new head counter before releasing
the writers, and changing the calculation of the work area from
tail..head to tail...new_tail while writing the message.
Willy Tarreau [Wed, 28 Feb 2024 08:37:47 +0000 (09:37 +0100)]
MEDIUM: ring: use the topmost bit of the tail as a lock
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a visible effect that buffers will be
limited to 2GB instead of 4GB on 32-bit machines (though in practice,
good luck allocating more than 2GB contiguous on 32-bit). But since the
size is read with atol() and some operating systems limit it to LONG_MAX
unless passing negative numbers, the limit is already there anyway.
For now the impact on x86_64 is significant (a drop from 2.35 to 1.4M/s
on 48 threads on a 24-core EPYC), but this situation is only temporary
so that the changes remain reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is less good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
Willy Tarreau [Wed, 28 Feb 2024 16:18:34 +0000 (17:18 +0100)]
MEDIUM: ring: protect the reader's positions against writers
The reader now needs to protect the positions it's reading. This is
already done via the readers counter at the beginning of messages,
but as long as the lock is present, this counter is decremented
before starting to parse messages, and incremented at the end.
We must now do that in reverse, first protect the end of the messages,
and only then remove ourselves from the already processed messages, so
that at no point could a writer pass over and possibly overwrite data
we're currently watching.
Willy Tarreau [Wed, 28 Feb 2024 08:20:54 +0000 (09:20 +0100)]
MEDIUM: ring: lock the tail's readers counters before proceeding with the changes
The goal here is to start to protect the writing area inside the area
itself so that we'll later be able to release the ring's lock. We're not
there yet, but at least the tail is marked as protected for as long as the
message is not fully written.
Willy Tarreau [Wed, 28 Feb 2024 08:03:46 +0000 (09:03 +0100)]
MINOR: ring: make the reader check the readers count before inc/dec
We'll want to reserve some special values for the readers count to
temporarily lock the following message, but for this it will be mandatory
that readers check for them before incrementing/decrementing the counter.
Let's do that using a CAS. The readers' performance is not as critical as
the writer's anyway, so the slight overhead is not a problem.
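A standalone C11 sketch of such a check (0xFF stands for the reserved
value; HAProxy uses its own atomic macros):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_WRITING 0xFF   /* reserved readers-count value */

    /* returns 1 once the reader is accounted on the message, or 0 if the
     * message is currently locked by a writer (real code also keeps the
     * count from ever reaching the reserved value) */
    static int readers_inc(_Atomic uint8_t *cnt)
    {
        uint8_t old = atomic_load_explicit(cnt, memory_order_relaxed);

        do {
            if (old == RING_WRITING)
                return 0;
        } while (!atomic_compare_exchange_weak_explicit(cnt, &old, old + 1,
                     memory_order_acquire, memory_order_relaxed));
        return 1;
    }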
Willy Tarreau [Wed, 28 Feb 2024 11:04:22 +0000 (12:04 +0100)]
MEDIUM: ring: align the head and tail fields in the ring_storage structure
We really want to let the readers and writers act on different areas, so
we want to have the tail and the head on separate cache lines, themselves
separate from the rest of the ring. Doing so improves the performance from
2.15 to 2.35M msg/s at 48 threads on a 24-core EPYC.
This increases the header space from 32 to 192 bytes when threads are
enabled. But since we already have the header size available in the file,
haring remains able to detect the aligned vs unaligned formats and call
dump_v2a() when aligned is detected.
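A standalone C11 sketch of the layout idea (field names and the 64-byte
line size are assumptions; HAProxy uses its own alignment macros). With
64-byte cache lines, sizeof() below is exactly 192 bytes, matching the
enlarged header mentioned above:

    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stddef.h>

    struct ring_storage_sketch {
        size_t size;                      /* cold metadata: first line */
        size_t rsvd;
        alignas(64) _Atomic size_t head;  /* readers' cache line */
        alignas(64) _Atomic size_t tail;  /* writers' cache line */
    };  /* the message area is appended right after this header */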
Willy Tarreau [Tue, 27 Feb 2024 08:17:45 +0000 (09:17 +0100)]
MEDIUM: ring: remove the struct buffer from the ring
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool was thus updated based on this: it detects that the
rsvd field is smaller than a page and that the sum of it and the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
Willy Tarreau [Tue, 27 Feb 2024 06:58:26 +0000 (07:58 +0100)]
MEDIUM: ring: change the ring reader to use the new vector-based API now
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
Willy Tarreau [Mon, 26 Feb 2024 10:03:03 +0000 (11:03 +0100)]
MEDIUM: ring: replace the buffer API in ring_write() with the vec<->ring API
This is the start of the replacement of the buffer API calls. Only the
ring_write() function was touched. Instead of manipulating a buffer all
along, we now extract the ring buffer's head and tail upon entry, store
them locally and use them using the vec<->ring API until the last moment
where we can update the buffer with the new values. One subtle point is
that we must never fill the buffer past the last byte otherwise the
vec-to-ring conversion gets lost and there's no more possibility to know
where's the beginning nor the end (just like when dealing with head+tail
in fact), because it then becomes impossible to distinguish between an
empty and a full buffer.
Willy Tarreau [Thu, 14 Mar 2024 05:48:41 +0000 (06:48 +0100)]
MINOR: ring: allow to reduce a ring size
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
Willy Tarreau [Wed, 28 Feb 2024 08:45:54 +0000 (09:45 +0100)]
MINOR: ring: resize only under thread isolation
The ring resizing was already quite tricky, but when facing atomic
writes it will no longer be possible and we definitely do not want to
have to deal with a lock there. Since it's only done at boot time, and
possibly later from the CLI, let's just do it under thread isolation.
Willy Tarreau [Sun, 3 Mar 2024 16:20:10 +0000 (17:20 +0100)]
MAJOR: ring: insert an intermediary ring_storage level
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once and for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Willy Tarreau [Sun, 3 Mar 2024 16:50:11 +0000 (17:50 +0100)]
MINOR: ring: add a flag to indicate a mapped file
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.
Willy Tarreau [Wed, 6 Mar 2024 15:50:40 +0000 (16:50 +0100)]
MINOR: ring: use ring_size(), ring_area(), ring_head() and ring_tail()
Some open-coded constructs were updated to make use of the ring accessors
instead. This allows removing some more direct dependencies on the
buffers API.
Willy Tarreau [Tue, 27 Feb 2024 18:52:50 +0000 (19:52 +0100)]
MINOR: errors: use ring_dup() to duplicate the startup_logs
In startup_logs_dup() we currently need to reference the ring's buffer,
better not do this as it will complicate operations when switching to
other types.
Willy Tarreau [Tue, 27 Feb 2024 18:52:00 +0000 (19:52 +0100)]
MINOR: ring: add ring_dup() to copy a ring into another one
This will mostly be used during reallocation and boot-time duplicates,
the purpose is simply to save the caller from having to know the details
of the internal representation.
Willy Tarreau [Mon, 26 Feb 2024 19:03:20 +0000 (20:03 +0100)]
MINOR: ring: rename totlen vs msglen in ring_write()
The ring_write() function uses confusing variable names: totlen is in
fact the length of the message, not the total length that is going to
be written. Let's rename it msglen and have a real "needed" that
corresponds to the total size we're going to write. We also add a
BUG_ON_HOT() to catch mistakes causing discrepancies.
Willy Tarreau [Mon, 26 Feb 2024 18:53:34 +0000 (19:53 +0100)]
MINOR: vecpair: add necessary functions to use vecpairss from/to ring APIs
Many ring-based APIs need a tail and a head, with some extra assumption
that the user takes care of not filling the ring so that tail==head is
unambiguous. Vectors are particularly suited to this usage so here we
create 4 functions to create vectors representing free room or data
from a ring, as well as updating rings based on a pair of vectors that
represents either free space or data.
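A standalone sketch of the data-side derivation (the vector type and names
are illustrative, not the actual API):

    #include <stddef.h>

    struct vec { char *ptr; size_t len; };

    /* fill v[0]/v[1] with the (at most two) readable chunks of a ring of
     * <size> bytes whose data spans head..tail; assumes the ring is never
     * completely full, so head == tail unambiguously means empty */
    static void ring_data_vecs(char *area, size_t size,
                               size_t head, size_t tail, struct vec v[2])
    {
        if (head <= tail) {
            v[0] = (struct vec){ area + head, tail - head };
            v[1] = (struct vec){ NULL, 0 };
        } else {
            v[0] = (struct vec){ area + head, size - head };
            v[1] = (struct vec){ area, tail };
        }
    }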
Willy Tarreau [Thu, 22 Feb 2024 07:02:35 +0000 (08:02 +0100)]
MINOR: vecpair: add new vector pair based data manipulation mechanisms
The buffers API defines both a storage layout and how to handle the
data. The storage is shared with the chunks API which only deals with
non-wrapping messages while buffers support wrapping both of the data
and of the free space. As such, most of the buffers code already makes
special cases of two parts in a buffer, the first one before wrapping
and the optional second one after the wrapping occurred.
The thing is, there are plenty of other places (e.g. rings) where the
code dealing with wrapping is desirable but with a different storage
layout. Let's export the existing buffer handling code related to
reading/writing wrapping data and make it work with arbitrary vector
pairs instead. This will handle wrapping and holes in messages if
desired, and it will be up to the caller to decide how its messages
are arranged and to pass the relevant ptr,len elements.
The code is limited to two vectors because this is sufficient to deal
with wrapping without making the code needlessly complex. I.e. this will
not reassemble an iovec. For vectors, since we already had the ist type,
there's no point inventing a new type, and it's even possible that over
time some callers will find benefits in using this unified API (i.e. no
NOP translation layer). It also allows to pass inputs as direct arguments
and outputs as pointers. Not only this is more efficient code-wise, but
it also avoids the accidental use of a wrong function. It was indeed
found that naming functions is even harder than with the buffer as the
notion of from/to is even fuzzier here.
The API will likely continue to evolve and some functions might get
renamed to more explicit ones over time to limit confusion. For now
the code provides anything needed to reset/create/fill/erase/read/peek
or measure vector pairs and to manipulate chars/blocks/varints to/from
there.
Willy Tarreau [Mon, 20 Feb 2023 18:21:52 +0000 (19:21 +0100)]
MINOR: ring: reserve one special value for the readers count
In order to support concurrent writers we'll need to lock areas in the
buffer. For this we'll use one special value of the single-byte readers
count. Let's reserve it now and use the macro instead of the hardcoded
255.
Willy Tarreau [Thu, 23 Feb 2023 08:53:38 +0000 (09:53 +0100)]
MINOR: ring: make the ring reader use only absolute offsets
The goal is to remove references to the buffer's head and tail in the
fast path so that we can release the lock during some reads. This means
no more comparisons with b_data() nor operations relative to b_head()
will be possible anymore. As a first step we need to have an absolute
offset in the buffer, and to use b_getblk_ofs() in the applet callbacks
to retrieve the data based on this.
Willy Tarreau [Wed, 22 Feb 2023 14:53:52 +0000 (15:53 +0100)]
MINOR: buf: add b_getblk_ofs() that works relative to area and not head
For some concurrently accessed buffers we can't rely on head/data etc.,
but sometimes the access patterns guarantee that the buffer contents
are there. Let's implement a function to read contents from a fixed
offset, which never checks head nor data, only the area and its size.
It's the caller's job to get this offset.
Willy Tarreau [Thu, 23 Feb 2023 11:02:05 +0000 (12:02 +0100)]
MINOR: buf: add b_putblk_ofs() to copy a block at a specific position
This new function b_putblk_ofs() puts one full block of data of length
<len> from <blk> into the buffer, starting from absolute offset <offset>
after the buffer's area. As a convenience to avoid complex checks in
callers, the offset is allowed to exceed a valid one by no more than one
buffer size, and will automatically be wrapped. The caller is responsible
for ensuring that <len> doesn't exceed the known length of the available
room at this position, otherwise data may be overwritten. The buffer's
length is *not* updated, so generally the caller will have updated it
before calling this function. This is meant to be used on concurrently
accessed buffers, so that a writer can append data while a reader is
blocked by other means from reaching the current area. The function
guarantees never to use ->head nor ->data.
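A standalone sketch of these semantics (illustration only, not the actual
implementation; assumes <len> never exceeds the buffer size):

    #include <string.h>

    /* write <len> bytes at absolute offset <ofs> into a circular area of
     * <size> bytes; <ofs> may exceed a valid offset by up to one buffer
     * size and is wrapped; head and data are never consulted */
    static void putblk_ofs(char *area, size_t size, size_t ofs,
                           const char *blk, size_t len)
    {
        size_t part;

        if (ofs >= size)
            ofs -= size;
        part = size - ofs;          /* contiguous room before the end */
        if (len <= part) {
            memcpy(area + ofs, blk, len);
        } else {
            memcpy(area + ofs, blk, part);
            memcpy(area, blk + part, len - part);
        }
    }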
Willy Tarreau [Tue, 27 Feb 2024 16:05:11 +0000 (17:05 +0100)]
MEDIUM: log/sink: make the log forwarder code use ring_dispatch_messages()
This code becomes even simpler and almost does not need any knowledge
of the structure of the ring anymore. It even highlighted that an old
race had not been fixed due to code duplication, but that's now done.
Willy Tarreau [Tue, 27 Feb 2024 15:54:18 +0000 (16:54 +0100)]
MEDIUM: ring: move the ring reader code to ring_dispatch_messages()
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that it does not necessarily depend on
an appctx.
The current "show ring" code was ported to this and it continues to
work as expected.
Willy Tarreau [Mon, 19 Feb 2024 13:38:59 +0000 (14:38 +0100)]
REORG: dns/ring: split the ring between the generic one and the DNS one
A ring is used for the DNS code but slightly differently from the generic
one, which prevents some important changes from being made to the generic
code without breaking DNS. As the use cases differ, it's better to just
split them apart for now and have the DNS code use its own ring that we
rename dns_ring and let the generic code continue to live on its own.
The unused parts were dropped: CLI registration, resizing, and
allocation from a mapped area. dns_ring_detach_appctx() was
kept despite not being used, so as to stay consistent with the comments
that say it must be called, despite the DNS code explicitly mentioning
that it skips it for now (i.e. this may change in the future).
Hopefully after the generic rings are converted the DNS code can migrate
back to them, though this is really not necessary.
Willy Tarreau [Tue, 27 Feb 2024 14:55:26 +0000 (15:55 +0100)]
MEDIUM: ring/sink: use applet_append_line()/syslog_applet_append_event() for readers
The ring reader code was duplicated as-is in 2.2 for the ring forwarding
code in commits 494c505703 ("MEDIUM: ring: add server statement to forward
messages from a ring") and 975564784f ("MEDIUM: ring: add new srv statement
to support octet counting forward") (the latter only differing by using a
prefix instead of a suffix to delimit messages).
Unfortunately, that makes it almost impossible to rework the core ring
code because all these parts rely on it. This first commit aims at
restoring a common structure for the core loop by just calling a distinct
function based on the use case. The functions are either
applet_append_line() when a whole line is to be emitted followed by an LF
character, or syslog_applet_append_event() when trying to send a TCP
syslog line prepended with its size in decimal.
Willy Tarreau [Tue, 27 Feb 2024 15:17:42 +0000 (16:17 +0100)]
MINOR: log/applet: add new function syslog_applet_append_event()
This function takes a buffer on input, an offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. Contrary to its sibling applet_append_line(), instead of just
appending an LF at the end of the line, it prepends the message size
in decimal and a space before the message, as expected by TCP syslog
implementations. This will be used to simplify the ring reader code.
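For illustration, a minimal standalone sketch of that framing (octet
counting as described in RFC 6587; names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* prepend "<len> " in decimal before the message, as TCP syslog
     * octet counting expects; returns the framed length or -1 */
    static int frame_syslog_event(char *out, size_t outsz,
                                  const char *msg, size_t len)
    {
        int hdr = snprintf(out, outsz, "%zu ", len);

        if (hdr < 0 || (size_t)hdr + len > outsz)
            return -1;
        memcpy(out + hdr, msg, len);
        return hdr + (int)len;
    }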
Willy Tarreau [Tue, 27 Feb 2024 14:50:55 +0000 (15:50 +0100)]
MINOR: applet: add new function applet_append_line()
This function takes a buffer on input, an offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. This will be used to simplify the ring reader code.
Willy Tarreau [Fri, 15 Mar 2024 16:31:35 +0000 (17:31 +0100)]
MINOR: atomic: add a read-specific variant of __ha_cpu_relax()
Tests on various systems show that x86 prefers not to wait at all inside
read loops while aarch64 prefers to wait a little bit. Instead of having
to stuff ifdefs around __ha_cpu_relax() inside plenty of such loops
waiting for a condition to appear, better implement a new variant that
we call __ha_cpu_relax_for_read() which honors each architecture's
preferences and is the same as __ha_cpu_relax() for other ones.
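A standalone sketch of the idea (the per-architecture behavior mirrors the
description above; the exact aarch64 instruction is an assumption, HAProxy
defines the real __ha_cpu_relax_for_read()):

    /* in read-only spin loops: do nothing on x86, pause briefly on
     * aarch64, and fall back to a plain compiler barrier elsewhere */
    #if defined(__x86_64__) || defined(__i386__)
    #  define cpu_relax_for_read() do { } while (0)
    #elif defined(__aarch64__)
    #  define cpu_relax_for_read() __asm__ volatile("isb" ::: "memory")
    #else
    #  define cpu_relax_for_read() __asm__ volatile("" ::: "memory")
    #endif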
Willy Tarreau [Fri, 15 Mar 2024 06:16:08 +0000 (07:16 +0100)]
MINOR: debug: add "debug dev trace" to flood with traces
This new command, enabled only with "DEBUG_DEV", sends 2 or 20 traces
per task wakeup (depending on the verbosity level), and stops after 1M
wakeups per thread in order not to have to stop/start the process each
time it's fired.
We have two small messages and 18 larger ones from 20 to 270 bytes
each, so that the average size is approx 213 bytes counting headers
(the header adds approx 82 bytes), which matches what's generally
observed on average when traces are enabled in all muxes.
Typical figures show variations between 5.7M and 6.2M msg/s on an EPYC
in a 3C6T setup (single CCX), and 2.12M - 2.22M in a 24C48T setup
(across 8 CCX, with 8 thread groups).
OPTIM: http_ext: avoid useless copy in http_7239_extract_{ipv4,ipv6}
In http_7239_extract_{ipv4,ipv6}, we declare a local buffer in order to
use inet_pton() since it requires a valid destination argument (cannot be
NULL). Then, if the caller provided <ip> argument, we copy inet_pton()
result (from local buffer to <ip>).
In fact, when the caller provides <ip>, we may directly use <ip> as
inet_pton()'s dst argument to avoid a useless copy. Thus the local buffer
is only relevant when the caller doesn't provide <ip>.
While at it, let's add a missing testcase for the rfc7239_n2nn converter
(to check that http_7239_extract_ipv4() with <ip> provided works properly).
This could be backported to 2.8 with b2bb925 ("MINOR: proxy/http_ext:
introduce proxy forwarded option").
BUILD: server: fix build regression on old compilers (<= gcc-4.4)
Willy reported that since 3ac79b504 ("MEDIUM: server:
make server_set_inetaddr() updater serializable"), haproxy fails to
compile on some older compilers such as gcc-4.4 with this kind of error:
src/server.c: In function 'snr_resolution_cb':
src/server.c:4471: error: unknown field 'dns_resolver' specified in initializer
compilation terminated due to -Wfatal-errors.
make: *** [Makefile:1006: src/server.o] Error 1
This is due to referencing a member inside an anonymous union from a
compound literal assignment. Apparently such use of anonymous unions wasn't
properly supported back then on older compilers. To fix the issue, we give
the name "u" to the parent union and use this name to explicitly refer to
the union where relevant in the code (only a few changes fortunately).
The fix itself was verified to restore build compatibility with gcc 4.4
(and even 4.2).
As 3ac79b504 is used as a prerequisite for 64c9c8ef3 ("BUG/MINOR:
server/dns: use server_set_inetaddr() to unset srv addr from DNS"), please
consider backporting this patch too if 64c9c8ef3 happens to be backported
in 2.9.
BUG/MEDIUM: mux-fcgi: Properly handle EOM flag on end-of-trailers HTX block
Trailers are skipped by the FCGI multiplexer. However, empty chunked messages
are not properly handled. It may be a chunked H1 request with no payload or
an H2/H3 POST request with no payload. In that case, the EOT HTX block is
just ignored. The issue is that the EOM flag is thus ignored too. It means
no empty STDIN record is sent to mark the end of the request to the server.
To fix the issue, when an EOT HTX block is found and it is the last HTX block
of the message (and it should be), the EOM flag is tested. If it is found,
an empty STDIN record is emitted.
This patch should fix the issue #2499. It must be backported as far as 2.4.
Amaury Denoyelle [Fri, 22 Mar 2024 14:13:42 +0000 (15:13 +0100)]
BUG/MINOR: mux-quic: close all QCS before freeing QCC tasklet
QUIC MUX is freed via qcc_release(). This in turn liberate all the
remaining QCS instances. For each one of them, their corresponding
stream-desc is released via qc_stream_desc_release().
This last function may itself notify the QUIC MUX when new buffers are
available. This is useful when QCS are closed individually, without the
whole connection. However, when the connection is closed through
qcc_release(), this may cause issues as some elements of the QUIC MUX are
already freed.
In 2.9.6, a bug directly linked to this was detected. Indeed, the QCC
instance may be woken up on stream-desc release. If called through
qcc_release(), this is an issue because the QCC tasklet is freed before
the QCS instances. However, this bug is not systematic and relies on prior
conditions: in particular, the QUIC MUX must be under Tx buffer exhaustion
prior to the qcc_release() invocation.
The current dev tree is not impacted by this bug, thanks to the QUIC MUX
refactoring. Indeed, notifications across layers have changed, and
stream-desc release now notifies individual QCS instances instead of the
QCC element, which is a safer mechanism. However, to simplify the backport
process, the bugfix is introduced in the current dev tree even though it
has no impact there.
Note that a proper fix would be to set the quic-conn MUX state to
QC_MUX_RELEASED. However, it is not possible to call quic_close()
without having released all stream-desc elements first. The simpler
solution was chosen to prevent other breaking issues during backports.
This should fix github issue #2494.
It should be backported up to 2.6. Note that prior to 2.7 qcc_release()
was named qc_release().
This patch implements a similar logic, this time to close private idle
connections stored in sessions. The principle is identical to the above
commit: conn_release() is used on idle connections after a takeover to
ensure thread safety.
An extra change was required to be able to execute a takeover on such
connections. Their original thread ID was unknown, contrary to non-private
connections which are stored in sharded lists. As such, a new tid member
has been added under the sess_priv_conns chaining element.
Amaury Denoyelle [Wed, 20 Mar 2024 14:51:09 +0000 (15:51 +0100)]
MEDIUM: mux: prepare for takeover on private connections
When a backend connection is marked as idle, a special flag TASK_F_USR1
is set on the MUX tasklet. When the MUX tasklet is reactivated, extra
checks are executed under this flag to ensure no takeover occurred in the
meantime.
Previously, only non-private connections could be targeted by a takeover.
However, this will change when implementing private idle connection
closure in the "delete server" CLI handler. As such, TASK_F_USR1 is now
also set for private connections in the MUX detach callbacks.
Amaury Denoyelle [Wed, 13 Mar 2024 10:33:50 +0000 (11:33 +0100)]
MEDIUM: server: close idle conn on server deletion
To be able to delete a server, a number of preconditions must be
validated to ensure it is not in use anymore. Previously, if idle
connections were stored in the server, the deletion was cancelled. No
action was implemented to force idle connection closure; the only
solution was to wait for the periodic purging to complete.
This was an extra burden on deleting a server. Indeed, idle connections
are by definition inactive and can be closed prior to deleting a server.
This is the exact purpose of this patch.
Idle connection removal is implemented inside the "delete server" handler,
once it has been determined that the server can be freely removed. A
simple loop is run to call conn_release() on each idle connection.
Takeover is also executed before conn_release() to ensure tasks/tasklets
or any other sensitive elements are not deleted from a foreign thread.
This patch should reduce the occurrence of rejected "delete server"
executions, especially when connection reuse is high.
Amaury Denoyelle [Fri, 15 Mar 2024 14:36:33 +0000 (15:36 +0100)]
MINOR: connection: extend takeover with release option
Extend the takeover API, both for MUX and XPRT, with a new boolean argument
<release>. Its purpose is to signal whether the connection will be freed
immediately after the takeover, rendering new resource allocation
unnecessary.
For the moment, the release argument is always false. However, it will be
set to true in the "delete server" CLI handler to proactively close server
idle connections.
Amaury Denoyelle [Wed, 20 Mar 2024 14:37:09 +0000 (15:37 +0100)]
MINOR: connection: implement conn_release()
Several places reuse the same code to ensure a connection is properly
freed, either via its MUX or by calling the proper set of functions.
Factorize all of this in a new function conn_release().
This new function is now called via session_free() and
session_accept_fd(). It will also be reused on delete server to
proactively close idle connections.
Those fetches were undocumented and were just there so that the
ocsp-update log could be emitted through a regular log format. But since
the logging is now "handmade" (since "BUG/MEDIUM: ssl: Fix crash in
ocsp-update log function"), we don't need them anymore.
Since commit "BUG/MEDIUM: ssl: Fix crash in ocsp-update log function",
some information in the log line is "faked" because it cannot actually
be retrieved anymore (or never could be). We should then remove it from
the log line altogether instead of providing useless fields.
We then only keep pure OCSP-update information in the log line:
"<certname> <status> <status str> <fail count> <success count>"
BUG/MEDIUM: ssl: Fix crash in ocsp-update log function
The ocsp-update logging mechanism was built around the 'sess_log'
function, which required keeping a pointer to the said session until the
logging function could be called. This was done by keeping a pointer to
the appctx returned by the 'httpclient_start' function. But this appctx
lives its own life and might be destroyed before 'ssl_ocsp_send_log' is
called, which could result in a crash (UAF).
Fixing this crash requires that we stop using the 'sess_log' function to
emit the ocsp-update logs. The log line then needs to be built by hand
out of the information actually available when 'ssl_ocsp_send_log' is
called. Since we don't use the "regular" logging functions anymore, we
don't need the error_logformat anymore. In order to keep a behavior
consistent with before, we keep the same format for the logs but replace
the fields that required a 'sess' pointer with fake values (the %ci:%cp
for instance, which was never filled anyway).
This crash was raised in GitHub issue #2442.
It should be backported up to branch 2.8.
BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP update task works by
removing the entry currently being updated from the update tree and then
reinserting it when the update process is over, it meant that in the CLI
command code we were modifying a structure that was already being used.
These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.
Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.
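A minimal sketch of that guard (the flag name and error message are
illustrative):

    /* only one updater at a time may detach the entry from the tree */
    if (HA_ATOMIC_XCHG(&ocsp->updating, 1) != 0)
        return cli_err(appctx, "Update already in progress.\n");
    eb64_delete(&ocsp->next_update);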
An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.
This patch can be backported up to 2.8. Wait a little bit before
backporting.
BUG/MAJOR: ocsp: Separate refcount per instance and per store
With the current way OCSP responses are stored, a single OCSP response
is stored (in a certificate_ocsp structure) when it is loaded during a
certificate parsing, and each SSL_CTX that references it increments its
refcount. The reference to the certificate_ocsp is kept in the SSL_CTX
linked to each ckch_inst, in an ex_data entry that gets freed when the
context is freed.
One of the downsides of this implementation is that if every ckch_inst
referencing a certificate_ocsp gets destroyed, then the OCSP response is
removed from the system. So if we were to remove all crt-list lines
containing a given certificate (that has an OCSP response), and if all
the corresponding SSL_CTXs were destroyed (no ongoing connection using
them), the OCSP response would be destroyed even if the certificate
remains in the system (as an unused certificate).
In such a case, we would want the OCSP response not to be "usable",
since it is not used by any ckch_inst, but still remain in the OCSP
response tree so that if the certificate gets reused (via an "add ssl
crt-list" command for instance), its OCSP response is still known as
well.
But we would also like such an entry not to be updated automatically
anymore once no instance uses it. An easy way to do it could have been
to keep a reference to the certificate_ocsp structure in the ckch_store
as well, on top of all the ones in the ckch_instances, and to remove the
ocsp response from the update tree once the refcount falls to 1, but it
would not work because of the way the ocsp response tree keys are
calculated. They are decorrelated from the ckch_store: they are the
actual OCSP_CERTIDs, a combination of the issuer's name hash and key
hash plus the certificate's serial number. So two copies of the same
certificate but with different names would still point to the same ocsp
response tree entry.
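This follows from how the key is obtained in the first place (standard
OpenSSL API; SHA1 is used when the digest argument is NULL):

    /* the OCSP_CERTID depends only on the certificate and its issuer,
     * never on the ckch_store name, so two stores holding the same
     * certificate map to the same tree entry
     */
    OCSP_CERTID *id = OCSP_cert_to_id(NULL, cert, issuer);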
The solution that answers all the needs expressed above is actually
to have two reference counters in the certificate_ocsp structure, one
actual reference counter corresponding to the number of "live" pointers
on the certificate_ocsp structure, incremented for every SSL_CTX using
it, and one for the ckch stores.
If the ckch_store reference counter falls to 0, the corresponding
certificate must have been removed via CLI calls ('set ssl cert' for
instance).
If the actual refcount falls to 0, then no live SSL_CTX uses the
response anymore. It could happen if all the corresponding crt-list
lines were removed and there are no live SSL sessions using the
certificate anymore.
If any of the two refcounts becomes 0, we will always remove the
response from the auto update tree, because there's no point in spending
time updating an OCSP response that no new SSL connection will be able
to use. But the certificate_ocsp object won't be removed from the tree
unless both refcounts are 0.
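A sketch of that release rule (field and function names hypothetical):

    static void ocsp_entry_release(struct certificate_ocsp *ocsp,
                                   int from_store)
    {
        if (from_store)
            ocsp->store_refcount--;
        else
            ocsp->refcount--;

        /* either counter reaching zero evicts the entry from the
         * auto-update tree: no new SSL connection can use it anyway
         */
        if (ocsp->store_refcount <= 0 || ocsp->refcount <= 0)
            eb64_delete(&ocsp->next_update);

        /* the object itself only dies once both counters are zero */
        if (ocsp->store_refcount <= 0 && ocsp->refcount <= 0)
            free(ocsp);
    }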
Must be backported up to 2.8. Wait a little bit before backporting.
Amaury Denoyelle [Tue, 19 Mar 2024 09:34:45 +0000 (10:34 +0100)]
BUG/MAJOR: connection: fix server used_conns with H2 + reuse safe
By default, backend connections are accounted by the server. This makes
it possible to determine the number of idle connections to keep. A
backend connection can also be marked as private to prevent its reuse.
It is thus moved from the server lists into the session list. As such, a
private connection is not accounted by the server: conn_set_private()
uses srv_release_conn() to ensure this.
When using HTTP/2 on the backend side with the default http-reuse safe
mode, the above principles are mixed. Indeed, when a connection is first
used, or switches from idle to used, it is moved into the session list
but it is not flagged as private. This is done to avoid sharing it
between different clients, so as to prevent head-of-line blocking
issues. When all streams are closed, the connection becomes idle again
and is reinserted into the server list. This has been introduced by the
following patch :
When freeing a backend connection, special care is taken to ensure
server used counter is decremented. This is implemented into
conn_backend_deinit(). However, this function does this only if the
connection is not present in a session list. This is valid for private
connections. However, if a connection is non-private and only
temporarily present in a session list, the decrement operation won't be
executed despite the connection being accounted by the server.
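Schematically, the faulty condition looked like this (names simplified,
not the exact code):

    /* the decrement was skipped for any connection sitting in a session
     * list, even a non-private one parked there only temporarily
     */
    if (!conn_in_session_list(conn))            /* hypothetical helper */
        srv_release_conn(objt_server(conn->target), conn);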
This bug has several impacts. The server used counter can never return
to its initial null value, even when all its connections are closed.
This can result in a wrong estimation of the necessary idle connections,
which may cause unnecessary new connections. Also, it definitely
prevents the server from being removed via the "delete server" CLI
command.
This should be backported up to 2.4. Note that conn_backend_deinit() was
introduced in 2.9. For lesser versions, the change should be done
directly into conn_free().
Amaury Denoyelle [Wed, 20 Mar 2024 08:25:03 +0000 (09:25 +0100)]
BUG/MEDIUM: http_ana: ignore NTLM for reuse aggressive/always and no H1
Backend connections can be marked as private to prevent their sharing by
multiple clients. Now, this has become an exception as only two reasons
for data traffic can trigger this (checks are ignored here) :
* http-reuse never
* HTTP response with NTLM header
The first case is easy to manage as the connection is flagged as private
since its inception. However, the second case is dynamic as the
connection can be flagged anytime during its lifetime. When using a
backend protocol such as HTTP/2 with reuse mode aggressive or always, we
face a design issue as the connection would be marked as private,
despite potentially being shared by several clients at the same time.
This is conceptually invalid, but worse, it can trigger crashes in the
MUX stream detach callback depending on the order of release of the
streams, by calling session_check_idle_conn() with a NULL session. It
could also be possible to have several NTLM responses on a single
connection for different sessions. In this case, the connection owner is
still updated without attaching the connection to its correct session,
which would ultimately cause a crash in session_check_idle_conn() with
an invalid session.
Here are two backtrace examples from GDB for such cases :
Thread 1 (Thread 0x7ff73e9fc700 (LWP 648859)):
#0 session_check_idle_conn (conn=0x7ff72f597800, sess=0x0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x000056151742be24 in sc_detach_endp (scp=scp@entry=0x7ff73e9f0f18) at src/stconn.c:376
#3 0x000056151742c208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000561517370871 in stream_free (s=s@entry=0x7ff72a2dbd80) at src/stream.c:728
#5 0x000056151737541f in process_stream (t=t@entry=0x7ff72d5e2620, context=0x7ff72a2dbd80, state=<optimized out>) at src/stream.c:2645
#6 0x0000561517456cbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff73e9f10d0) at src/task.c:632
#7 0x00005615174576b9 in process_runnable_tasks () at src/task.c:876
#8 0x000056151742275a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000561517422db1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff789e081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff789a39e73 in clone () from /lib64/libc.so.6
(gdb)
Thread 1 (Thread 0x7ff52e7fc700 (LWP 681458)):
#0 0x0000556ebd6e7e69 in session_check_idle_conn (conn=0x7ff5787ff100, sess=0x7ff51d2539a0) at include/haproxy/session.h:209
#1 h2_detach (sd=<optimized out>) at src/mux_h2.c:4520
#2 0x0000556ebd7f3e24 in sc_detach_endp (scp=scp@entry=0x7ff52e7f0f18) at src/stconn.c:376
#3 0x0000556ebd7f4208 in sc_destroy (sc=<optimized out>) at src/stconn.c:444
#4 0x0000556ebd738871 in stream_free (s=s@entry=0x7ff520e28200) at src/stream.c:728
#5 0x0000556ebd73d41f in process_stream (t=t@entry=0x7ff565783700, context=0x7ff520e28200, state=<optimized out>) at src/stream.c:2645
#6 0x0000556ebd81ecbb in run_tasks_from_lists (budgets=budgets@entry=0x7ff52e7f10d0) at src/task.c:632
#7 0x0000556ebd81f6b9 in process_runnable_tasks () at src/task.c:876
#8 0x0000556ebd7ea75a in run_poll_loop () at src/haproxy.c:2996
#9 0x0000556ebd7eadb1 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3195
#10 0x00007ff5752081ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff574e39e73 in clone () from /lib64/libc.so.6
(gdb)
To solve this issue, simply ignore NTLM responses when using a
multiplexer with streams support if the connection is not already
attached to the session. The connection is not marked as private and
will continue to be shared freely across clients. This is considered
conceptually valid as NTLM usage (RFC 4559) with HTTP is broken and was
designed only with HTTP/1.1 in mind. A side effect of the change is that
SESS_FL_PREFER_LAST is also no longer set on NTLM detection, which
allows subsequent requests to be load-balanced across several server
instances. A sketch of the resulting condition is shown below.
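Sketch of the adjusted check (the session-membership helper is
hypothetical):

    /* only mark the connection private for HTTP/1 muxes, or when it is
     * already attached to the session (H2 with http-reuse safe)
     */
    if (srv_conn &&
        (strcmp(srv_conn->mux->name, "H1") == 0 ||
         conn_is_in_session_list(srv_conn))) {  /* hypothetical */
        srv_conn->flags |= CO_FL_PRIVATE;
    }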
The original behavior is kept for HTTP/1 or if the connection is already
attached to the session. This last case happens when using HTTP/2 with
default http-reuse safe mode since the following patch :
This should be backported to all stable releases. Down to 2.4, it can be
taken as-is. For older versions, the above patch is not present; in that
case the condition should be restricted to HTTP/1 usage only :
if (srv_conn && strcmp(srv_conn->mux->name, "H1") == 0) {
Amaury Denoyelle [Wed, 20 Mar 2024 10:25:31 +0000 (11:25 +0100)]
BUG/MINOR: session: ensure conn owner is set after insert into session
A crash could occur if session_add_conn() temporarily failed when called
via h2_detach(). In this case, the connection owner was reset to NULL.
However, if this wasn't the last connection stream, the connection
wasn't destroyed. When h2_detach() was called again for another stream
and session_add_conn() succeeded this time, a crash occurred due to
session_check_idle_conn() being invoked with a NULL connection owner.
To fix this, ensure the connection owner is always set after
session_add_conn() succeeds, as in the sketch below.
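Schematically (assuming session_add_conn() returns non-zero on success):

    /* assign the owner only once the insertion is known to have
     * succeeded; on failure the connection simply stays unowned
     */
    if (session_add_conn(sess, conn, conn->target))
        conn->owner = sess;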
This bug is considered minor as the only failure reason for
session_add_conn() is a pool allocation issue.
This should be backported up to all stable releases.
BUG/MEDIUM: spoe: Return an invalid frame on recv if size is too small
Too-small frames must be detected on receive and an error must be
triggered. It is especially important for frames of size 0. Otherwise,
because the frame length is used as the return value (0 meaning the
frame must be ignored), the frame is silently skipped while the outgoing
data, the 4 bytes representing the frame size, are never consumed. If
the agent also closes the connection, this leads to a wakeup loop
because outgoing data are stuck and a shutdown is pending.
In addition, all pending outgoing data are now systematically skipped
when the applet is in the SPOE_APPCTX_ST_END state.
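A sketch of the added check (the minimal size constant is illustrative):

    /* a frame smaller than its mandatory header can never be valid;
     * return a negative value (error) rather than 0 (frame ignored),
     * otherwise the 4 length bytes would remain pending forever
     */
    if (framesz < MIN_FRAME_SIZE)   /* hypothetical minimum */
        return -1;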
The patch should fix the issue #2490. It must be backported to all stable
versions.