BUG/MEDIUM: dns: ensure ring offset is properly readjusted to head
Since the below patch, ring offset calculation for readers has changed.
commit d9c718863384e32307f65a9ce319dc362b73feb6
MEDIUM: ring: make the offset relative to the head/tail instead of absolute
For readers, this requires adjusting their offsets to be relative to the
ring head each time a read is resumed. Indeed, the buffer head can change any
time a ring_write() is performed after older entries were purged.
This operation was not performed in the DNS code, which caused the offset
to become invalid. In most cases, the following BUG_ON() was triggered :
Fix this by adjusting DNS reader offsets when entering
dns_session_io_handler() and dns_process_req().
This bug was reproduced by using a backend with 10 servers using SRV
record resolution on a single resolvers section. A BUG_ON() crash would
occur after less than 5 minutes of process execution.
This does not need to be backported as the above patch is not.
Willy Tarreau [Sat, 4 Mar 2023 14:33:24 +0000 (15:33 +0100)]
BUG/MAJOR: fd/thread: fix race between updates and closing FD
While running some L7 retries tests, Christopher and I stumbled upon a
very strange behavior showing some occasional server timeouts when the
server closes keep-alive connections quickly. The issue can be
reproduced with the following config:
global
    expose-experimental-directives
    #tune.fd.edge-triggered on # can speed up the issue

listen f
    bind :8001
    http-reuse always
    retry-on all-retryable-errors
    server next 127.0.0.1:8002

frontend b
    bind :8002
    timeout http-keep-alive 1 # one ms
    redirect location /
Sending fast requests without reusing the client connection on port 8001
with a single connection and at least 3 threads on haproxy occasionally
shows some glitches/pauses (below with timeout server 2s):
If this doesn't happen, limiting to a request rate close to 1/timeout
may help.
What is happening is that after several migrations, a late report
via fd_update_events() may detect that the thread is not welcome, and
will want to program an update so that the current thread's poller
disables its polling on it. It is allowed to do so because it used
fd_grab_tgid(). But what if _fd_delete_orphan() was just starting to
be called and already reset the update_mask ? We'll end up with a bit
present in the update mask, then _fd_delete_orphan() resets the tgid,
which will prevent the poller from consuming that update. The update
is not needed anymore since the FD was closed, but in this case nobody
will clear this bit until the same FD is reused again and cleared. And
as long as the thread's bit remains in the update_mask, no new updates
will be programmed for the next use of this FD on the same thread since
due to the bit being present, fd_nbupdt will not be changed. This is
what is causing this timeout.
The fix consists in making sure _fd_delete_orphan() waits for the
occasional watchers to leave, and to do this before clearing the
update_mask. This will be either fd_update_events() trying to check
its thread_mask, or the poller checking its updates, so that's pretty
short. But it definitely closes this race.
This fix is needed since the introduction of fd_grab_tgid(), hence 2.7.
Note that while testing the fix, another related issue concerning the
atomicity of running_mask vs thread_mask popped up and will have to be
fixed till 2.5 as part of another patch. It may make the tests for this
fix occasionally trigger a few BUG_ON() or face a null conn->subs in
sock_conn_iocb(), though these ones are much more difficult to trigger.
This is not caused by this fix.
BUG/MEDIUM: quic: do not crash when handling STREAM on released MUX
The MUX instance is released before its quic-conn counterpart. On
termination, an H3 GOAWAY is emitted to prevent the client from opening new
streams for this connection.
The quic-conn instance will stay alive until the data of all opened streams are
acknowledged. If the client tries to open a new stream during this
interval despite the GOAWAY, the quic-conn is responsible for requesting its
immediate closure with a STOP_SENDING + RESET_STREAM.
This behavior was already implemented but the received packet with the
new STREAM was never acknowledged. This was fixed with the following
commit :
commit 156a89aef8c63910502b266251dc34f648a99fae
BUG/MINOR: quic: acknowledge STREAM frame even if MUX is released
However, this patch introduces a regression as it did not skip the call
to qc_handle_strm_frm() despite the MUX instance being released. This
can cause a segfault when using qcc_get_qcs() on a released MUX
instance. To fix this, add a missing break statement which will skip
qc_handle_strm_frm() when the MUX instance is not initialized.
This crash was reproduced using a short "timeout client" and sending
several requests with a delay between them using a modified aioquic. It
produces a crash with the following backtrace :
#0 0x000055555594d261 in __eb64_lookup (x=4, root=0x7ffff4091f60) at include/import/eb64tree.h:132
#1 eb64_lookup (root=0x7ffff4091f60, x=4) at src/eb64tree.c:37
#2 0x000055555563fc66 in qcc_get_qcs (qcc=0x7ffff4091dc0, id=4, receive_only=1, send_only=0, out=0x7ffff780ca70) at src/mux_quic.c:668
#3 0x0000555555641e1a in qcc_recv (qcc=0x7ffff4091dc0, id=4, len=40, offset=0, fin=1 '\001', data=0x7ffff40c4fef "\001&") at src/mux_quic.c:974
#4 0x0000555555619d28 in qc_handle_strm_frm (pkt=0x7ffff4088e60, strm_frm=0x7ffff780cf50, qc=0x7ffff7cef000, fin=1 '\001') at src/quic_conn.c:2515
#5 0x000055555561d677 in qc_parse_pkt_frms (qc=0x7ffff7cef000, pkt=0x7ffff4088e60, qel=0x7ffff7cef6c0) at src/quic_conn.c:3050
#6 0x00005555556230aa in qc_treat_rx_pkts (qc=0x7ffff7cef000, cur_el=0x7ffff7cef6c0, next_el=0x0) at src/quic_conn.c:4214
#7 0x0000555555625fee in quic_conn_app_io_cb (t=0x7ffff40c1fa0, context=0x7ffff7cef000, state=32848) at src/quic_conn.c:4640
#8 0x00005555558a676d in run_tasks_from_lists (budgets=0x7ffff780d470) at src/task.c:596
#9 0x00005555558a725b in process_runnable_tasks () at src/task.c:876
#10 0x00005555558522ba in run_poll_loop () at src/haproxy.c:2945
#11 0x00005555558529ac in run_thread_poll_loop (data=0x555555d14440 <ha_thread_info+64>) at src/haproxy.c:3141
#12 0x00007ffff789ebb5 in ?? () from /usr/lib/libc.so.6
#13 0x00007ffff7920d90 in ?? () from /usr/lib/libc.so.6
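As an illustration only, the control flow at stake can be sketched in a few
lines of standalone C. The structures and function names below are simplified
stand-ins, not the actual quic_conn.c code:

    #include <stdio.h>

    /* Hypothetical, simplified stand-ins for the quic-conn / MUX state. */
    struct mux_ctx;
    struct conn_ctx {
        struct mux_ctx *mux;   /* NULL once the MUX instance is released */
    };

    enum frame_type { FT_PADDING, FT_STREAM };

    static void handle_frame(struct conn_ctx *qc, enum frame_type ft)
    {
        switch (ft) {
        case FT_STREAM:
            if (!qc->mux) {
                /* MUX already released: acknowledge the packet carrying
                 * the frame but do NOT look the stream up on the MUX.
                 * The 'break' below is the one the fix adds. */
                printf("STREAM received on released MUX: ack only\n");
                break;
            }
            /* normal path, equivalent of qc_handle_strm_frm()/qcc_get_qcs() */
            printf("dispatching STREAM frame to the MUX\n");
            break;
        default:
            break;
        }
    }

    int main(void)
    {
        struct conn_ctx qc = { .mux = NULL };
        handle_frame(&qc, FT_STREAM);
        return 0;
    }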
MINOR: quic: Send PING frames when probing Initial packet number space
In very rare cases, it is possible that the Initial packet number space
must be probed even if there are no more in-flight CRYPTO frames.
In such cases, a PING frame is sent into an Initial packet. As this
packet is ack-eliciting, it must be padded by the server. qc_do_build_pkt()
is modified to do so.
Take the opportunity of this patch to modify the trace for TX frames to
easily distinguish them from other frame-related traces.
BUG/MINOR: quic: Missing detections of amplification limit reached
Mark the connection as limited by the anti-amplification limit when trying to
probe the peer.
Wake up the connection PTO/loss detection timer as soon as a datagram is
received. This was done only when the datagram was dropped.
This fixes deadlock issues revealed by some interop runner tests.
BUG/MINOR: quic: Do not resend already acked frames
Some frames are marked as already acknowledged when coming from duplicated packets
whose original packet has been acknowledged. There is no need
to resend such packets or frames.
Implement qc_pkt_with_only_acked_frms() to detect packet with only
already acknowledged frames inside and use it from qc_prep_fast_retrans()
which selects the packet to be retransmitted.
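A minimal sketch of the selection logic, assuming a simplified packet/frame
representation (illustration only, not the actual qc_pkt_with_only_acked_frms()
implementation):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustration only: a packet holding an array of frames, each carrying
     * an "already acknowledged" flag. The real code walks the packet's frame
     * list, but the decision is the same. */
    struct frame  { bool acked; };
    struct packet { struct frame *frms; size_t nb_frms; };

    /* Return true when the packet only contains frames that were already
     * acknowledged, in which case there is nothing worth retransmitting. */
    static bool pkt_with_only_acked_frms(const struct packet *pkt)
    {
        for (size_t i = 0; i < pkt->nb_frms; i++)
            if (!pkt->frms[i].acked)
                return false;
        return true;
    }

    int main(void)
    {
        struct frame f[2] = { { .acked = true }, { .acked = true } };
        struct packet pkt = { .frms = f, .nb_frms = 2 };

        printf("skip retransmit: %d\n", pkt_with_only_acked_frms(&pkt));
        return 0;
    }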
BUG/MINOR: quic: Ensure not to retransmit packets with no ack-eliciting frames
Even if there is a check in callers of qc_prep_hdshk_fast_retrans() and
qc_prep_fast_retrans() to prevent retransmissions of packets with no ack-eliciting
frames, these two functions should pay attention not to do that, especially if
someone decides to modify their implementations in the future.
BUG/MINOR: quic: Remove force_ack for Initial,Handshake packets
This is an old bug which arrived with the commit below, probably due to a
misinterpretation of the RFC, where the desired effect was to acknowledge all the
handshake packets:
77ac6f566 BUG/MINOR: quic: Missing acknowledgments for trailing packets
This had the bad effect of acknowledging all the handshake packets, even the
ones which are not ack-eliciting.
Dump the secret used to derive the next one during a key update initiated by the
client, and dump the resulting new secret and the new key and IV to be used to
decrypt Application level packets.
Also add a trace when the key update is supposed to be initiated on haproxy side.
This has already helped in diagnosing an issue revealed by the key update interop
test with xquic as client.
Only the decryption of v2 Initial packets received from clients was impacted. There
is no issue encrypting v2 Initial packets. This is due to the fact that when the
version is negotiated, the client may send Initial packets in two versions (currently v1,
then v2). The selection was done for the TX path but not on the RX path.
Implement qc_select_tls_ctx() to select the correct TLS cipher context for all
types of packets and call this function before removing the header protection
and before deciphering the packet.
BUG/MINOR: quic: Ensure to be able to build datagrams to be retransmitted
When retransmitting datagrams with two coalesced packets inside, the second
packet was not taken into consideration when checking that there is enough space
left for the datagram, especially when limited by the anti-amplification limit.
BUG/MINOR: quic: Do not send too small datagrams (with Initial packets)
Before building a packet into a datagram, ensure there is sufficient space for at
least 1200 bytes. Also pad datagrams with only one ack-eliciting Initial packet
inside.
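For illustration, the padding rule boils down to the following standalone
sketch (the constant and helper below are local to the example, not taken from
the haproxy sources):

    #include <stdbool.h>
    #include <stdio.h>

    #define INITIAL_DGRAM_MINLEN 1200  /* RFC 9000 minimum for such datagrams */

    /* A datagram carrying an ack-eliciting Initial packet must be expanded
     * to at least 1200 bytes with PADDING frames. */
    static size_t padding_needed(size_t dgram_len, bool has_initial,
                                 bool ack_eliciting)
    {
        if (has_initial && ack_eliciting && dgram_len < INITIAL_DGRAM_MINLEN)
            return INITIAL_DGRAM_MINLEN - dgram_len;
        return 0;
    }

    int main(void)
    {
        /* e.g. a lone PING probe sent in an Initial packet */
        printf("PADDING bytes to add: %zu\n", padding_needed(60, true, true));
        return 0;
    }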
Anonymization mode has two CLI handlers "set anon <on|off>" and "set
anon global-key". The last one only requires admin level. However, as
cli_find_kw() is implemented, only the first handler will be retrieved
as they both start with the same prefix "set anon".
This has the effect of executing the wrong handler for "set anon
global-key", with an error message about an invalid keyword. To fix this,
the handler definitions have been separated into "set anon on" and "set
anon off" commands. This allows minimal changes while keeping
the same "set anon" prefix for each command.
Also take this opportunity to fix a reference to a non-existing "set
global-key" CLI handler in the documentation.
When a STREAM frame is re-emitted, it will point to the same stream
buffer as the original one. If an ACK is received for either one of
these frame, the underlying buffer may be freed. Thus, if the second
frame is declared as lost and schedule for retransmission, we must
ensure that the underlying buffer is still allocated or interrupt the
retransmission.
Stream buffer is stored as an eb_tree indexed by the stream ID. To avoid
to lookup over a tree each time a STREAM frame is re-emitted, a lost
STREAM frame is flagged as QUIC_FL_TX_FRAME_LOST.
In most cases, this code is functional. However, there are several
potential issues which may cause a segfault :
- when explicitly probing with a STREAM frame, the frame won't be
flagged as lost
- when splitting a STREAM frame during retransmission, the flag is not
copied
To fix both these cases, QUIC_FL_TX_FRAME_LOST flag has been converted
to a <dup> field in quic_stream structure. This field is now properly
copied when splitting a STREAM frame. Also, as this is now an inner
quic_frame field, it will be copied automatically on qc_frm_dup()
invocation thus ensuring that it will be set on probing.
This issue was encountered randomly with the following backtrace :
#0 __memmove_avx512_unaligned_erms ()
#1 0x000055f4d5a48c01 in memcpy (__len=18446698486215405173, __src=<optimized out>,
#2 quic_build_stream_frame (buf=0x7f6ac3fcb400, end=<optimized out>, frm=0x7f6a00556620,
#3 0x000055f4d5a4a147 in qc_build_frm (buf=buf@entry=0x7f6ac3fcb5d8,
#4 0x000055f4d5a23300 in qc_do_build_pkt (pos=<optimized out>, end=<optimized out>,
#5 0x000055f4d5a25976 in qc_build_pkt (pos=0x7f6ac3fcba10,
#6 0x000055f4d5a30c7e in qc_prep_app_pkts (frms=0x7f6a0032bc50, buf=0x7f6a0032bf30,
#7 qc_send_app_pkts (qc=0x7f6a0032b310, frms=0x7f6a0032bc50) at src/quic_conn.c:4184
#8 0x000055f4d5a35f42 in quic_conn_app_io_cb (t=0x7f6a0009c660, context=0x7f6a0032b310,
BUG/MINOR: ssl: Use 'date' instead of 'now' in ocsp stapling callback
In the OCSP response callback, instead of using the actual system date,
the scheduler's 'now' timer was used when checking a response's
validity.
This patch can be backported to all stable versions.
BUG/MINOR: ssl: Fix ocsp-update when using "add ssl crt-list"
When adding a new certificate through the CLI and appending it to a
crt-list with the 'ocsp-update' option set, the new certificate would
not be added to the OCSP response update list.
The only thing that was missing was the copy of the ocsp_update mode
from the ssl_bind_conf into the ckch_store's object.
An extra wakeup of the update task also needed to happen in case the
newly inserted entry needs to be updated before the next wakeup of the
task.
REGTESTS: ssl: Add test for new ocsp update cli commands
Add tests for the "show ssl ocsp-updates" cli command as well as the new
'base64' parameter that can be passed to the "show ssl ocsp-response"
command.
The options were placed after the filters, which does not work well and now
raises a warning. It did not break the regtest because the crt-lists
were not actually used by clients.
MINOR: ssl: Add global options to modify ocsp update min/max delay
The minimum and maximum delays between two automatic updates of a given
OCSP response can now be set via global options. This allows limiting the
update rate of OCSP responses for configurations that use many frontend
certificates with the ocsp-update option set, if the updates are deemed
too costly.
MINOR: ssl: Add way to dump ocsp response in base64
A new format option can be passed to the "show ssl ocsp-response" CLI
command to dump the contents of an OCSP response in base64. This is
needed because thanks to the new OCSP auto update mechanism, we could
end up using an OCSP response internally that was never provided by the
user.
MINOR: ssl: Increment OCSP update replay delay in case of failure
In case of successive OCSP update errors for a given OCSP response, the
retry delay will be multiplied by 2 for every new failure in order to
avoid retrying too often to update responses for which the responder is
unresponsive (for instance). The maximum delay will still be taken into
account, so the OCSP update requests will still be sent at least every
hour.
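A minimal sketch of this retry policy (variable and function names are
illustrative; the actual OCSP update code may differ):

    #include <stdio.h>

    /* Double the delay after each failed update without ever exceeding the
     * configured maximum delay between two updates. */
    static unsigned int next_update_delay(unsigned int cur, unsigned int max)
    {
        unsigned int next = cur * 2;

        if (next > max || next < cur)   /* also guards against overflow */
            next = max;
        return next;
    }

    int main(void)
    {
        unsigned int delay = 60, max = 3600;   /* seconds: 1 minute .. 1 hour */

        for (int failure = 1; failure <= 8; failure++) {
            delay = next_update_delay(delay, max);
            printf("after failure %d: retry in %us\n", failure, delay);
        }
        return 0;
    }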
MINOR: ssl: Use dedicated proxy and log-format for OCSP update
Instead of using the same proxy as other http client calls (through lua
for instance), the OCSP update will use a dedicated proxy which will
enable it to change the log format and log conditions (for instance).
This proxy will have the NOLOGNORM option and regular logging will be
managed by the update task itself because in order to dump information
related to OCSP updates, we need to control the moment when the logs are
emitted (instead of relying on the stream's life, which is decorrelated
from the update itself).
The update task then calls sess_log directly, which uses a dedicated
ocsp logformat that fetches specific OCSP data. Sess_log was preferred
to the more low-level app_log because it offers the strength of
"regular" sample fetches and allows adding generic information alongside
the OCSP ones in the log line.
In case of connection error (unreachable server for instance), a regular
httpclient log line will also be emitted. This line will have some extra
HTTP related info that can't be provided by the ocsp update logging
mechanism.
MINOR: ssl: Add sample fetches related to OCSP update
This patch adds a series of sample fetches that rely on the specified
OCSP update context structure. They will then be of use only in the
context of an ongoing OCSP update.
They cannot be used directly in the configuration so they won't be made
public. They will be used in the OCSP update's specific log format which
should be emitted by the update task itself in a future patch.
This command can be used to dump information about the entries contained
in the ocsp update tree. It will display one line per concerned OCSP
response and will contain the expected next update time as well as the
time of the last successful update, and the number of successful and
failed attempts.
MINOR: ssl: Add certificate's path to certificate_ocsp structure
In order to have some information about the frontend certificate when
dumping the contents of the ocsp update tree from the cli, we could
either keep a reference to a ckch_store in the certificate_ocsp
structure, which might cause some dangling reference problems, or
simply copy the path to the certificate in the ocsp response structure.
This latter solution was chosen because of its simplicity.
MINOR: ssl: Store specific ocsp update errors in response and update ctx
Those new specific error codes will make it possible to know a bit better what
went wrong during an OCSP update process. They will come in handy in
future sample fetches as well as in debugging means (via the cli or
future traces).
MINOR: ssl: Reinsert ocsp update entries later in case of unknown error
In case of allocation error during the construction of an OCSP request
for instance, we would have ended up reinserting the ocsp entry at the same
place in the ocsp update tree, which could potentially lead to an
"endless" loop of errors in ssl_ocsp_update_responses. In such a case,
entries are now reinserted further in the tree (1 minute later) in order
to avoid such a chain of allocation failures.
BUG/MINOR: mux-h1: Report a parsing error on abort with pending data
When an abort is detected before all headers were received, and if there are
pending incoming data, we must report a parsing error instead of a
connection abort. This way it will be able to be handled as an invalid
message by HTTP analyzers instead of an early abort with no message.
It is especially important to be accurate on L7 retries. Indeed, without this
fix, this case will be handled by the "empty-response" retries policy while a
retry on "junk-response" is more accurate.
BUG/MEDIUM: http-ana: Don't close request side when waiting for response
A recent fix (af124360e "BUG/MEDIUM: http-ana: Detect closed SC on opposite side
during body forwarding") was pushed to sync a side when the opposite one is in
closing state. However, sometimes the synchro is performed too early,
preventing an L7 retry from being performed.
Indeed, while the above fix is valid on the response side, on the request side,
if the response was not yet received, we must wait before closing.
So, to fix the fix, on the request side, we now at least wait for the response
to be received before finishing the request analysis. Of course, if there is an error,
an abort or anything wrong on the server side, the response analyser should
handle it.
This patch is related to #2061. No backport needed.
BUG/MINOR: http-ana: Do a L7 retry on read error if there is no response
A regression about "empty-response" L7 retry was introduced with the commit dd6496f591 ("CLEANUP: http-ana: Remove useless if statement about L7
retries").
The if statement was removed on a wrong assumption. Indeed, L7 retries on
status are now handled in the HTTP analysers. Thus, the stream-connector
(formerly the conn-stream, and before that the stream-interface) no longer
reports a read error to force a retry. But it is still possible to get a read
error with no response. In this case, we must perform a retry if
"empty-response" is enabled.
So the if statement is re-introduced, reverting the cleanup.
This patch should fix the issue #2061. It must be backported as far as 2.4.
BUG/MINOR: http-ana: Don't increment conn_retries counter before the L7 retry
When we are about to perform an L7 retry, we deal with the conn_retries
counter, to be sure we can retry. However, there is an issue here because
the counter is incremented before it is checked against the backend
limit. So, we can miss a connection retry.
Of course, we must invert both operations. The conn_retries counter must be
incremented after the check against the backend limit.
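The ordering issue can be summed up with the following standalone sketch,
using simplified stand-ins for the real http-ana code; with a backend limit of
3 retries, the buggy ordering only grants 2 of them:

    #include <stdbool.h>
    #include <stdio.h>

    struct stream_ctx { int conn_retries; };

    static bool l7_retry_allowed_buggy(struct stream_ctx *s, int be_retries)
    {
        s->conn_retries++;                    /* incremented too early... */
        return s->conn_retries < be_retries;  /* ...so one retry is lost */
    }

    static bool l7_retry_allowed_fixed(struct stream_ctx *s, int be_retries)
    {
        if (s->conn_retries >= be_retries)    /* check the limit first */
            return false;
        s->conn_retries++;                    /* then consume one retry */
        return true;
    }

    int main(void)
    {
        struct stream_ctx a = { 0 }, b = { 0 };
        int allowed_a = 0, allowed_b = 0;

        for (int i = 0; i < 5; i++) {
            allowed_a += l7_retry_allowed_buggy(&a, 3);
            allowed_b += l7_retry_allowed_fixed(&b, 3);
        }
        printf("retries granted: buggy=%d fixed=%d\n", allowed_a, allowed_b);
        return 0;
    }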
Amaury Denoyelle [Tue, 28 Feb 2023 14:11:26 +0000 (15:11 +0100)]
MINOR: quic: notify on send ready
This patch completes the previous one with poller subscribe of quic-conn
owned socket on sendto() error. This ensures that mux-quic is notified
if waiting on sending when a transient sendto() error is cleared. As
such, qc_notify_send() is called directly inside socket I/O callback.
qc_notify_send() internal conditions have thus been completed. This will
prevent notifying the upper layer until all sending conditions are fulfilled:
room in the congestion window and no transient error on the socket FD.
Amaury Denoyelle [Tue, 28 Feb 2023 14:11:09 +0000 (15:11 +0100)]
MEDIUM: quic: implement poller subscribe on sendto error
Prior to this patch, on a sendto() transient error, sending was simulated
and we relied on retransmission to retry sending. This could significantly
hurt performance.
Thanks to quic-conn owned socket support, it is now possible to improve
this. On transient error, sending is interrupted and quic-conn socket FD
is subscribed to the poller for sending. When sending is possible again,
quic_conn_sock_fd_iocb() will be in charge of restarting it.
A consequence of this change is on the return value of qc_send_ppkts().
This function will now return 0 on transient error if quic-conn has its
owned socket. This is used to interrupt sending in the calling function.
The flag QUIC_FL_CONN_TO_KILL must be checked to differentiate a fatal
error from a transient one.
Amaury Denoyelle [Tue, 28 Feb 2023 14:10:00 +0000 (15:10 +0100)]
MINOR: quic: purge txbuf before preparing new packets
Sending is implemented in two parts on quic-conn module. First, QUIC
packets are prepared in a buffer and then sendto() is called with this
buffer as input.
qc.tx.buf is used as the input buffer. It must always be empty before
starting to prepare new packets in it. Currently, this is guaranteed by
the fact that either sendto() completed, a fatal error was encountered
which prevents future sends, or a transient error was encountered and we
rely on retransmission to send the remaining data.
This will change when poller subscription of the socket FD on sendto()
transient error is implemented. In this case, qc.tx.buf will not be
emptied, so as to resume sending when the transient error is cleared. To allow
the current sending process to work as expected, a new function
qc_purge_txbuf() is implemented. It will try to send remaining data
before preparing new packets for sending. If successful, txbuf will be
emptied and sending can continue. If not, sending will be interrupted.
Amaury Denoyelle [Tue, 28 Feb 2023 14:08:59 +0000 (15:08 +0100)]
MINOR: quic: implement qc_notify_send()
Implement qc_notify_send(). This function is responsible for notifying the
upper layer subscribed on SUB_RETRY_SEND when sending conditions are back
to normal.
For the moment, this patch has no functional change as only congestion
window room is checked before notifying the upper layer. However, this
will be extended when poller subscription of the socket on sendto() error is
implemented. qc_notify_send() will thus be responsible for ensuring that
all conditions are met before waking up the upper layer.
Amaury Denoyelle [Tue, 28 Feb 2023 10:53:48 +0000 (11:53 +0100)]
MINOR: quic: simplify return path in send functions
This patch simply cleans up the return paths used in various send functions of
the quic-conn module. This will simplify the implementation of poller
subscription on sendto() error, which adds another error handling path.
Oto Valek [Fri, 24 Feb 2023 20:41:58 +0000 (21:41 +0100)]
REGTEST: added tests covering smp_fetch_hdr_ip()
Added new testcases for all 4 branches of smp_fetch_hdr_ip():
- a plain IPv4 address
- an IPv4 address with a port number
- a plain IPv6 address
- an IPv6 address wrapped in [] brackets
BUG/MINOR: http-check: Skip C-L header for empty body when it's not mandatory
The Content-Length header is always added into the request for an HTTP
health-check. However, when there is no payload, this header may be skipped
for OPTIONS, GET, HEAD and DELETE methods. In fact, it is a "SHOULD NOT" in
RFC 9110 (#8.6).
It is not really an issue in itself but it seems to be an issue for AWS
ELB. It returns a 400-Bad-Request if a HEAD/GET request with no payload
contains a Content-Length header.
So, it is better to skip this header when possible.
This patch should fix the issue #2026. It could be backported as far as 2.2.
BUG/MINOR: http-check: Don't set HTX_SL_F_BODYLESS flag with a log-format body
When the HTTP request of a health-check is forged, we must not pretend there
is no payload, by setting HTX_SL_F_BODYLESS, if a log-format body was
configured.
Indeed, a test on the body length was used but it is only valid for a plain
string. For a log-format string, a list is used. Note it is a bug with no
consequence for now.
BUG/MINOR: mux-h1: Don't report an error on an early response close
If the response is closed before any data was received, we must not report
an error to the SE descriptor. It is important to be able to retry on an
empty response.
This patch should fix the issue #2061. It must be backported to 2.7.
BUG/MEDIUM: connection: Clear flags when a conn is removed from an idle list
When a connection is removed from the safe list or the idle list,
CO_FL_SAFE_LIST and CO_FL_IDLE_LIST flags must be cleared. It is performed
when the connection is reused. But not when it is moved into the
toremove_conns list. It may be an issue because the multiplexer owning the
connection may be woken up before the connection is really removed. If the
connection flags are not sanitized, it may think the connection is idle and
reinsert it in the corresponding list. From this point, we can imagine
several bugs. A UAF or a connection reused with an invalid state for
instance.
To avoid any issue, the connection flags are sanitized when an idle
connection is moved into the toremove_conns list. The same is performed at
right places in the multiplexers. Especially because the connection release
may be delayed (for h2 and fcgi connections).
This patch should fix the issue #2057. It must carefully be backported as
far as 2.2. Especially on the 2.2 where the code is really different. But
some conflicts should be expected on the 2.4 too.
Amaury Denoyelle [Mon, 27 Feb 2023 16:31:55 +0000 (17:31 +0100)]
MINOR: quic: consider EBADF as critical on send()
EBADF on sendto() is considered as a fatal error. As such, it is removed
from the list of the transient errors. The connection will be killed
when encountered.
For the record, EBADF can be encountered on process termination with the
listener socket.
Amaury Denoyelle [Thu, 23 Feb 2023 10:18:38 +0000 (11:18 +0100)]
MEDIUM: quic: improve fatal error handling on send
Send is conducted through qc_send_ppkts() for a QUIC connection. There
are two types of errors which can be encountered on sendto() or affiliated
syscalls :
* transient error. In this case, sending is simulated with the remaining
data and retransmission process is used to have the opportunity to
retry emission
* fatal error. If this happens, the connection should be closed as soon
as possible. This is done via qc_kill_conn() function. Until this
patch, only ECONNREFUSED errno was considered as fatal.
Modify the QUIC send API to be able to differentiate transient and fatal
errors more easily. This is done by fixing the return value of the
sendto() wrapper qc_snd_buf() :
* on fatal error, a negative error code is returned. This is now the
case for every errno except EAGAIN, EWOULDBLOCK, ENOTCONN, EINPROGRESS
and EBADF.
* on a transient error, 0 is returned. This is the case for the listed
errno values above and also if a partial send has been conducted by
the kernel.
* on success, the return value of sendto() syscall is returned.
This commit will be useful to be able to handle transient error with a
quic-conn owned socket. In this case, the socket should be subscribed to
the poller and no simulated send will be conducted.
This commit allows errno management to be confined in the quic-sock
module which is a nice cleanup.
On a final note, EBADF should be considered as fatal. This will be the
subject of a next commit.
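Schematically, the convention can be illustrated with the standalone helper
below. This is only a sketch: the real qc_snd_buf() wraps sendto() itself and
also maps partial sends to 0, which is omitted here.

    #include <errno.h>
    #include <stdio.h>

    /* Classify a sendto() result according to the convention above:
     *   < 0 : fatal error, the connection must be killed
     *     0 : transient error, the caller subscribes/retries later
     *   > 0 : number of bytes sent
     */
    static int classify_send_result(int ret, int err)
    {
        if (ret >= 0)
            return ret;        /* success: bytes sent */

        switch (err) {
        case EAGAIN:
    #if EWOULDBLOCK != EAGAIN
        case EWOULDBLOCK:
    #endif
        case ENOTCONN:
        case EINPROGRESS:
        case EBADF:            /* moved to the fatal set by a later commit */
            return 0;          /* transient: retry later */
        default:
            return -1;         /* fatal: close the connection ASAP */
        }
    }

    int main(void)
    {
        printf("ECONNREFUSED -> %d\n", classify_send_result(-1, ECONNREFUSED));
        printf("EAGAIN       -> %d\n", classify_send_result(-1, EAGAIN));
        printf("sent 1200    -> %d\n", classify_send_result(1200, 0));
        return 0;
    }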
Willy Tarreau [Mon, 27 Feb 2023 10:27:38 +0000 (11:27 +0100)]
MINOR: tinfo: make thread_set functions return nth group/mask instead of first
thread_set_first_group() and thread_set_first_tmask() were modified
and renamed to instead return the number and mask of the nth group.
Passing zero continues to return the first one, but it will be more
convenient to use this way when building shards.
Willy Tarreau [Tue, 28 Feb 2023 09:25:57 +0000 (10:25 +0100)]
CLEANUP: listener: only store conn counts for local threads
The listeners have a thr_conn[] array indexed on the thread number that
is used during connection redispatching to know what threads are the least
loaded. Since we introduced thread groups, and based on the fact that a
listener may only belong to one group, there's no point storing counters
for all threads, we just need to store them for all threads in the group.
Doing so reduces the struct listener from 1500 to 632 bytes. This may be
backported to 2.7 to save a bit of resources.
Willy Tarreau [Mon, 27 Feb 2023 17:43:38 +0000 (18:43 +0100)]
BUG/MEDIUM: fd: make fd_delete() support being called from a different group
There's currently a problem affecting thread groups. Stopping a listener
from a different group than the one that runs this listener will trigger
the BUG_ON() in fd_delete(). This typically happens by issuing "disable
frontend f" on the CLI for the following config since the CLI runs on
group 1:
global
nbthread 2
thread-groups 2
stats socket /tmp/sock1 level admin
frontend f
mode http
bind abns@frt-sock thread 2
This happens because ABNS sockets cannot be suspended, so here this
requires a full stop.
A first approach would consist in isolating the caller during such rare
operations but it turns out that fd_delete() is not robust against even
such calling conditions, because it uses its own thread mask with an FD
that may be in a different group, and even though the threads would be
isolated and running_mask should be zero, we must not mix thread masks
from different groups like this.
A better solution consists in replacing the bug condition detection with
a self-protection. After all it's not trivial to figure all likely call
places, and forcing upper layers to protect the code is not clean if we
can do it at the bottom. Thus this is what is being done now. We detect
a thread group mismatch, and if so, we forcefully isolate ourselves and
entirely clean the socket. This has the merit of being much more robust
and easier to use (and harder to misuse). Given that such operations are
very rare (actually when they happen a crash follows), it's not a problem
to waste some time isolating the caller there.
This must be backported to 2.7, along with this previous patch:
BUG/MINOR: fd: used the update list from the fd's group instead of tgid
Willy Tarreau [Mon, 27 Feb 2023 17:35:39 +0000 (18:35 +0100)]
BUG/MINOR: fd: used the update list from the fd's group instead of tgid
In _fd_delete_orphan() we try to remove the FD from its update list
which is supposed to be the current thread group's. However the function
might be called from another group during stopping or under isolation,
so FD is not queued in the current group's update list but in its own
group's list. Let's retrieve the group from the FD instead of using
tgid.
This should have no impact on existing code since there is no code path
calling fd_delete() under thread isolation for now, and other cases are
blocked in fd_delete().
DOC: config: Clarify the meaning of 'hold' in the 'resolvers' section
This patch improves the 'hold' parameter description in the 'resolvers'
section to make it clearer. It now really explains the differences between all
statuses. Thanks to Nick Ramirez for this update.
This patch should solve the issue #1694. It could be backported to all
stable versions.
MEDIUM: mux-quic: Don't expect data from server as long as request is unfinished
As for the H1 and H2 stream, the QUIC stream now states it does not expect
data from the server as long as the request is unfinished. The aim is the
same. We must be sure to not trigger a read timeout on server side if the
client is still uploading data.
From the moment the end of the request is received and forwarded to upper
layer, the QUIC stream reports it expects to receive data from the opposite
endpoint. This re-enables read timeout on the server side.
MEDIUM: mux-h2: Don't expect data from server as long as request is unfinished
As for the H1 stream, the H2 stream now states it does not expect data from
the server as long as the request is unfinished. The aim is the same. We
must be sure to not trigger a read timeout on server side if the client is
still uploading data.
From the moment the end of the request is received and forwarded to upper
layer, the H2 stream reports it expects to receive data from the opposite
endpoint. This re-enables read timeout on the server side.
MEDIUM: mux-h1: Don't expect data from server as long as request is unfinished
On client side, as long as the request is unfinished, the H1 stream states
it does not expect data from the server. It does not mean the server must
not send its response, but only that it may wait to receive the whole request
with no risk of triggering a read timeout.
When the request is finished, the H1 stream reports it expects to receive
data from the opposite endpoint.
The purpose of this patch is to never report a server timeout on receive if
the client is still uploading data. This way, it is possible to have a
smaller server timeout than the client one.
BUG/MEDIUM: stconn: Report a blocked send if some output data are not consumed
Instead of reporting a blocked send if nothing is sent, we do it if some
output data remain blocked after a write attempt or after a call to the
applet's I/O handler. It is mandatory to properly handle write timeouts.
Indeed, if an endpoint is blocked for a while but has partially consumed its
output data, no timeout is triggered. It is especially true for
connections. But the same may happen for an applet, and there is no reason
not to handle it the same way.
Of course, if the endpoint decides to partially consume output data because
it must wait to move on for any reason, it should use the se/applet API
(se/applet_will_consume(), se/applet_wont_consume() and
se/applet_need_more_data()).
This bug was introduced during the channels timeouts refactoring. No
backport is needed.
MINOR: stconn: Report a send activity when endpoint is willing to consume data
When the endpoint (applet or mux) is now willing to consume data while it
said it wouldn't, a send activity is reported. Indeed, writes were
blocked because of the endpoint. It is now ready to consume outgoing
data. So a send activity must be reported to reset the corresponding timers.
Concretely, when the SE_FL_WONT_CONSUME flag is removed, a send activity is
reported.
MEDIUM: stream: Eventually handle stream timeouts when exiting process_stream()
When we exit from process_stream(), if the task is expired, we try to handle
the stream timeouts and we resync the stream-connectors. This avoids a
useless immediate wakeup. It is not really an issue, but it is a small
improvement in edge cases.
MINOR: stream: Handle stream's timeouts in a dedicated function
This will be mandatory to be able to handle stream's timeouts before exiting
process_stream(). So, to not duplicate code, all this stuff is moved in a
dedicated function.
BUG/MINOR: stream: Remove BUG_ON about the task expiration in process_stream()
At the end of process_stream(), A BUG_ON was recently added to abort if we
leave the function with an expired task. However, it may happen if an event
prevents the timeout from being handled but nothing evolved. In this case, the
task expiration is not updated and we expect to catch the timeout on the
immediate task wakeup.
BUG/MEDIUM: h1-htx: Never copy more than the max data allowed during parsing
A bug during H1 data parsing may lead to copy more data than the maximum
allowed. The bug is an overflow on this max threshold when it is lower than
the size of an htx_blk structure.
At first glance, it means it is possible to not respect the buffer's
reserve. So it may lead to rewrite errors but it may also block any progress
on the stream if the compression is enabled. In this case, the channel
buffer appears as full and the compression must wait for space to
proceed. Outside of any bug, it is only possible when there are outgoing
data to forward, so the compression filter just waits. Because of this bug,
there is nothing to forward. The buffer is just full of input data. Thus
nothing moves and the stream is infinitely blocked.
To fix the bug, we must be sure to be able to create an HTX block of 1 byte
without exceeding the maximum allowed.
This patch should fix the issue #2053. It must be backported as far as 2.5.
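The kind of wrap-around at stake can be reproduced with a few lines of
standalone C (the structure below is an 8-byte stand-in, not the real htx_blk
definition):

    #include <stdint.h>
    #include <stdio.h>

    struct fake_htx_blk { uint32_t info; uint32_t addr; };

    int main(void)
    {
        size_t max = 5;                            /* free room left */
        size_t overhead = sizeof(struct fake_htx_blk);

        size_t buggy = max - overhead;             /* wraps to a huge value */
        size_t fixed = (max > overhead) ? max - overhead : 0;

        printf("buggy copy limit: %zu\n", buggy);
        printf("fixed copy limit: %zu\n", fixed);
        return 0;
    }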
BUG/MEDIUM: fd: avoid infinite loops in fd_add_to_fd_list and fd_rm_from_fd_list
With 4d9888c ("CLEANUP: fd: get rid of the __GET_{NEXT,PREV} macros") some
"volatile" keywords were dropped at various assignment places in fd's code.
In fd_add_to_fd_list() and fd_rm_from_fd_list(), because of the absence of
the "volatile" keyword: the compiler was able to perform some code
optimizations that prevented prev and next variables from being reloaded
between locking attempts (goto loop).
The result was that fd_add_to_fd_list() and fd_rm_from_fd_list() could enter in
infinite loops, preventing other threads from working further and ultimately
resulting in the watchdog being triggered as described in GH #2011.
To fix this, we made sure to re-audit 4d9888c in order to restore the required
memory barriers / compilers hints to prevent the compiler from mis-optimizing
the code around the fd's locks.
That is: using atomic loads to fetch the prev and next values, and restoring
the "volatile" cast for cur_list.ent variable assignment in fd_rm_from_fd_list()
Big thanks to @xanaxalan for his help and patience and to @wtarreau for his
findings and explanations in regard to compiler's optimizations.
This must be backported in 2.7 with 4d9888c ("CLEANUP: fd: get rid of the
__GET_{NEXT,PREV} macros")
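A minimal standalone model of the reload problem described above, written with
C11 atomics rather than haproxy's own atomic helpers: the shared value must be
re-read on every locking attempt, which a plain cached read does not guarantee
once the compiler optimizes the loop.

    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int shared_next = -2;   /* -2: "locked by another thread" */

    static int wait_for_unlock(void)
    {
        int next;

        for (;;) {
            /* atomic load: the shared value is re-read on each iteration */
            next = atomic_load_explicit(&shared_next, memory_order_acquire);
            if (next != -2)
                return next;   /* lock released, value usable */
            /* in the real code: retry the locking attempt (goto loop) */
        }
    }

    int main(void)
    {
        /* single-threaded demo: pretend another thread released the lock */
        atomic_store(&shared_next, 7);
        printf("next = %d\n", wait_for_unlock());
        return 0;
    }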
This patch adds support for the HAPROXY_BRANCH environment variable.
It can be useful if some resources are loaded from different
locations when migrating from one version to another.
Willy Tarreau [Wed, 22 Feb 2023 14:15:41 +0000 (15:15 +0100)]
CLEANUP: ring: remove the now unused ring's offset
Since the previous patch, the ring's offset is not used anymore. The
haring utility remains backward-compatible since it can trust the
buffer element that's at the beginning of the map and which still
contains all the valid data.
Willy Tarreau [Wed, 22 Feb 2023 13:50:14 +0000 (14:50 +0100)]
MEDIUM: ring: make the offset relative to the head/tail instead of absolute
The ring's offset currently contains a perpetually growing cursor which
is the number of bytes written from the start. It's used by readers to
know where to (re)start reading from. It was made absolute because both
the head and the tail can change during writes and we needed a fixed
position to know where the reader was attached. But this is complicated,
error-prone, and limits the ability to reduce the lock's coverage. In
fact what is needed is to know where the reader is currently waiting, if
at all. And this location is exactly where it stored its count, so the
absolute position in the buffer (the seek offset from the first storage
byte) does represent exactly this, as it doesn't move (we don't realign
the buffer), and is stable regardless of how head/tail changes with writes.
This patch modifies this so that the application code now uses this
representation instead. The most noticeable change is the initialization,
where we've kept ~0 as a marker to go to the end, and it's now set to
the tail offset instead of trying to resolve the current write offset
against the current ring's position.
The offset was also used at the end of the consuming loop, to detect
if a new write had happened between the lock being released and taken
again, so as to wake the consumer(s) up again. For this we used to
take a copy of the ring->ofs before unlocking and comparing with the
new value read in the next lock. Since it's not possible to write past
the current reader's location, there's no risk of complete rollover, so
it's sufficient to check if the tail has changed.
Note that the change also has an impact on the haring consumer which
needs to adapt as well. But that's good in fact because it will rely
on one less variable, and will use offsets relative to the buffer's
head, and the change remains backward-compatible.
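As a rough standalone model of the new scheme (names are illustrative, not the
actual ring.c code): head, tail and the reader's position are all seek offsets
inside the storage area, which is never realigned, so the reader's offset stays
valid across writes and the amount left to consume is a simple modular distance
to the tail.

    #include <stdio.h>

    struct ring_model {
        size_t size;   /* storage size */
        size_t head;   /* seek offset of the oldest byte */
        size_t tail;   /* seek offset where the next write lands */
    };

    /* Bytes still to consume for a reader parked at seek offset <ofs>. */
    static size_t ring_pending(const struct ring_model *r, size_t ofs)
    {
        return (r->tail + r->size - ofs) % r->size;
    }

    int main(void)
    {
        struct ring_model r = { .size = 1024, .head = 100, .tail = 300 };
        size_t reader_ofs = (size_t)-1;     /* "~0" marker: attach at the tail */

        if (reader_ofs == (size_t)-1)
            reader_ofs = r.tail;            /* start with new data only */

        printf("pending after attach: %zu\n", ring_pending(&r, reader_ofs));

        r.tail = 50;                        /* writes wrapped past the end */
        printf("pending after wrap:   %zu\n", ring_pending(&r, reader_ofs));
        return 0;
    }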
Willy Tarreau [Wed, 22 Feb 2023 14:36:03 +0000 (15:36 +0100)]
BUG/MINOR: ring: do not realign ring contents on resize
If a ring is resized, we must not zero its head since the contents
are preserved in-situ. Till now it used to work because we only resize
during boot and we emit very few data (if at all) during boot. But this
can change in the future.
This can be backported to 2.2 though no older version should notice a
difference.
BUILD: quic: 32-bits compilation issue with %zu in quic_rx_pkts_del()
This issue arrived with this commit: 1dbeb35f8 MINOR: quic: Add new traces about by connection RX buffer handling
and revealed by the GH CI as follows:
src/quic_conn.c: In function 'quic_rx_pkts_del':
include/haproxy/trace.h:134:65: error: format '%zu' expects argument of type 'size_t',
but argument 6 has type 'uint64_t' {aka 'long long unsigned int'} [-Werror=format=]
_msg_len = snprintf(_msg, sizeof(_msg), (fmt), ##args);
Replace all %zu printf integer formats by %llu.
Must be backported to 2.7 where the previous commit is supposed to be backported.
MINOR: haproxy: always protocol unbind on startup error path
In haproxy startup, all init error paths after the protocol bind step
cautiously call protocol_unbind_all() before exiting except one that was
conditional. We're not making an exception to the rule and we now properly
call protocol_unbind_all() as well.
MINOR: proto_ux: ability to dump ABNS names in error messages
In sock_unix_bind_receiver(), uxst_bind_listener() and uxdg_bind_listener(),
properly dump ABNS socket names by leveraging sa2str() function which does the
hard work for us.
UNIX sockets are reported as is (unchanged) while ABNS UNIX sockets
are prefixed with 'abns@' to match the syntax used in config file.
(they were previously showing as empty strings because of the leading
NULL-byte that was not properly handled in this case)
This is only a minor debug improvement, however it could be useful to
backport it up to 2.4.
[for 2.4: you should replace "%s [%s]" by "%s for [%s]" for uxst and uxdg if
you want the patch to apply properly]
MEDIUM: proto_ux: properly suspend named UNIX listeners
When a listener is suspended, we expect that it may not process any data for
the time it is suspended.
Yet for named UNIX sockets, as the suspend operation is a no-op at the proto
level, recv events on the socket may still be processed by the polling loop.
This is quite disturbing as someone may rely on a paused proxy being harmless,
which is true for all protos except for named UNIX sockets.
To fix this behavior, we explicitly disable I/O recv events when suspending a
named UNIX socket listener (we call the disable() method on the listener).
The io recv events will automatically be restored when the listener is resumed
since the l->enable() method is called at the end of the resume() operation.
This could be backported up to 2.4 after a reasonable observation
period to make sure that this change doesn't cause unwanted side-effects.
BUG/MINOR: sock_unix: match finalname with tempname in sock_unix_addrcmp()
In sock_unix_addrcmp(), named UNIX sockets paths are manually compared in
order to properly handle tempname paths (ending with ".XXXX.tmp") that result
from the 2-step bind implemented in sock_unix_bind_receiver().
However, this logic does not take into account "final" path names (without the
".XXXX.tmp" suffix).
Example:
/tmp/test did not match with /tmp/test.1288.tmp prior to this patch
Indeed, depending on how the socket addr is retrieved, the same socket
could be designated either by its tempname or finalname.
socket addr is normally stored with its finalname within a receiver, but a
call to getsockname() on the same socket will return the tempname that was
used for the bind() call (sock_get_old_sockets() depends on getsockname()).
This causes sock_find_compatible_fd() to malfunction with named UNIX
sockets (ie: haproxy -x CLI option).
To fix this, we slightly modify the check around the temp suffix in
sock_unix_addrcmp(): we perform the suffix check even if one of the
paths is lacking the temp suffix (with proper precautions).
Now the function is able to match:
- finalname x finalname
- tempname x tempname
- finalname x tempname
That is: /tmp/test == /tmp/test.1288.tmp == /tmp/test.X.tmp
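The matching rule can be sketched in standalone C as follows (a simplified
illustration, not the actual sock_unix_addrcmp() code):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Length of the path once an optional ".<digits>.tmp" suffix is ignored. */
    static size_t base_len(const char *path)
    {
        size_t len = strlen(path);

        if (len > 4 && strcmp(path + len - 4, ".tmp") == 0) {
            size_t i = len - 4;

            while (i > 0 && isdigit((unsigned char)path[i - 1]))
                i--;
            if (i > 0 && i < len - 4 && path[i - 1] == '.')
                return i - 1;   /* strip ".<digits>.tmp" */
        }
        return len;
    }

    static int unix_paths_match(const char *a, const char *b)
    {
        size_t la = base_len(a), lb = base_len(b);

        return la == lb && strncmp(a, b, la) == 0;
    }

    int main(void)
    {
        printf("%d\n", unix_paths_match("/tmp/test", "/tmp/test.1288.tmp"));       /* 1 */
        printf("%d\n", unix_paths_match("/tmp/test.1288.tmp", "/tmp/test.42.tmp"));/* 1 */
        printf("%d\n", unix_paths_match("/tmp/test", "/tmp/other"));               /* 0 */
        return 0;
    }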
BUG/MEDIUM: listener/proxy: fix listeners notify for proxy resume
In 58651b42f ("MEDIUM: listener/proxy: make the listeners notify about
proxy pause/resume") we introduced the logic for pause/resume notify using
li_ready for pause and li_paused for resume.
Unfortunately, relying on li_paused for resume doesn't work reliably if we
resume a listener which is only made of receivers that are completely stopped.
For example, this could happen with receivers that don't support the
LI_PAUSED state like ABNS sockets.
This is especially true since pause_listener() was renamed to
suspend_listener() to better reflect its actual behavior in
("MINOR: listener: pause_listener() becomes suspend_listener())
To fix this, we now rely on the li_suspended state in resume_listener() to make
sure that suspend_listener() and resume_listener() notify messages are
consistent to each other:
"Proxy pause" is triggered when there are no more ready listeners.
"Proxy resume" is triggered when there are no more suspended listeners.
Also, we make use of the new PR_FL_PAUSED proxy flag to make sure we don't
report the same event twice.
This could be backported up to 2.4 after a reasonable observation
period to make sure that this change doesn't cause unwanted side-effects.
--
Backport notes:
This commit depends on:
- "MINOR: listener: pause_listener() becomes suspend_listener()"
-> 2.4 only, as "MINOR: proxy/listener: support for additional PAUSED state"
was not backported:
We are simply renaming pause_listener() to suspend_listener() to prevent
confusion around listener pausing.
A suspended listener can be in two different valid states:
- LI_PAUSED: the listener is effectively paused, it will unpause on
resume_listener()
- LI_ASSIGNED (not bound): the listener does not support the LI_PAUSED
state, so it was unbound to satisfy the suspend request, it will
correctly re-bind on resume_listener()
Besides that, we add the LI_F_SUSPENDED flag to mark suspended listeners in
suspend_listener() and unmark them in resume_listener().
We're also adding li_suspend proxy variable to track the number of currently
suspended listeners:
That is, the number of listeners that were suspended through suspend_listener()
and that are either in LI_PAUSED or LI_ASSIGNED state.
Counter is increased on successful suspend in suspend_listener() and it is
decreased on successful resume in resume_listener()
--
Backport notes:
-> 2.4 only, as "MINOR: proxy/listener: support for additional PAUSED state"
was not backported:
Replace this:
| /* PROXY_LOCK is require
| proxy_cond_resume(px);
-> 2.6 and 2.7 only, as "MINOR: listener: make sure we don't pause/resume" was
custom patched:
Replace this:
|@@ -253,6 +253,7 @@ struct listener {
|
| /* listener flags (16 bits) */
| #define LI_F_FINALIZED 0x0001 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+#define LI_F_SUSPENDED 0x0002 /* listener has been suspended using suspend_listener(), it is either is LI_PAUSED or LI_ASSIGNED state */
|
| /* Descriptor for a "bind" keyword. The ->parse() function returns 0 in case of
| * success, or a combination of ERR_* flags if an error is encountered. The
By this:
|@@ -222,6 +222,7 @@ struct li_per_thread {
|
| #define LI_F_QUIC_LISTENER 0x00000001 /* listener uses proto quic */
| #define LI_F_FINALIZED 0x00000002 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+#define LI_F_SUSPENDED 0x00000004 /* listener has been suspended using suspend_listener(), it is either is LI_PAUSED or LI_ASSIGNED state */
|
| /* The listener will be directly referenced by the fdtab[] which holds its
| * socket. The listener provides the protocol-specific accept() function to
BUG/MEDIUM: resume from LI_ASSIGNED in default_resume_listener()
Since fc974887c ("MEDIUM: protocol: explicitly start the receiver before
the listener"), resume from LI_ASSIGNED state does not work anymore.
This is because the binding part has been divided into 2 distinct steps
since: first bind(), then listen().
This new logic was properly implemented in startup sequence
through protocol_bind_all() but wasn't properly reported in
default_resume_listener() function.
Fixing default_resume_listener() to comply with the new logic.
This should help ABNS sockets to properly rebind in resume_listener()
after they have been stopped by pause_listener():
See Redmine:4475 for more context.
This commit depends on:
- "MINOR: listener: workaround for closing a tiny race between resume_listener() and stopping"
- "MINOR: listener: make sure we don't pause/resume bypassed listeners"
This could be backported up to 2.4 after a reasonable observation period to
make sure that this change doesn't cause unwanted side-effects.
BUG/MEDIUM: listener: fix pause_listener() suspend return value handling
Legacy suspend() return value handling in pause_listener() has been altered
over the time.
First with fb76bd5ca ("BUG/MEDIUM: listeners: correctly report pause() errors")
Then with e03204c8e ("MEDIUM: listeners: implement protocol level
->suspend/resume() calls")
We aim to restore original function behavior and comply with resume_listener()
function description.
This is required for resume_listener() and pause_listener() to work as a whole
Now, it is made explicit that pause_listener() may stop a listener if the
listener doesn't support the LI_PAUSED state (depending on the protocol
family, ie: ABNS sockets), in this case l->state will be set to LI_ASSIGNED
and this won't be considered as an error.
This could be backported up to 2.4 after a reasonable observation period
to make sure that this change doesn't cause unwanted side-effects.
--
Backport notes:
This commit depends on:
- "MINOR: listener: make sure we don't pause/resume bypassed listeners"
-> 2.4: manual change required because "MINOR: proxy/listener: support
for additional PAUSED state" was not backported: the contextual patch
lines don't match.
Replace this:
| if (px && !px->li_ready) {
| /* PROXY_LOCK is required */
MINOR: listener: make sure we don't pause/resume bypassed listeners
Some listeners are kept in LI_ASSIGNED state but are not supposed to be
started since they were bypassed on initial startup (eg: in protocol_bind_all()
or in enable_listener()...)
Introduce the LI_F_FINALIZED flag: when the variable is non
zero it means that the listener made it past the LI_LISTEN state (finalized)
at least once so we can safely pause / resume. This way we won't risk starting
a previously bypassed listener which never made it that far and thus was not
expected to be lazy-started by accident.
As listener_pause() and listener_resume() are currently partially broken, such
unexpected lazy-start won't happen. But we're trying to restore pause() and
resume() behavior so this patch will be required before going any further.
We had to re-introduce the listener's 'flags' struct member since it was recently
moved into the bind_conf struct. But here we do have a legitimate need for these
listener-only flags.
This should only be backported if explicitly required by another commit.
--
Backport notes:
-> 2.4 and 2.5:
The 2-bytes hole we're using in the current patch does not apply, let's
use the 4-byte hole located under the 'option' field.
Replace this:
|@@ -226,7 +226,8 @@ struct li_per_thread {
| struct listener {
| enum obj_type obj_type; /* object type = OBJ_TYPE_LISTENER */
| enum li_state state; /* state: NEW, INIT, ASSIGNED, LISTEN, READY, FULL */
|- /* 2-byte hole here */
|+ uint16_t flags; /* listener flags: LI_F_* */
| int luid; /* listener universally unique ID, used for SNMP */
| int nbconn; /* current number of connections on this listener */
| unsigned int thr_idx; /* thread indexes for queue distribution : (t2<<16)+t1 */
By this:
|@@ -209,6 +209,8 @@ struct listener {
| short int nice; /* nice value to assign to the instantiated tasks */
| int luid; /* listener universally unique ID, used for SNMP */
| int options; /* socket options : LI_O_* */
|+ uint16_t flags; /* listener flags: LI_F_* */
|+ /* 2-bytes hole here */
| __decl_thread(HA_RWLOCK_T lock);
|
| struct fe_counters *counters; /* statistics counters */
-> 2.4 only:
We need to adjust some contextual lines.
Replace this:
|@@ -477,7 +478,7 @@ int pause_listener(struct listener *l, int lpx, int lli)
| if (!lli)
| HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
|- if (l->state <= LI_PAUSED)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state <= LI_PAUSED)
| goto end;
|
| if (l->rx.proto->suspend)
By this:
|@@ -477,7 +478,7 @@ int pause_listener(struct listener *l, int lpx, int lli)
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
| goto end;
|
|- if (l->state <= LI_PAUSED)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state <= LI_PAUSED)
| goto end;
|
| if (l->rx.proto->suspend)
And this:
|@@ -535,7 +536,7 @@ int resume_listener(struct listener *l, int lpx, int lli)
| if (MT_LIST_INLIST(&l->wait_queue))
| goto end;
|
|- if (l->state == LI_READY)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state == LI_READY)
| goto end;
|
| if (l->rx.proto->resume)
By this:
|@@ -535,7 +536,7 @@ int resume_listener(struct listener *l, int lpx, int lli)
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
| goto end;
|
|- if (l->state == LI_READY)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state == LI_READY)
| goto end;
|
| if (l->rx.proto->resume)
-> 2.6 and 2.7 only:
struct listener 'flags' member still exists, let's use it.
Remove this from the current patch:
|@@ -226,7 +226,8 @@ struct li_per_thread {
| struct listener {
| enum obj_type obj_type; /* object type = OBJ_TYPE_LISTENER */
| enum li_state state; /* state: NEW, INIT, ASSIGNED, LISTEN, READY, FULL */
|- /* 2-byte hole here */
|+ uint16_t flags; /* listener flags: LI_F_* */
| int luid; /* listener universally unique ID, used for SNMP */
| int nbconn; /* current number of connections on this listener */
| unsigned int thr_idx; /* thread indexes for queue distribution : (t2<<16)+t1 */
Then, replace this:
|@@ -251,6 +250,9 @@ struct listener {
| EXTRA_COUNTERS(extra_counters);
| };
|
|+/* listener flags (16 bits) */
|+#define LI_F_FINALIZED 0x0001 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+
| /* Descriptor for a "bind" keyword. The ->parse() function returns 0 in case of
| * success, or a combination of ERR_* flags if an error is encountered. The
| * function pointer can be NULL if not implemented. The function also has an
By this:
|@@ -221,6 +221,7 @@ struct li_per_thread {
| };
|
| #define LI_F_QUIC_LISTENER 0x00000001 /* listener uses proto quic */
|+#define LI_F_FINALIZED 0x00000002 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|
| /* The listener will be directly referenced by the fdtab[] which holds its
| * socket. The listener provides the protocol-specific accept() function to
MINOR: listener: workaround for closing a tiny race between resume_listener() and stopping
This is an alternative fix that tries to address the same issue as d1ebee177 ("BUG/MINOR: listener: close tiny race between
resume_listener() and stopping") while allowing resume_listener() to be
more versatile.
Indeed, because of the previous fix, resume_listener() is not able to
rebind stopped listeners, and this breaks the original behavior that is
documented in the function description:
"If the listener was only in the assigned
state, it's totally rebound. This can happen if a pause() has completely
stopped it. If the resume fails, 0 is returned and an error might be
displayed."
With relax_listener(), we now make sure to check l->state under the
listener lock so we don't call resume_listener() when the conditions are not
met.
As such, concurrently stopped listeners may not be rebound using
relax_listener().
Note: the documented race can't happen since 1b927eb3c ("MEDIUM: proto: stop
protocols under thread isolation during soft stop"), but older versions are
concerned as 1b927eb3c was not marked for backports.
Moreover, the patch also prevents the race between protocol_pause_all() and
resuming from LIMITED or FULL states.
This commit depends on:
- "MINOR: listener: add relax_listener() function"
This should be backported with d1ebee177 up to 2.4
(d1ebee177 is marked to be backported for all stable versions but the current
patch does not apply for versions < 2.4)
There is a need for a small difference between resuming and relaxing
a listener.
When resuming, we expect that the listener may completely resume, this includes
unpausing or rebinding if required.
Resuming a listener is a best-effort operation: no matter the current state,
try our best to bring the listener up to the LI_READY state.
There are some cases where we only want to "relax" listeners that were
previously restricted using limit_listener() or listener_full() functions.
Here we don't want to resuscitate listeners, we're simply interested in
cancelling out the previous restriction.
To this day, listener_resume() on an unbound listener is broken, that's why
the need for this wasn't felt yet.
But we're trying to restore historical listener_resume() behavior, so we better
prepare for this by introducing an explicit relax_listener() function that
only does what is expected in such cases.
This commit depends on:
- "MINOR: listener/api: add lli hint to listener functions"
MINOR: listener/api: add lli hint to listener functions
Add listener lock hint (AKA lli) to (stop/resume/pause)_listener() functions.
All these functions implicitly take the listener lock when they are called:
It could be useful to be able to call them while already holding the lock, so
we're adding lli hint to make them take the lock only when it is missing.
This should only be backported if explicitly required by another commit
--
-> 2.4 and 2.5 common backport notes:
These 2 commits need to be backported first:
- 187396e34 "CLEANUP: listener: function comment typo in stop_listener()"
- a57786e87 "BUG/MINOR: listener: null pointer dereference suspected by
coverity"
-> 2.4 special backport notes:
In addition to the previously mentioned dependencies, the patch needs to be
slightly adapted to match the corresponding contextual lines:
Replace this:
|@@ -471,7 +474,8 @@ int pause_listener(struct listener *l, int lpx)
| if (!lpx && px)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &px->lock);
|
|- HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|+ if (!lli)
|+ HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
| if (l->state <= LI_PAUSED)
| goto end;
By this:
|@@ -471,7 +474,8 @@ int pause_listener(struct listener *l, int lpx)
| if (!lpx && px)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &px->lock);
|
|- HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|+ if (!lli)
|+ HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
| if ((global.mode & (MODE_DAEMON | MODE_MWORKER)) &&
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
BUG/MINOR: protocol: fix minor memory leak in protocol_bind_all()
In protocol_bind_all() (involved in startup sequence):
We only free errmsg (set by fam->bind() attempt) when we make use of it.
But this could lead to some memory leaks because there are some cases
where we ignore the error message (e.g: verbose=0 with ERR_WARN messages).
As long as errmsg is set, we should always free it.
As mentioned earlier, this really is a minor leak because it can only occur on
specific conditions (error paths) during the startup phase.
BUG/MINOR: proto_ux: report correct error when bind_listener fails
In uxst_bind_listener() and uxdg_bind_listener(), when the function
fails because the listener is not bound, both functions set
the error message but don't set the err status before returning.
Because of this, such an error is not properly handled by the upper functions.
Making sure this error is properly caught by returning a composition of
ERR_FATAL and ERR_ALERT.
REGTESTS: cache: Use rxresphdrs to only get headers for 304 responses
The 304 responses contain "Content-length" or "Transfer-encoding"
headers. The rxresp action expects to get a payload in this case, even though 304
responses must not have any payload. A workaround was added to remove these
headers from the 304 responses. However, a better solution is to only get
the response headers from clients using the rxresphdrs action.
If a payload is erroneously added in these responses, the scripts will fail
the same way. So it is safe.
The half-closed timeout is now directly retrieved from the proxy
settings. There is no longer any usage of the .hcto field in the stconn
structure, so let's remove it.
MINOR: stconn: Set half-close timeout using proxy settings
We now directly use the proxy settings to set the half-close timeout of a
stream-connector. The function sc_set_hcto() must be used to do so. This
timeout is only set when a shutw is performed. So it is not really a big
deal to use a dedicated function to do so.
MINOR: stconn: Always report READ/WRITE event on shutr/shutw
It was done by hand by callers when a shutdown for read or write was
performed. It is now always handled by the functions performing the
shutdown. This way the callers don't have to take care of it. This will avoid some
bugs.
MINOR: stream: Use relative expiration date in trace messages
Expiration dates in trace messages are now relative to now_ms. It is far
easier to read traces this way. And an expired date is now negative. So, it
is also easy to detect when a timeout was reached.
MINOR: stream: Report rex/wex value using the sedesc date in trace messages
Because read and write timeouts are now detected using the .lra and .fsb fields
of the stream-endpoint descriptor, it is better to also use these fields to
report read and write expiration date in trace messages. Especially because
old rex and wex fields will be removed.
MAJOR: stream: Use SE descriptor date to detect read/write timeouts
We stop using the channel's expiration dates to detect read and write
timeouts on the channels. We now rely on the stream-endpoint descriptor to
do so. All the stuff is handled in process_stream().
The stream relies on 2 helper functions to know if the receives or sends may
expire: sc_rcv_may_expire() and sc_snd_may_expire().
MINOR: applet/stconn: Add a SE flag to specify an endpoint does not expect data
An endpoint should now set SE_FL_EXP_NO_DATA flag if it does not expect any
data from the opposite endpoint. This way, the stream will be able to
disable any read timeout on the opposite endpoint. Applets should use
applet_expect_no_data() and applet_expect_data() functions to set or clear
the flag. For now, only dns and sink forwarder applets are concerned.