BUG/MINOR: quic: Possible leak of TX packets under heavy load
This bug could be reproduced with -dMfail and detected added a counter of TX packet
to the QUIC connection. When released calling quic_conn_release() the connection
should have a null counter of TX packets. This was not always the case.
This could occur during the handshake step: a first packet was built, then another
one should have followed in the same datagram, but fail due to a memory allocation
issue. As the datagram length and first TX packet were not written in the TX
buffer, this latter could not really be purged by qc_purge_tx_buf() even if
called. This bug occured only when building coalesced packets in the same datagram.
To fix this, write the packet information (datagram length and first packet
address) in the TX buffer before purging it.
BUG/MEDIUM: quic: Possible crash during retransmissions and heavy load
This bug could be reproduced with -dMfail and dectected by libasan as follows:
$ ASAN_OPTIONS=disable_coredump=0:unmap_shadow_on_exit=1:abort_on_error=f quic-freeze.cfg -dMfail -dMno-cache -dM0x55
=================================================================
==82989==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffc 0x560790cc4749 bp 0x7fff8e0e8e30 sp 0x7fff8e0e8e28
WRITE of size 8 at 0x7fff8e0ea338 thread T0
#0 0x560790cc4748 in qc_frm_free src/quic_frame.c:1222
#1 0x560790cc5260 in qc_release_frm src/quic_frame.c:1261
#2 0x560790d1de99 in qc_treat_acked_tx_frm src/quic_rx.c:312
#3 0x560790d1e708 in qc_ackrng_pkts src/quic_rx.c:370
#4 0x560790d22a1d in qc_parse_ack_frm src/quic_rx.c:694
#5 0x560790d25daa in qc_parse_pkt_frms src/quic_rx.c:988
#6 0x560790d2a509 in qc_treat_rx_pkts src/quic_rx.c:1373
#7 0x560790c72d45 in quic_conn_io_cb src/quic_conn.c:906
#8 0x560791207847 in run_tasks_from_lists src/task.c:596
#9 0x5607912095f0 in process_runnable_tasks src/task.c:876
#10 0x560791135564 in run_poll_loop src/haproxy.c:2966
#11 0x5607911363af in run_thread_poll_loop src/haproxy.c:3165
#12 0x56079113938c in main src/haproxy.c:3862
#13 0x7f92606edd09 in __libc_start_main ../csu/libc-start.c:308
#14 0x560790bcd529 in _start (/home/flecaille/src/haproxy/haproxy+0x
Address 0x7fff8e0ea338 is located in stack of thread T0 at offset 1032 i
#0 0x560790d29b52 in qc_treat_rx_pkts src/quic_rx.c:1341
This frame has 2 object(s):
[32, 48) 'ar' (line 1380)
[64, 1088) '_msg' (line 1368) <== Memory access at offset 1032 is inable
HINT: this may be a false positive if your program uses some custom stacnism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-scope src/quic_frame.c:1222 i
Shadow bytes around the buggy address:
0x100071c15410: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15420: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15430: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15440: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x100071c15450: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
=>0x100071c15460: f8 f8 f8 f8 f8 f8 f8[f8]f8 f8 f8 f8 f8 f8 f3 f3
0x100071c15470: f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 00 00
0x100071c15480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c15490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c154a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x100071c154b0: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f3 f3 f3
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==82989==ABORTING
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
Aborted (core dumped)
Note that a coredump could not always be produced with all compilers. This was
always the case with clang 11.
When allocating frames to be retransmitted from qc_dgrams_retransmit(), if they
could not be sent for any reason, they could remain attached to a local list to
qc_dgrams_retransmit() and trigger a crash with libasan when releasing the
original frames they were duplicated from.
To fix this, always release the frames which could not be sent during
retransmissions calling qc_free_frm_list() where needed.
MINOR: quic: Add traces to debug frames handling during retransmissions
This is really boring to not know why some retransmissions could not be done
from qc_prep_hpkts() which allocates frames, prepare packets and send them.
Especially to not know about if frames are not remaining allocated and
attached to list on the stack. This patch already helped in diagnosing
such an issue during "-dMfail" tests.
Willy Tarreau [Mon, 20 Nov 2023 18:30:42 +0000 (19:30 +0100)]
BUILD: log: silence a build warning when threads are disabled
Building without threads emits two warnings because the proxy pointer
is no longer used (only serves for the lock) since 2.9 commit 9a74a6cb1
("MAJOR: log: introduce log backends"). No backport is needed.
Amaury Denoyelle [Fri, 17 Nov 2023 15:23:43 +0000 (16:23 +0100)]
DEBUG: connection/flags: update flags for reverse HTTP
Add missing CO_FL_REVERSED and CO_FL_ACT_REVERSING flag definitions in
conn_show_flags(). These flags were introduced in this release with
reverse HTTP support.
Amaury Denoyelle [Mon, 20 Nov 2023 14:45:36 +0000 (15:45 +0100)]
MINOR: quic: remove unneeded QUIC specific stopping function
On CONNECTION_CLOSE reception/emission, QUIC connections enter CLOSING
state. At this stage, only CONNECTION_CLOSE can be reemitted and all
other exchanges are stopped.
Previously, on haproxy process stopping, if all QUIC connections were in
CLOSING state, they were released before their closing timer expiration
to not block the process shutdown. However, since a recent commit, the
closing timer has been shorten to a more reasonable delay. It is now
consider viable to respect connections closing state even on process
shutdown. As such, stopping specific code in QUIC connections idle timer
task was removed.
A specific function quic_handle_stopping() was implemented to notify
QUIC connections on shutdown from main() function. It should have been
deleted along the removal in QUIC idle timer task. This patch just does
this.
BUG/MEDIUM: quic: Possible crash for connections to be killed
The connections are flagged as "to be killed" asap when the peer has left
(detected by sendto() "Connection refused" errno) by qc_kill_conn(). This
function has to wakeup the idle timer task to release the connection (and the idle
timer and the idle timer task itself). Then if in the meantime the connection
was flagged as having to process some retransmissions, some packet could lead
to sendto() errors again with a call to qc_kill_conn(), this time with a released
idle timer task.
This bug could be detected by libasan as follows:
.AddressSanitizer:DEADLYSIGNAL
=================================================================
==21018==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x 560b5d898717 bp 0x7f9aaac30000 sp 0x7f9aaac2ff80 T3)
==21018==The signal is caused by a READ memory access.
==21018==Hint: address points to the zero page.
. #0 0x560b5d898717 in _task_wakeup include/haproxy/task.h:209
#1 0x560b5d8a563c in qc_kill_conn src/quic_conn.c:171
#2 0x560b5d97f832 in qc_send_ppkts src/quic_tx.c:636
#3 0x560b5d981b53 in qc_send_app_pkts src/quic_tx.c:876
#4 0x560b5d987122 in qc_send_app_probing src/quic_tx.c:910
#5 0x560b5d987122 in qc_dgrams_retransmit src/quic_tx.c:1397
#6 0x560b5d8ab250 in quic_conn_app_io_cb src/quic_conn.c:712
#7 0x560b5de41593 in run_tasks_from_lists src/task.c:596
#8 0x560b5de4333c in process_runnable_tasks src/task.c:876
#9 0x560b5dd6f2b0 in run_poll_loop src/haproxy.c:2966
#10 0x560b5dd700fb in run_thread_poll_loop src/haproxy.c:3165
#11 0x7f9ab9188ea6 in start_thread nptl/pthread_create.c:477
#12 0x7f9ab90a8a2e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0xfba2e)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV include/haproxy/task.h:209 in _task_wakeup
Thread T3 created by T0 here:
#0 0x7f9ab97ac2a2 in __interceptor_pthread_create ../../../../src/libsaniti zer/asan/asan_interceptors.cpp:214
#1 0x560b5df4f3ef in setup_extra_threads src/thread.c:252 o
#2 0x560b5dd730c7 in main src/haproxy.c:3856
#3 0x7f9ab8fd0d09 in __libc_start_main ../csu/libc-start.c:308 i
To fix, simply reset the connection flag QUIC_FL_CONN_RETRANS_NEEDED to cancel
the retransmission when qc_kill_conn is called.
Note that this new bug arrived with this fix which is correct and flagged as to be
backported as far as 2.6.
BUG/MINOR: quic: idle timer task requeued in the past
Amaury Denoyelle [Mon, 20 Nov 2023 13:56:49 +0000 (14:56 +0100)]
BUG/MAJOR: quic: complete thread migration before tcp-rules
A quic_conn is instantiated and tied on the first thread which has
received the first INITIAL packet. After handshake completion,
listener_accept() is called. For each quic_conn, a new thread is
selected among the least loaded ones Note that this occurs earlier if
handling 0-RTT data.
This thread connection migration is done in two steps :
* inside listener_accept(), on the origin thread, quic_conn
tasks/tasklet are killed. After this, no quic_conn related processing
will occur on this thread. The connection is flagged with
QUIC_FL_CONN_AFFINITY_CHANGED.
* as soon as the first quic_conn related processing occurs on the new
thread, the migration is finalized. This allows to allocate the new
tasks/tasklet directly on the destination thread.
This last step on the new thread must be done prior to other quic_conn
access. There is two events which may trigger it :
* a packet is received on the new thread. In this case,
qc_finalize_affinity_rebind() is called from quic_dgram_parse().
* the recently accepted connection is popped from accept_queue_ring via
accept_queue_process(). This will called session_accept_fd() as
listener.bind_conf.accept callback. This instantiates a new session
and start connection stack via conn_xprt_start(), which itself calls
qc_xprt_start() where qc_finalize_affinity_rebind() is used.
A condition was recently found which could cause a closing to be used
with qc_finalize_affinity_rebind() which is forbidden with a BUG_ON().
This lat step was not compatible with layer 4 rule such as "tcp-request
connection reject" which closes the connection early. In this case, most
of the body of session_accept_fd() is skipped, including
qc_xprt_start(), so thread migration is not finalized. At the end of the
function, conn_xprt_close() is then called which flags the connection as
CLOSING.
If a datagram is received for this connection before it is released,
this will call qc_finalize_affinity_rebind() which triggers its BUG_ON()
to prevent thread migration for CLOSING quic_conn.
FATAL: bug condition "qc->flags & ((1U << 29)|(1U << 30))" matched at src/quic_conn.c:2036
Thread 3 "haproxy" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7ffff794f700 (LWP 2973030)]
0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
2036 BUG_ON(qc->flags & (QUIC_FL_CONN_CLOSING|QUIC_FL_CONN_DRAINING));
(gdb) bt
#0 0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
#1 0x0000555555682463 in quic_dgram_parse (dgram=0x7fff5003ef10, from_qc=0x0, li=0x555555f38670) at src/quic_rx.c:2602
#2 0x0000555555651aae in quic_lstnr_dghdlr (t=0x555555fc4440, ctx=0x555555fc3f78, state=32832) at src/quic_sock.c:189
#3 0x00005555558c9393 in run_tasks_from_lists (budgets=0x7ffff7944c90) at src/task.c:596
#4 0x00005555558c9e8e in process_runnable_tasks () at src/task.c:876
#5 0x000055555586b7b2 in run_poll_loop () at src/haproxy.c:2966
#6 0x000055555586be87 in run_thread_poll_loop (data=0x555555d3d340 <ha_thread_info+64>) at src/haproxy.c:3165
#7 0x00007ffff7b59609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ffff7a7e133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
To fix this issue, ensure quic_conn migration is completed earlier
inside session_accept_fd(), before any tcp rules processing. This is
done by moving qc_finalize_affinity_rebind() invocation from
qc_xprt_start() to qc_conn_init().
Willy Tarreau [Mon, 20 Nov 2023 10:43:52 +0000 (11:43 +0100)]
BUILD: cache: fix build error on older compilers
pre-c99 compilers will fail to build the cache since commit 48f81ec09
("MAJOR: cache: Delay cache entry delete in reserve_hot function") due
to an int declaration in the for loop. No backport is needed.
Willy Tarreau [Mon, 20 Nov 2023 09:44:21 +0000 (10:44 +0100)]
BUG/MINOR: sock: mark abns sockets as non-suspendable and always unbind them
In 2.3, we started to get a cleaner socket unbinding mechanism with
commit f58b8db47 ("MEDIUM: receivers: add an rx_unbind() method in
the protocols"). This mechanism rightfully refrains from unbinding
when sockets are expected to be transferrable to another worker via
"expose-fd listeners", but this is not compatible with ABNS sockets,
which do not support reuseport, unbinding nor being renamed: in short
they will always prevent a new process from binding.
It turns out that this is not much visible because by pure accident,
GTUNE_SOCKET_TRANSFER is only set in the code dealing with master mode
and deamons, so it's never set in foreground mode nor in tests even if
present on the stats socket. However with master mode, it is now always
set even when not present on the stats socket, and will always conflict.
The only reasonable approach seems to consist in marking these abns
sockets as non-suspendable so that the generic sock_unbind() code can
decide to just unbind them regardless of GTUNE_SOCKET_TRANSFER.
This should carefully be backported as far as 2.4.
BUG/MINOR: startup: set GTUNE_SOCKET_TRANSFER correctly
This bug was forbidding the GTUNE_SOCKET_TRANSFER option to be set
when haproxy is neither in daemon mode nor in mworker mode. So it
basically only impacts the foreground mode.
The fix moves the code outside the 'if (global.mode & (MODE_DAEMON |
MODE_MWORKER | MODE_MWORKER_WAIT))' condition.
Bug was introduced with 7f80eb23 ("MEDIUM: proxy: zombify proxies only
when the expose-fd socket is bound").
Willy Tarreau [Sat, 18 Nov 2023 11:00:37 +0000 (12:00 +0100)]
[RELEASE] Released version 2.9-dev10
Released version 2.9-dev10 with the following main changes :
- CLEANUP: Re-apply xalloc_size.cocci (3)
- BUG/MEDIUM: stconn: Report send activity during mux-to-mux fast-forward
- BUG/MEDIUM: stconn: Don't report rcv/snd expiration date if SC cannot epxire
- MINOR: stconn: Don't queue stream task in past in sc_notify()
- BUG/MEDIUM: Don't apply a max value on room_needed in sc_need_room()
- BUG/MINOR: stconn: Sanitize report for read activity
- CLEANUP: htx: Properly indent htx_reserve_max_data() function
- DOC: stconn: Improve comments about lra and fsb usage
- BUG/MEDIUM: quic: fix actconn on quic_conn alloc failure
- BUG/MEDIUM: quic: fix sslconns on quic_conn alloc failure
- BUG/MEDIUM: mux-h1: Be sure xprt support splicing to use it during fast-forward
- MINOR: proto_reverse_connect: use connect timeout
- BUG/MINOR: mux-h1: Release empty ibuf during data fast-forwarding
- BUG/MINOR: stick-table/cli: Check for invalid ipv4 key
- MEDIUM: stktable/cli: simplify entry key handling
- MINOR: stktable/cli: support v6tov4 and v4tov6 conversions
- BUG/MINOR: mux-h1: Properly handle http-request and http-keep-alive timeouts
- BUG/MEDIUM: freq-ctr: Don't report overshoot for long inactivity period
- BUG/MEDIUM: pool: fix releasable pool calculation when overloaded
- BUG/MINOR: pool: check one other random bucket on alloc conflict
- BUG/MEDIUM: pool: try once to allocate from another bucket if empty
- MEDIUM: stconn/muxes: Loop on data fast-forwarding to forward at least a buffer
- MINOR: stconn/mux-h2: Use a iobuf flag to report EOI to consumer side during FF
- MEDIUM: quic: Heavy task mode during handshake
- MEDIUM: quic: Heavy task mode with non contiguously bufferized CRYPTO data
- MINOR: quic: release the TLS context asap from quic_conn_release()
- MINOR: quic: Add idle timer task pointer to traces
- BUG/MINOR: quic: idle timer task requeued in the past
- CLEANUP: quic: Indentation fix in qc_do_build_pkt()
- MINOR: quic: Avoid zeroing frame structures
- BUG/MEDIUM: quic: Too short Initial packet sent (enc. level allocation failed)
- BUG/MEDIUM: quic: Avoid trying to send ACK frames from an empty ack ranges tree
- BUG/MEDIUM: quic: Possible crashes when sending too short Initial packets
- BUG/MEDIUM: quic: Avoid some crashes upon TX packet allocation failures
- BUG/MEDIUM: quic: Possible crashes during secrets allocations (heavy load)
- BUG/MEDIUM: stconn: Don't update stream expiration date if already expired
- MINOR: errors: ha_alert() and ha_warning() uses warn_exec_path()
- MINOR: errors: does not check MODE_STARTING for log emission
- MEDIUM: errors: move the MODE_QUIET test in print_message()
- DOC: management: -q is quiet all the time
- MEDIUM: mworker: -W is mandatory when using -S
- BUG/MEDIUM: mux-h1: Exit early if fast-forward is not supported by opposite SC
- MEDIUM: quic: adjust address validation
- MINOR: quic: reduce half open counters scope
- MEDIUM: quic: limit handshake per listener
- MEDIUM: quic: define an accept queue limit
- BUG/MINOR: quic: fix retry token check inconsistency
- MINOR: task/debug: explicitly support passing a null caller to wakeup functions
- MINOR: task/debug: make task_queue() and task_schedule() possible callers
- OPTIM: mux-h2: don't allocate more buffers per connections than streams
- BUG/MINOR: quic: remove dead code in error path
- MEDIUM: quic: respect closing state even on soft-stop
- MEDIUM: quic: release conn socket before using quic_cc_conn
- DOC: config: use the word 'backend' instead of 'proxy' in 'track' description
- BUG/MEDIUM: applet: Remove appctx from buffer wait list on release
- MINOR: tools: make str2sa_range() directly return type hints
- BUG/MEDIUM: server: invalid address (post)parsing checks
- BUG/MINOR: sink: don't learn srv port from srv addr
- CLEANUP: sink: bad indent in sink_new_from_logger()
- CLEANUP: sink: useless leftover in sink_add_srv()
- BUG/MINOR: quic: Useless use of non-contiguous buffer for in order CRYPTO data
- MINOR: server: always initialize pp_tlvs for default servers
- BUG/MEDIUM: proxy: always initialize the default settings after init
- MEDIUM: startup: 'haproxy -c' is quiet when valid
- BUG/MINOR: sample: Fix bytes converter if offset is bigger than sample length
- BUG/MINOR: log: keep the ref in dup_logger()
- BUG/MINOR: quic: fix crash on qc_new_conn alloc failure
- BUG/MINOR: quic: fix decrement of half_open counter on qc alloc failure
- BUG/MEDIUM: quic: fix FD for quic_cc_conn
- DOC: config: Fix name for tune.disable-zero-copy-forwarding global param
- REGTESTS: startup: -conf-OK requires -V with current VTest
- BUG/MEDIUM: quic: Non initialized CRYPTO data stream deferencing
- MINOR: quic: Add a max window parameter to congestion control algorithms
- MINOR: quic: Maximum congestion control window for each algo
- DOC: quic: Wrong syntax for "quic-cc-algo" keyword.
- DOC: quic: Maximum congestion control window configuration
- BUG/MINOR: quic: maximum window limits do not match the doc
- BUG/MEDIUM: connection: report connection errors even when no mux is installed
- BUG/MINOR: stconn: Handle abortonclose if backend connection was already set up
- MINOR: connection: Add a CTL flag to notify mux it should wait for reads again
- MEDIUM: mux-h1: Handle MUX_SUBS_RECV flag in h1_ctl() and susbscribe for reads
- BUG/MEDIUM: stream: Properly handle abortonclose when set on backend only
- MINOR: stconn: Use SC to detect frontend connections in sc_conn_recv()
- REGTESTS: http: Improve script testing abortonclose option
- MINOR: activity: report profiling duration and age in "show profiling"
- BUG/MEDIUM: mworker: set the master variable earlier
- BUG/MEDIUM: stream: Don't call mux .ctl() callback if not implemented
- MINOR: connection: update rhttp flags usage
- BUG/MINOR: mux_h2: reject passive reverse conn if error on add to idle
- MINOR: server: force add to idle on reverse
- MINOR: shctx: Set last_append to NULL when reserving block in hot list
- MEDIUM: shctx: Move list between hot and avail list in O(1)
- MEDIUM: shctx: Simplify shctx_row_reserve_hot loop
- MINOR: shctx: Remove explicit 'from' param from shctx_row_data_append
- MEDIUM: cache: Use dedicated cache tree lock alongside shctx lock
- MINOR: cache: Remove expired entry delete in "show cache" command
- MINOR: cache: Add option to avoid removing expired entries in lookup function
- MEDIUM: cache: Use rdlock on cache in cache_use
- MEDIUM: shctx: Remove 'hot' list from shared_context
- MINOR: cache: Use dedicated trash for "show cache" cli command
- MEDIUM: cache: Switch shctx spinlock to rwlock and restrict its scope
- MEDIUM: cache: Add refcount on cache_entry
- MEDIUM: shctx: Descend shctx_lock calls into the shctx_row_reserve_hot
- MINOR: shctx: Add new reserve_finish callback call to shctx_row_reserve_hot
- MAJOR: cache: Delay cache entry delete in reserve_hot function
- MINOR: shctx: Remove redundant arg from free_block callback
- MINOR: shctx: Remove 'use_shared_mem' variable
- DOC: cache: Specify when function expects a cache lock
- BUG/MEDIUM: stconn: Update fsb date on partial sends
- MINOR: htx: Use a macro for overhead induced by HTX
- MINOR: channel: Add functions to get info on buffers and deal with HTX streams
- BUG/MINOR: stconn: Fix streamer detection for HTX streams
- BUG/MINOR: stconn: Use HTX-aware channel's functions to get info on buffer
- BUG/MINOR: stconn/applet: Report send activity only if there was output data
- BUG/MINOR: stconn: Report read activity on non-indep streams for partial sends
- BUG/MINOR: shctx: Remove old HA_SPIN_INIT
- REGTESTS: try to activate again the seamless reload test with the master CLI
- MINOR: proxy: Add "handshake" new timeout (frontend side)
- MEDIUM: quic: Add support for "handshake" timeout setting.
- MINOR: quic: Dump the expiration date of the idle timer task
- BUG/MINOR: quic: Malformed CONNECTION_CLOSE frame
- MEDIUM: session: handshake timeout (TCP)
- DOC: proxy: Add "handshake" timeout documentation.
- MINOR: quic: Rename "handshake" timeout to "client-hs"
- CLEANUP: haproxy: remove old comment from 1.1 from the file header
- BUG/MEDIUM: mux-h2: fail earlier on malloc in takeover()
- BUG/MEDIUM: mux-h1: fail earlier on malloc in takeover()
- BUG/MEDIUM: mux-fcgi: fail earlier on malloc in takeover()
- MINOR: rhttp: remove the unused outgoing connect() function
- MINOR: backend: without ->connect(), allow to pick another thread's connection
- BUG/MINOR: stream/cli: report correct stream age in "show sess"
- MINOR: stream/cli: add an optional "older" filter for "show sess"
- MINOR: stream/cli: add another filter "susp" to "show sess"
- MINOR: stktable: add stktable_deinit function
- BUG/MINOR: proxy/stktable: missing frees on proxy cleanup
- CLEANUP: backend: removing unused LB param
- MEDIUM: lbprm: store algo params on 32bits
- MEDIUM: log/balance: merge tcp/http algo with log ones
- Revert "MINOR: proxy: report a warning for max_ka_queue in proxy_cfg_ensure_no_http()"
- Revert "MINOR: tcp_rules: tcp-{request,response} requires TCP or HTTP mode"
- Revert "MINOR: stktable: "stick" requires TCP or HTTP mode"
- Revert "MINOR: cfgparse-listen: "http-send-name-header" requires TCP or HTTP mode"
- Revert "MINOR: cfgparse-listen: "dynamic-cookie-key" requires TCP or HTTP mode"
- Revert "MINOR: cfgparse-listen: "http-reuse" requires TCP or HTTP mode"
- Revert "MINOR: fcgi-app: "use-fcgi-app" requires TCP or HTTP mode"
- Revert "MINOR: http_htx/errors: prevent the use of some keywords when not in tcp/http mode"
- Revert "MINOR: flt_http_comp: "compression" requires TCP or HTTP mode"
- Revert "MINOR: filter: "filter" requires TCP or HTTP mode"
- MINOR: log/backend: ensure log exclusive params are not used in other modes
- MINOR: log/backend: prevent tcp-{request,response} use with LOG mode
- MINOR: log/backend: prevent stick table and stick rules with LOG mode
- MINOR: log/backend: prevent "http-send-name-header" use with LOG mode
- MINOR: log/backend: prevent "dynamic-cookie-key" use with LOG mode
- REGTESTS: http: add a test to validate chunked responses delivery
Willy Tarreau [Sat, 18 Nov 2023 10:01:28 +0000 (11:01 +0100)]
REGTESTS: http: add a test to validate chunked responses delivery
I've had this test here never committed over the last 2.5 years, that
works fine and I didn't notice it was not part of the tree. It makes a
server return odd-sized chunked responses with short pauses between half
of thems and verifies they're not truncated on the client. It may detect
eventually state machine breakages, so better commit it.
MINOR: log/backend: prevent tcp-{request,response} use with LOG mode
We start implementing some postparsing compatibility checks for log
backends.
Here we report a warning if user tries to use tcp-{request,response} rules
with log backend, and we properly ignore such rules when inherited from
defaults section.
MINOR: log/backend: ensure log exclusive params are not used in other modes
add proxy_cfg_ensure_no_log() function (similar to
proxy_cfg_ensure_no_http()) to ensure at the end of proxy parsing that
no log exclusive options are found if the proxy is not in log mode.
MEDIUM: log/balance: merge tcp/http algo with log ones
"log-balance" directive was recently introduced to configure the
balancing algorithm to use when in a log backend. However, it is
confusing and it causes issues when used in default section.
In this patch, we take another approach: first we remove the
"log-balance" directive, and instead we rely on existing "balance"
directive to configure log load balancing in log backend.
Some algorithms such as roundrobin can be used as-is in a log backend,
and for log-only algorithms, they are implemented as "log-$name" inside
the "backend" directive.
Make sure lbprm.algo can store 32bits by declaring it as uint32_t
Then, use all 32 available bits to offer 4 extra bits for the BE_LB_NEED
inputs. This will allow new required inputs to be easily added (up to 4
new ones, plus one that wasn't used yet if we keep them exclusive)
This required some cleanup: all ALGO bitfields were rewritten in the
32bits format and the high ones were shifted to make room for the
new BE_LB_NEED bits.
BE_LB_HASH_RND was introduced with 760e81d35 ("MINOR: backend: implement
random-based load balancing") but was never used since. Removing it
to regain an extra slot for future types.
Willy Tarreau [Fri, 17 Nov 2023 18:29:02 +0000 (19:29 +0100)]
MINOR: stream/cli: add another filter "susp" to "show sess"
This one reports streams considered as "suspicious", i.e. those with
no expiration dates or dates in the past, or those without a front
endpoint. More criteria could be added in the future.
Willy Tarreau [Fri, 17 Nov 2023 17:32:59 +0000 (18:32 +0100)]
MINOR: stream/cli: add an optional "older" filter for "show sess"
It's often needed to be able to refine "show sess" when debugging, and
very often a first glance at old streams is performed, but that's a
difficult task in large dumps, and it takes lots of resources to dump
everything.
This commit adds "older <age>" to "show sess" in order to specify the
minimum age of streams that will be dumped. This should simplify the
identification of blocked ones.
Willy Tarreau [Fri, 17 Nov 2023 17:51:26 +0000 (18:51 +0100)]
BUG/MINOR: stream/cli: report correct stream age in "show sess"
Since 2.4-dev2 with commit 15e525f49 ("MINOR: stream: Don't retrieve
anymore timing info from the mux csinfo"), we don't replace the
tv_accept (now accept_ts) anymore with the current request's, so that
it properly reflects the session's accept date and not the request's
date. However, since then we failed to update "show sess" to make use
of the request's timestamp instead of the session's timestamp, resulting
in fantasist values in the "age" field of "show sess" for the task.
Indeed, the session's age is displayed instead of the stream's, which
leads to great confusion when debugging, particularly when it comes to
multiplexed inter-proxy connections which are kept up forever.
Let's fix this now. This must be backported as far as 2.4. However,
for 2.7 and older, the field was named tv_request and was a timeval.
Willy Tarreau [Fri, 17 Nov 2023 09:53:36 +0000 (10:53 +0100)]
MINOR: backend: without ->connect(), allow to pick another thread's connection
If less connections than threads are established on a reverse-http gateway
and these servers have a non-nul pool-min-conn, then conn_backend_get()
will refrain from picking available connections from other threads. But
this makes no sense for protocols for which there is no ->connect(),
since there's no way the current thread will manage to establish its own
connection. For such situations we should always accept to use another
thread's connection. That's precisely what this patch does.
Willy Tarreau [Fri, 17 Nov 2023 09:52:13 +0000 (10:52 +0100)]
MINOR: rhttp: remove the unused outgoing connect() function
A dummy connect() function previously had to be installed for the log
server so that a reverse-http address could be referenced on a "server"
line, but after the recent rework of the server line parsing, this is
no longer needed, and this is actually annoying as it makes one believe
there is a way to connect outside, which is not true. Let's now get rid
of this function.
Willy Tarreau [Fri, 17 Nov 2023 09:56:33 +0000 (10:56 +0100)]
BUG/MEDIUM: mux-fcgi: fail earlier on malloc in takeover()
This is the equivalent of the previous "BUG/MEDIUM: mux-h1: fail earlier
on malloc in takeover()".
Connection takeover was implemented for fcgi in 2.2 by commit a41bb0b6c
("MEDIUM: mux_fcgi: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling fcgi_release() will also result in
touching the thread-local list of buffer waiters, calling unsubscribe(),
There are even code paths where the thread will try to grab the lock of
its own idle conns list, believing the connection is there while it has
no useful effect. However, if the owner thread was doing the same at the
same moment, and ended up trying to pick from the current thread (which
could happen if picking a connection for a different name), the two
could even deadlock.
No tests were made to try to reproduce the problem, but the description
above is sufficient to see that nothing can guarantee against it.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
Willy Tarreau [Fri, 17 Nov 2023 09:56:33 +0000 (10:56 +0100)]
BUG/MEDIUM: mux-h1: fail earlier on malloc in takeover()
This is the h1 equivalent of previous "BUG/MEDIUM: mux-h2: fail earlier
on malloc in takeover()".
Connection takeover was implemented for H1 in 2.2 by commit f12ca9f8f1
("MEDIUM: mux_h1: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling h1_release() will call some unsubscribe
and and possibly other things whose safety is not guaranteed (and the
ambiguity here alone is sufficient to be careful). There are even code
paths where the thread will try to grab the lock of its own idle conns
list, believing the connection is there while it has no useful effect.
However, if the owner thread was doing the same at the same moment, and
ended up trying to pick from the current thread (which could happen if
picking a connection for a different name), the two could even deadlock.
Contrary to mux-h2, a few tests were not sufficient to try to crash the
process, but there's nothing that indicates it couldn't happen based on
the description above.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
Willy Tarreau [Fri, 17 Nov 2023 09:56:33 +0000 (10:56 +0100)]
BUG/MEDIUM: mux-h2: fail earlier on malloc in takeover()
Connection takeover was implemented for H2 in 2.2 by commit cd4159f03
("MEDIUM: mux_h2: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling h2_release() will also result in
h2_process() and h2_process_demux() that may process any possibly
pending frames. Even the session remains the old one on the old thread,
so that some sess_log() that are called when facing certain demux errors
will be associated with the previous thread, possibly accessing a number
of elements belonging to another thread. There are even code paths where
the thread will try to grab the lock of its own idle conns list, believing
the connection is there while it has no useful effect. However, if the
owner thread was doing the same at the same moment, and ended up trying
to pick from the current thread (which could happen if picking a connection
for a different name), the two could even deadlock.
The risk is extremely low, but Fred managed to reproduce use-after-free
errors in conn_backend_get() after a takeover() failed by playing with
-dMfail, indicating that h2_release() had been successfully called. In
practise it's sufficient to have h2 on the server side with reuse-always
and to inject lots of request on it with -dMfail.
This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before risking to face allocation
failures, it first pre-allocates a new task and tasklet, then assigns
them to the connection if the migration succeeds, otherwise it just
frees them. This way it's no longer needed to manipulate the connection
until it's fully migrated, and as a bonus this means the connection will
continue to exist and the use-after-free condition is solved at the same
time.
This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!
Willy Tarreau [Fri, 17 Nov 2023 16:59:41 +0000 (17:59 +0100)]
CLEANUP: haproxy: remove old comment from 1.1 from the file header
There was still a totally outdated comment speaking about issues
affecting solaris on 1.1.8pre4 (April 2002, 21 year-old)! This
proves that comments in headers are never read, so let's take this
opportunity for also removing the outdated one recommending to read
the "updated" RFC7230.
Document the "handshake" timeout new setting available one frontend side.
This should at least be helpful for QUIC client connections to prevent
an attacker from refreshing plenty of connections without completing
the handshake step, leading haproxy to consume memory for nothing.
Adapt session_accept_fd() called on accept() to set the handshake timeout from
"hanshake-timeout" setting if set by configuration. If not set, continue to use
the "client" timeout setting.
This bug arrived with this commit:
MINOR: quic: Avoid zeroing frame structures
Before this latter, the CONNECTION_CLOSE was zeroed, especially the "reason phrase
length".
MINOR: quic: Dump the expiration date of the idle timer task
This date is shared between the idle timer and hanshake timeout. So, it should be
useful to dump the expiration date of the idle timer task itself, in place of the
idle timer expiration date. This way, the handshake timeout value will be visible
during the handshake from CLI "show quic full" command.
MEDIUM: quic: Add support for "handshake" timeout setting.
The idle timer task may be used to trigger the client handshake timeout.
The hanshake timeout expiration date (qc->hs_expire) is initialized when the
connection is allocated. Obviously, this timeout is taken into an account only
during the handshake by qc_idle_timer_do_rearm() whose job is to rearm the idle timer.
The idle timer expiration date could be initialized only one time, then
never updated until the hanshake completes. But this only works if the
handshake timeout is smaller than the idle timer task timeout. If the handshake
timeout is set greater than the idle timeout, this latter may expire before the
handshake timeout.
This patch may have an impact on the L1/C1 interop tests (with heavy packet loss
or corruption). This is why I guess some implementations with a hanshake timeout
support set a big timeout during this test. This is at least the case for ngtcp2
which sets a 180s hanshake timeout! haproxy will certainly have to proceed the
same way if it wants to have a chance to pass this test as before this handshake
timeout.
MINOR: proxy: Add "handshake" new timeout (frontend side)
Add a new timeout for the handshake, on the frontend side only. Such a hanshake
will be typically used for TLS hanshakes during client connections to TLS/TCP or
QUIC frontends.
REGTESTS: try to activate again the seamless reload test with the master CLI
Since the reload is now synchronous over the master CLI, try to reload
with it. This was a problem before with the signals because it wasn't
possible to wait for the end of the reload before sending the requests.
This activate again this test, we will see if it's more stable or we
will deactivate it again..
The shctx lock was changed from a SPINLOCK to a RWLOCK in commit ed35b94
"MEDIUM: cache: Switch shctx spinlock to rwlock and restrict its scope"
but a SPIN_INIT was left behind.
BUG/MINOR: stconn: Report read activity on non-indep streams for partial sends
Partial sends is an activity, not a full blocking. Thus a read activity must
be reported for non-independent stream. It is especially important for very
congested stream where full sends are uncommon.
BUG/MINOR: stconn/applet: Report send activity only if there was output data
For applets and connection, when a send attempt is performed, we must be
sure to not report a send activity if there was no output data at all before
the attempt.
It is not important for the <fsb> date itself but for the <lra> date for
non-independent stream.
BUG/MINOR: stconn: Use HTX-aware channel's functions to get info on buffer
Some channel function are used to check if the channel's buffer is full, not
empty or if there are input data. However, functions used are not
HTX-aware. So it is not accurate and may prevent some actions to be
performed (However, not sure there are really issues). Because HTX-aware
versions now exist, use them instead.
This patch may be backported as far as 2.2. It relies on
* "MINOR: channel: Add functions to get info on buffers and deal with HTX streams"
* "MINOR: htx: Use a macro for overhead induced by HTX"
BUG/MINOR: stconn: Fix streamer detection for HTX streams
Since the HTX was introduced, the streamer detection is broken for HTX
streams because the HTX overhead was not counted in the test to set
CF_STREAMER and CF_STREAMER_FAST flags.
The consequence was that the consumer side was no longer able to send more
than tune.ssl.maxrecord at a time in SSL.
To fix the issue, we now count the HTX overhead of HTX streams to be able to
set CF_STREAMER/CF_STREAMER_FAST flags on a channel.
This patch relies on folloing commits:
* "MINOR: channel: Add functions to get info on buffers and deal with HTX streams"
* "MINOR: htx: Use a macro for overhead induced by HTX"
MINOR: channel: Add functions to get info on buffers and deal with HTX streams
This patch adds HXT-aware versions of the functions c_data(), ci_data() and
c_empty(). channel_data() function returns the amount of data in the
channel, channel_input_data() returns the amount of input data and
channel_empty() returns true if the channel's buffer is empty. These
functions handles HTX buffers.
In addition, channel_data_limit() function, still HTX-aware, can be used to
get the maximum absolute amount of data that can be copied in a buffer,
independently on data already present in the buffer.
MINOR: htx: Use a macro for overhead induced by HTX
The overhead induced by the HTX format was set to the HTX structure itself
and two HTX blocks. It was set this way to optimize zero-copy during
transfers. This value may (and will) be used at different places. Thus we
now use a macro, called HTX_BUF_OVERHEAD.
BUG/MEDIUM: stconn: Update fsb date on partial sends
The first-send-blocked date was originally designed to save the date of the
first send of a series where some data remain blocked. It was relaxed
recently (3083fd90e "BUG/MEDIUM: stconn: Report a send activity everytime
data were sent") to save the date of the first full blocked send. However,
it is not accurrate.
When all data are sent, the fsb value must be reset to TICK_ETERNITY. When
nothing is sent and if it is not already set, it must be set. But when data
are partially sent, the value must be updated and not reset. Otherwise the
write timeout may be ignored because fsb date is never set.
So, changes brought by the patch above are reverted and
sc_ep_report_blocked_send() was changed to know if some data were sent or
not. This way we are able to update fsb value.
l
This patch must be backported to 2.8.
DOC: cache: Specify when function expects a cache lock
Some functions are built on the fact that the cache lock must be already
taken by the caller. This patch adds this information in the functions'
descriptions.
This global variable was used to avoid using locks on shared_contexts in
the unlikely case of nbthread==1. Since the locks do not do anything
when USE_THREAD is not defined, it will be more beneficial to simply
remove this variable and the systematic test on its value in the shared
context locking functions.
MAJOR: cache: Delay cache entry delete in reserve_hot function
A reference counter on the cache_entry was added in a previous commit.
Its value is atomically increased and decreased via the retain_entry and
release_entry functions.
This is needed because of the latest cache and shared_context
modifications that introduced two separate locks instead of the
preexisting single shctx_lock one.
With the new logic, we have two main blocks competing for the two locks:
- the one in the http_action_req_cache_use that performs a lookup in the
cache tree (locked by the cache lock) and then tries to remove the
corresponding blocks from the shared_context's 'avail' list until the
response is sent to the client by the cache applet,
- the shctx_row_reserve_hot that traverses the 'avail' list and gives
them back to the caller, while removing previous row heads from the
cache tree
Those two blocks require the two locks but one of them would take the
cache lock first, and the other one the shctx_lock first, which would
end in a deadlock without the current patch.
The way this conflict is resolved in this patch is by ensuring that at
least one of those uses works without taking the two locks at the same
time.
The solution found was to keep taking the two locks in the cache_use
case. We first lock the cache to lookup for an entry and we then take
the shctx lock as well to detach the corresponding blocks from the
'avail' list. The subtlety is that between the cache lookup and the
actual locking of the shctx, another thread might have called the
reserve_hot function in which we only take the shctx lock.
In this function we traverse the 'avail' list to remove blocks that are
then given to the caller. If one of those blocks corresponds to a
previous row head, we call the 'free_blocks' callback that used to
delete the cache entry from the tree.
We now avoid deleting directly the cache entries in reserve_hot and we
rather set the cache entries 'complete' param to 0 so that no other
thread tries to work with this entry. This way, when we release the
shctx lock in reserve_hot, the first thread that had performed the cache
lookup and had found an entry that we just gave to another thread will
see that the 'complete' field is 0 and it won't try to work with this
response.
The actual removal of entries from the cache tree will now be performed
in the new 'reserve_finish' callback called at the end of the
shctx_row_reserve_hot function. It will iterate on all the row head that
were inserted in a dedicated list in the 'free_block' callback and
perform the actual delete.
MINOR: shctx: Add new reserve_finish callback call to shctx_row_reserve_hot
This patch adds a reserve_finish callback that can be defined by the
subsystems that require a shared_context. It is called at the end of
shctx_row_reserve_hot after the shared_context lock is released.
MEDIUM: shctx: Descend shctx_lock calls into the shctx_row_reserve_hot
Descend the shctx_lock calls into the shctx_row_reserve_hot so that the
cases when we don't need to lock anything (enough space in the current
row or not enough space in the 'avail' list) do not take the lock at
all.
In sh_ssl_sess_new_cb the lock had to be descended into
sh_ssl_sess_store in order not to cover the shctx_row_reserve_hot call
anymore.
Add a reference counter on the cache_entry. Its value will be atomically
increased and decreased via the retain_entry and release_entry
functions.
The release_entry function has two distinct versions,
release_entry_locked and release_entry_unlocked that should be called
when the cache lock is already taken in write mode or not
(respectively). In the unlocked case the cache lock will only be taken
in write mode on the last reference of the entry (before calling
delete_entry). This allows to limit the amount of times when we need to
take the cache lock during a release operation.
MEDIUM: cache: Switch shctx spinlock to rwlock and restrict its scope
Since a lock on the cache tree was added in the latest cache changes, we
do not need to use the shared_context's lock to lock more than pure
shared_context related data anymore. This already existing lock will now
only cover the 'avail' list from the shared_context. It can then be
changed to a rwlock instead of a spinlock because we might want to only
run through the avail list sometimes.
Apart form changing the type of the shctx lock, the main modification
introduced by this patch is to limit the amount of code covered by the
shctx lock. This lock does not need to cover any code strictly related
to the cache tree anymore.
MINOR: cache: Use dedicated trash for "show cache" cli command
After the latest changes in the cache/shared_context mechanism, the
cache and shared_context logic were decorrelated and in some unlikely
cases we might end up using the "show cache" command while some regular
cache processing is occurring (a response being stored in the cache for
instance). In such a case, because we used the same 'trash' buffer in
those two contexts, we could end up with the contents of a response in
the ouput of the "show cache" command.
This patch fixes this problem by allocating a dedicated trash for the
CLI command.
MEDIUM: shctx: Remove 'hot' list from shared_context
The "hot" list stored in a shared_context was used to keep a reference
to shared blocks that were currently being used and were thus removed
from the available list (so that they don't get reused for another cache
response). This 'hot' list does not ever need to be shared across
threads since every one of them only works on their current row.
The main need behind this 'hot' list was to detach the corresponding
blocks from the 'avail' list and to have a known list root when calling
list_for_each_entry_from in shctx_row_data_append (for instance).
Since we actually never need to iterate over all members of the 'hot'
list, we can remove it and replace the inc_hot/dec_hot logic by a
detach/reattach one.
When looking for a valid entry in the cache tree in
http_action_req_cache_use, we do not need to delete an expired entry at
once because even if an expired entry exists, since the request will be
forwarded to the server, then the expired entry will be overwritten when
the updated response is seen. We can then use a simpler rdlock during
cache_use operation.
MINOR: cache: Add option to avoid removing expired entries in lookup function
Any lookup in the cache tree done through entry_exist or
secondary_entry_exist functions could end up deleting the corresponding
entry if it is expired which prevents from using a rdlock on code paths
that would just perform a lookup on the tree (in
http_action_req_cache_use for instance).
Adding a 'delete_expired' boolean as a parameter allows for "pure"
lookups and thus it will allow to perform operations on the tree that
simply require a rdlock instead of a "heavier" wrlock.
MINOR: cache: Remove expired entry delete in "show cache" command
The "show cache" CLI command iterates over all the entries of the cache
tree and it used this opportunity to remove expired entries from the
cache. This behavior was completely undocumented and does not seem that
necessary. By removing it we can take the cache lock in read mode only
which limits the impact on the other threads.
MEDIUM: cache: Use dedicated cache tree lock alongside shctx lock
Every use of the cache tree was covered by the shctx lock even when no
operations were performed on the shared_context lists (avail and hot).
This patch adds a dedicated RW lock for the cache so that blocks of code
that work on the cache tree only can use this lock instead of the
superseding shctx one. This is useful for operations during which the
concerned blocks are already in the hot list.
When the two locks need to be taken at the same time, in
http_action_req_cache_use and in shctx_row_reserve_hot, the shctx one
must be taken first.
A new parameter needed to be added to the shared_context's free_block
callback prototype so that cache_free_block can take the cache lock and
release it afterwards.
The shctx_row_reserve_hot relied on two loop levels in order to first
look for the first block of a preused row and then iterate on all the
blocks of this row to reserve them for the new row. This was not the
simplest nor the easiest to read way so this logic could be replaced by
a single iteration on the avail list members.
The two use cases of calling this function with or without a preexisting
"first" member were a bit cumbersome as well and were replaced by a more
straightforward approach.
MEDIUM: shctx: Move list between hot and avail list in O(1)
Instead of iterating over all the elements of a given row when moving it
between the hot and available lists, we can make use of the last_reserved
pointer that already points to the last block of the list to perform the
move in O(1).
MINOR: shctx: Set last_append to NULL when reserving block in hot list
Ensure that the last_append pointer is always set to NULL on first block
of rows reserved by the subsystems using the shctx (cache for instance).
This pointer will be used directly in shctx_row_data_append instead of
the 'from' param which will simplify its uses.
Amaury Denoyelle [Thu, 16 Nov 2023 16:16:00 +0000 (17:16 +0100)]
MINOR: server: force add to idle on reverse
A backend connection is inserted in server idle list via
srv_add_to_idle_list(). This function has several conditions which may
cause the connection to be rejected instead.
One of this condition is based on the current estimate count of needed
connections for the server. If the count of idle connections stored has
already reached this estimation, the new connection is rejected. This is
in opposition with the purpose of reverse HTTP. On active reverse,
haproxy can instantiate several connections to properly serve the future
traffic. However, the opposite passive haproxy will have only a low
estimate of needed connection and will reject most of them.
To fix this, simply check CO_FL_REVERSED connection flag on
srv_add_to_idle_list(). If set, the connection is inserted without
checking for estimate count. Note that all other conditions are not
impacted, so it's still possible to reject a connection, for example if
process FD limit is reached.
This commit relies on recent patch which change CO_FL_REVERSED flag for
connection after passive reverse.
Amaury Denoyelle [Thu, 16 Nov 2023 16:13:41 +0000 (17:13 +0100)]
BUG/MINOR: mux_h2: reject passive reverse conn if error on add to idle
On passive reverse, H2 mux is responsible to insert the connection in
the server idle list. This is done via srv_add_to_idle_list(). However,
this function may fail for various reason, such as FD usage limit
reached.
Handle properly this error case. H2 mux flags the connection on error
which will cause its release. Prior to this patch, the connection was
only released on server timeout.
This bug was found inspecting server curr_used_conns counter. Indeed, on
connection reverse, this counter is first incremented. It is decremented
just after on srv_add_to_idle_list() if insertion is validated. However,
if insertion is rejected, the connection was not released which cause
curr_used_conns to remains positive. This has the major downside to
break the reusing of idle connection on rhttp causing spurrious 503
errors.
Amaury Denoyelle [Thu, 16 Nov 2023 16:13:28 +0000 (17:13 +0100)]
MINOR: connection: update rhttp flags usage
Change the flags used for reversed connection :
* CO_FL_REVERSED is now put after reversal for passive connect. For
active connect, it is delayed when accept is completed after reversal.
* CO_FL_ACT_REVERSING replace the old CO_FL_REVERSED. It is put only for
active connect on reversal and removes once accept is done.
This allows to identify a connection as reversed during its whole
lifetime. This should be useful to extend reverse connect.
BUG/MEDIUM: stream: Don't call mux .ctl() callback if not implemented
The commit 5ff7d2276 ("BUG/MEDIUM: stream: Properly handle abortonclose when set
on backend only") introduced a regression. Not all multiplexer implement the
.ctl() callback function. Thus we must be sure this callback function is defined
first to call it.
This patch should fix a crash reported by Tristan in the issue #2095. It must be
backported as far as 2.2, with the commit above.
BUG/MEDIUM: mworker: set the master variable earlier
Since 2.7 and the mcli_reload_bind_conf (56f73b21a5c), upon a reload
failure because of a bind error, the mcli_reload_bind_conf go through a
sock_unbind((). This is not supposed to do anything when a listener is
RX_F_INHERITED in the master, but unfortunately this is done too early
and provokes an exit of the master.
We already suspected in the past that setting the 'master' variable this
late could have negative impact.
The fix sets the master variable earlier before the bind.
This must be backported at least to 2.7. This could be backported
earlier but better wait any feedbacks on the fix.
Willy Tarreau [Tue, 14 Nov 2023 10:44:48 +0000 (11:44 +0100)]
MINOR: activity: report profiling duration and age in "show profiling"
Seeing counters in "show profiling" is not always very helpful without
an indication of how long the analysis lasted nor if it's still active
or not. Let's add a pair of start/stop timers for tasks and memory so
that we can now indicate how long the measurements lasted and when they
ended (or 0 if still running).
Note that for tasks profiling set to "auto", the measurement is considered
enabled since it can automatically switch on and off on a per-thread
basis.
We now take care to properly handle the abortonclose close option if it is
set on the backend and be sure we ignore it when it is set on the frontend
(inherited from the defaults section).
BUG/MEDIUM: stream: Properly handle abortonclose when set on backend only
Since the 2.2 and the commit dedd30610 ("MEDIUM: h1: Don't wake the H1 tasklet
if we got the whole request."), we avoid to subscribe for reads if the H1
message is fully received. However, this broke the abortonclose option. To fix
the issue, a CO_RFL flag was added to instruct the mux it should still wait for
read events to properly handle read0. Only the H1 mux was concerned.
But since then, most of time, the option is only handled if it is set on the
frontend proxy because the request is fully received before selecting the
backend. If the backend is selected before the end of the request there is no
issue. But otherwise, because the backend is not known yet, we are unable to
properly handle the option and we miss to subscribe for reads.
Of course the option cannot be set on a frontend proxy. So concretly it means
the option is properly handled if it is enabled in the defaults section (if
common to frontend and backend) or a listen proxy, but it is ignored if it is
set on backend only.
Thanks to previous patches, we can now instruct the mux it should subscribe for
reads if not already done. We use this mechanism in process_stream() when the
connection is set up, ie when backend SC is set to SC_ST_REQ state.
This patch relies on following patches:
* MINOR: connection: Add a CTL flag to notify mux it should wait for reads again
* MEDIUM: mux-h1: Handle MUX_SUBS_RECV flag in h1_ctl() and susbscribe for reads
This patch should be the issue #2344. All the series must be backported as far
as 2.2.
MEDIUM: mux-h1: Handle MUX_SUBS_RECV flag in h1_ctl() and susbscribe for reads
The H1 mux now handle MUX_SUBS_RECV flag in h1_ctl(). If it is not already
subscribed for reads, it does so. This patch will be mandatory to properly
handle abortonclose option.
BUG/MINOR: stconn: Handle abortonclose if backend connection was already set up
abortonclose option is a backend option, it should not be handle on frontend
side. Of course a frontend can also be a backend but the option should not
be handled too early because it is not necessarily the selected backend
(think about a listen proxy routing requests to another backend).
It is especially an issue when the abortonclose option is enabled in the
defaults section and disabled by the selected backend. Because in this case,
the option may still be enabled while it should not.
Thus, now we wait the backend connection was set up to handle the option. To
do so, we check the backend SC state. The option is ignored if it is in
ST_CS_INI state. For all other states, it means the backend was already
selected.
Willy Tarreau [Tue, 14 Nov 2023 06:55:04 +0000 (07:55 +0100)]
BUG/MEDIUM: connection: report connection errors even when no mux is installed
An annoying issue was met when testing the reverse-http mechanism, by
which failed connection attempts would apparently not be attempted again
when there was no connect timeout. It turned out to be more generalized
than the rhttp system, and actually affects all outgoing connections
relying on NPN or ALPN to choose the mux, on which no mux is installed
and for which the subscriber (ssl_sock) must be notified instead.
The problem appeared during 2.2-dev1 development. First, commit 062df2c23 ("MEDIUM: backend: move the connection finalization step to
back_handle_st_con()") broke the error reporting by testing CO_FL_ERROR
only under CO_FL_CONNECTED. While it still worked OK for cases where a
mux was present, it did not for this specific situation because no
single error path would be considered when no mux was present. Changing
the CO_FL_CONNECTED test to also include CO_FL_ERROR did work, until
a few commits later with 477902bd2 ("MEDIUM: connections: Get ride of
the xprt_done callback.") which removed the xprt_done callback that was
used to indicate success or failure of the transport layer setup, since,
as the commit explains, we can report this via the mux. What this last
commit says is true, except when there is no mux.
For this, however, the sock_conn_iocb() function (formerly conn_fd_handler)
is called for such errors, evaluates a number of conditions, none of which
is matched in this error condition case, since sock_conn_check() instantly
reports an error causing a jump to the leave label. There, the mux is
notified if installed, and the function returns. In other error condition
cases, readiness and activity are checked for both sides, the tasklets
woken up and the corresponding subscriber flags removed. This means that
a sane (and safe) approach would consist in just notifying the subscriber
in case of error, if such a subscriber still exists: if still there, it
means the event hasn't been caught earlier, then it's the right moment
to report it. And since this is done after conn_notify_mux(), it still
leaves all control to the mux once it's installed.
This commit should be progressively backported as far as 2.2 since it's
where the problem was introduced. It's important to clearly check the
error path in each function to make sure the fix still does what it's
supposed to.
BUG/MINOR: quic: maximum window limits do not match the doc
This bug arrived with this commit:
MINOR: quic: Add a max window parameter to congestion control algorithms
The documentation was been modified with missing/wrong modifications in the code part.
The 'g' suffix must be accepted to parse value in gigabytes. And exctly 4g is
also accepted.
MINOR: quic: Maximum congestion control window for each algo
Make all the congestion support the maximum congestion control window
set by configuration. There is nothing special to explain. For each
each algo, each time the window is incremented it is also bounded.
MINOR: quic: Add a max window parameter to congestion control algorithms
Add a new ->max_cwnd member to bind_conf struct to store the maximum
congestion control window value for each QUIC binding.
Modify the "quic-cc-algo" keyword parsing to add an optional parameter
to its value: the maximum congestion window value between parentheses
as follows:
ex: quic-cc-algo cubic(10m)
This value must be bounded, greater than 10k and smaller than 1g.
BUG/MEDIUM: quic: Non initialized CRYPTO data stream deferencing
This bug arrived with this commit:
BUG/MINOR: quic: Useless use of non-contiguous buffer for in order CRYPTO data
Before this commit qc->cstream was tested before entering qc_treat_rx_crypto_frms().
This patch restablishes this behavior. Furthermore, it simplyfies
qc_ssl_provide_all_quic_data() which is a little bit ugly: the CRYPTO data
frame may be freed asap in the list_for_each_entry_safe() block after
having store its data pointer and length in local variables.
Also interrupt the CRYPTO data process as soon as qc_ssl_provide_quic_data()
or qc_treat_rx_crypto_frms() fail.
REGTESTS: startup: -conf-OK requires -V with current VTest
Current version of VTest tests the output of "haproxy -c" instead of the
return code. Since we don't output anymore when the configuration is
valid, this broke the test. (a06f621).
This fixes the issue by adding the -V when doing a -conf-OK. But this
must fixed in VTest.
DOC: config: Fix name for tune.disable-zero-copy-forwarding global param
"disable-" prefix was missing. This param was correctly named in the list of
supported keywords in the global section, but not in the keyword
description.
Amaury Denoyelle [Mon, 13 Nov 2023 10:30:36 +0000 (11:30 +0100)]
BUG/MEDIUM: quic: fix FD for quic_cc_conn
Since following commit, quic_conn closes its owned socket before
transition to quic_cc_conn for closing state. This allows to save FDs as
quic_cc_conn could use the listener socket for their I/O.
This patch is incomplete as it removes initialization of <fd> member for
quic_cc_conn. Thus, if sending is done on closing state, <fd> value is
undefined which in most cases will result in a crash. Fix this by simply
initializing <fd> member with qc_init_fd() in qc_new_cc_conn().
This bug should fix recent issue from #2095. Thanks to Tristan for its
reporting and then testing of this patch.
Amaury Denoyelle [Mon, 13 Nov 2023 09:55:30 +0000 (10:55 +0100)]
BUG/MINOR: quic: fix decrement of half_open counter on qc alloc failure
Half open counter is used to comptabilize QUIC connections waiting for
address validation. It was recently reworked to adjust its scope. With
each decrement operation, a BUG_ON() was added to ensure the counter
never wraps.
This BUG_ON() could be triggered if an allocation fails for one of
quic_conn members in qc_new_conn(). This is because half open counter is
incremented at the end of qc_new_conn(). However, in case of alloc
failure, quic_conn_release() is called immediately to ensure the counter
is decremented if a connection is freed before peer address has been
validated.
To fix this, increment half open counter early in qc_new_conn() prior to
every quic_conn members allocations.