Willy Tarreau [Thu, 21 Nov 2024 18:11:18 +0000 (19:11 +0100)]
BUG/MEDIUM: wdt: fix the stuck detection for warnings
If two slow tasks trigger one warning even a few seconds apart, the
watchdog code will mistakenly take this for a definite stuck task and
kill the process. The reason is that since commit 148eb5875f ("DEBUG:
wdt: better detect apparently locked up threads and warn about them")
the updated ctxsw count is not the correct one, instead of updating
the private counter it resets the public one, preventing it from making
progress and making the wdt believe that no progress was made. In
addition the initial value was read from [tid] instead of [thr].
Please note that another fix is needed in debug_handler() otherwise the
watchdog will fire early after the first warning or thread dump.
A simple test for this is to issue several of these commands back-to-back
on the CLI, which crashes an unfixed 3.1 very quickly:
This needs to be backported to 2.9 since the fix above was backported
there. The impact on 3.0 and 2.9 is almost inexistent since the watchdog
there doesn't apply the shorter warning delay, so the first call already
indicates that the thread is stuck.
Willy Tarreau [Thu, 21 Nov 2024 18:19:46 +0000 (19:19 +0100)]
BUG/MEDIUM: debug: don't set the STUCK flag from debug_handler()
Since 2.0 with commit e6a02fa65a ("MINOR: threads: add a "stuck" flag
to the thread_info struct"), the TH_FL_STUCK flag was set by the
debugger to flag that a thread was stuck and report it in the output.
However, two commits later (2bfefdbaef "MAJOR: watchdog: implement a
thread lockup detection mechanism"), this flag was used to detect that
a thread had already been reported as stuck. The problem is that it
seldom happens that a "show threads" command instantly crashes because
it calls debug_handler(), which sets the flag, and if the watchdog timer
was about to trigger before going back to the scheduler, the watchdog
believes that the thread has been stuck for a while and will kill the
process.
The issue was magnified in 3.1 with the lower-delay warning, because
it's possible for a thread to die on the next wakeup after the first
warning (which calls debug_handler() hence sets the STUCK flag).
One good approach would have been to use two distinct flags, one for
"stuck" as reported by the debug handler, and one for "stuck" as seen
by the watchdog. However, one could also argue that since the second
commit, given that the wdt monitors the threads, there's no point any
more for the debug handler to set the flag itself. Removing this code
means that two consecutive "show threads" will not report "stuck" until
the watchdog sets it, which aligns better with expectations.
This can be backported to all stable releases. This code has changed a
bit over time, the "if" block and the harmless variables just need to
be removed.
BUG/MINOR: config: allow to check HAPROXY_LOCALPEER in config
This fixes the commit d6ccd1738bae
("MINOR: startup: set HAPROXY_LOCALPEER only once"). HAPROXY_LOCALPEER could
be checked in the configuration to set some servers settings or listeners. So,
we need to set it just before we read the configuration at the second time.
Let's mark HAPROXY_LOCALPEER as "usable" in the configuration in the related
documentation chapter.
This should not be backported, as related to the latest master-worker
refactoring.
This commit prepares the parsing of localpeer keyword in MODE_DISCOVERY. We
need this, as HAPROXY_LOCALPEER environment variable could be checked in the
configuration in order to enable some backend or frontend settings.
So, let's at first add a dedicated parser for localpeer. At second, we no
longer need to check, if cfg_peers is valid pointer, as in MODE_DISCOVERY we
parse only the "global" section.
In addition, let's make the code of localpeer parser a little bit more
readable.
MINOR: capabilities: rename program_name argument to progname
This commit prepares the usage of the global progname variable.
prepare_caps_from_permitted_set() use progname value in warning messages. So,
let's rename program_name argument to progname.
BUG/MINOR: startup: fix UAF when set the default for log_tag
In the init_early() global.log_tag is initialized to the string from progname
pointer and global.log_tag.area points to this pointer.
If log-tag keyword is provided in the configuration, its parser at first frees
global.log_tag.area and then it does a new memory allocation to copy
there the argument of log-tag. So, progname no longer points to the valid
memory.
To fix this, let's always keep progname and global.log_tag.area at separate
memory areas. If log_tag will be redefined in the configuration, its parser will
free the memory allocated for the default value in chunk_destroy(). Memory
allocated for progname will be freed in deinit().
This should not be backported as related to the latest master-worker
refactoring.
MINOR: cfgparse-global: add more checks for "chroot" argument
If directory provided as a "chroot" keyword argument does not exist or
inaccessible, this is reported only at the latest initialization stage, when
haproxy tries to perform chroot. Sometimes it's not very convenient, as the
process is already bound to listen sockets.
This was done explicitly in order not to break the case, when haproxy is
launched with "-c" option in some specific environment, where it's not possible
to create or to modify chroot directory, provided in the configuration.
So, let's add more checks for "chroot" directory during the parsing
stage and let's show diagnostic warnings, if this directory has become
non-accesible or was deleted. Like this, users, who wants to catch errors
related to misconfigured chroot before starting the process, can launch haproxy
with -dW and -dD. zero-warning mode will stop the process with error, if any
warning was emitted during initialization stage.
BUG/MINOR: quic: Missing application limitations tracking for BBR
The ->app_limited member of the delivery rate struct (quic_cc_drs) aim is to
store the index of the last transmitted byte marked as application-limited
so that to track the application-limited phases. During these phases,
BBR must ignore delivery rate samples to properly estimate the delivery rate.
Without such a patch, the Startup phase could be exited very quickly with
a very low estimated bottleneck bandwidth. This had a very bad impact
on little objects with download times smaller than the expected Startup phase
duration. For such objects, with enough bandwith, BBR should stay in the Startup
state.
No need to be backported, as BBR is implemented in the current developement version.
Amaury Denoyelle [Thu, 21 Nov 2024 10:07:15 +0000 (11:07 +0100)]
MINOR: quic: support pacing for newreno and nocc
Extend extra pacing support for newreno and nocc congestion algorithms,
as with cubic.
For better extensibility of cc algo definition, define a new flags field
in quic_cc_algo structure. For now, the only value is
QUIC_CC_ALGO_FL_OPT_PACING which is set if pacing support can be
optionally activated. Both cubic, newreno and nocc now supports this.
This new flag is then reused by QUIC config parser. If set, extra
quic-cc-algo burst parameter is taken into account. If positive, this
will activate pacing support on top of the congestion algorithm. As with
cubic previously, pacing is only supported if running under experimental
mode.
Only BBR is not flagged with this new value as pacing is directly
builtin in the algorithm and cannot be turn off. Furthermore, BBR
calculates automatically its value for maximum burst. As such, any
quic-cc-algo burst argument used with BBR is still ignored with a
warning.
Amaury Denoyelle [Thu, 21 Nov 2024 10:15:47 +0000 (11:15 +0100)]
BUG/MINOR: cfgparse-quic: fix warning for cc-aglo with 0 burst
Optional burst argument for quic-cc-algo is used to toggle pacing
support on top of cubic. This is the case if it is positive.
The default value is 0, which do not activate pacing. However, in this
case, an incorrect warning is reported about the parameter being
ignored. Fix this by removing the warning in this case.
MINOR: quic: Useless rate sample member initialization
This poor/inefficient code has been revealed by coverity GH issue in #2788 where
some quic_cc_rs struct member initializations were mentionned as overwritten
(after initialization) before being used as follows:
CID 1565821: Code maintainability issues (UNUSED_VALUE)
/src/quic_cc_bbr.c: 1373 in bbr_handle_lost_packet()
1367 }
1368
1369 static void bbr_handle_lost_packet(struct bbr *bbr, struct quic_cc_path *p,
1370 struct quic_tx_packet *pkt,
1371 uint32_t lost)
1372 {
>>> CID 1565821: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value "0UL" to "rs.tx_in_flight" here, but that stored value is overwritten before it can be used.
1373 struct quic_cc_rs rs = {0};
1374
1375 /* C.delivered = bbr->drs.delivered */
1376 bbr_note_loss(bbr, bbr->drs.delivered);
1377 if (!bbr->bw_probe_samples)
1378 return; /* not a packet sent while probing bandwidth */
Remove the {0} initializer for <rs> variable. This is safe because the members
initializations of <rs> local variable passed to functions from
bbr_handle_lost_packet() are done. Add a comment to mention this.
Amaury Denoyelle [Thu, 21 Nov 2024 09:41:42 +0000 (10:41 +0100)]
MINOR: cfgparse-quic: activate pacing only via burst argument
Recently, pacing support was added for cubic congestion algorithm. This
was activated by using the new token "cubic-pacing" on quic-cc-algo.
Furthermore, it was possible to define a burst size with a new
parameters after congestion token between parenthesis.
This configuration is not oblivious to users. In particular, it can
cause to easily forgot to tweak burst size, which can dramatically
impact performance.
Simplify this by removing the extra "-pacing" suffix. Now, pacing will
be activated solely based on the burst parameter. If 0, burst is
considered as infinite and no pacing will be used. Pacing will be
activating for any positive burst. This better reflects the link between
pacing and burst and its importance.
Note that for the moment, if burst is specified, it will be ignored with
a warning for algorithm outside of cubic.
This is not a breaking change as pacing support was implemented in the
current dev version.
Amaury Denoyelle [Thu, 21 Nov 2024 09:24:34 +0000 (10:24 +0100)]
BUG/MINOR: cfgparse-quic: fix bbr initialization
To support pacing with cubic, a recent change was introduced to render
quic_cc_algo on bind line dynamically allocated, instead of pointing to
a globally defined variable. This allows customization of the algorithm
callbacks per bind line.
This was not correctly used for BBR as it was set to point to the global
quic_cc_algo_bbr. This causes a segfault on haproxy process closing. Fix
this by properly initializing BBR as other algorithms.
This should fix coverity report from github issue #2786.
CLEANUP: cfgparse: Add direction in functions name that warn on misplaced rules
This only concerns functions emitting warnings about misplaced tcp-request
rules. The direction is now specified in the functions name. For instance
"warnif_misplaced_tcp_conn" is replaced by "warnif_misplaced_tcp_req_conn".
MINOR: config: Improve warnings on misplaced rules by adding an optional arg
In warnings about misplaced rules, only the first keyword is mentionned. It
works well for http-request or quic-initial rules for instance. But it is a
bit confusing for tcp-request rules, because the layer is missing (session
or content).
To make it a bit systematic (and genric), the second argument can now be
provided. It can be set to NULL if there is no layer or scope. But
otherwise, it may be specified and it will be reported in the warning.
So the following snippet:
tcp-request content reject if FALSE
tcp-request session reject if FALSE
tcp-request connection reject if FALSE
Will now emit the following warnings:
a 'tcp-request session' rule placed after a 'tcp-request content' rule will still be processed before.
a 'tcp-request connection' rule placed after a 'tcp-request session' rule will still be processed before.
Willy Tarreau [Wed, 20 Nov 2024 17:03:50 +0000 (18:03 +0100)]
BUILD: makefile: reorder object files by build time
mux_spop is quite long to build and was at the end. The rest did not
change much, but the build time is now dominated by hlua.o and mux_h2.o
and by a large margin. On the 80-core ARM mux_h2.o is present from
beginning to end and on the PC it's hlua.o, so both might have to be
split at some point to benefit from multi-core.
Nevertheless, the changes allowed to shrink about one second out of
the 18 it was taking on that machine.
Willy Tarreau [Wed, 20 Nov 2024 16:34:36 +0000 (17:34 +0100)]
BUILD: makefile: build flags.c before haproxy to speed up the build
The end of the build is often super slow. In practice it's flags.o that
now takes ages (3.4 seconds) and blocks everything on a single core at
the end. Let's declare it before the haproxy target so that it starts
earlier. On a quad-2.2 GHz CPU, the build time goes down from 44 to 42s
and the end feels less painful.
DOC: management: Clearly state "show errors" only reports malformed H1 messages
For now, only the H1 multiplexer is able to capture malformed messages. So
it is better to update the management guide accordingly to avoid any
confusion.
DOC: config: Improve documentation of tune.http.maxhdr directive
The description was inproved to clrealy mentionned it is applied on received
requests and responses. In addition, a comment was added about HTTP/2 and
HTTP/3 limitation when messages are encoded to be sent.
BUG/MEDIUM: h3: Increase max number of headers when sending headers
In the same way than for the H2, the maximum number of headers that can be
encoded when headers are sent must be increased to match the limit imposed
when they are received.
Reasons are the sames. On receive path, the maximum number of headers
accepted must be higher than the configured limit to be able to handle
pseudo headers and cookies headers. On the sending path, the same limit must
be applied because the pseudo headers will consume some extra slots and the
cookie header could be splitted.
BUG/MEDIUM: h3: Properly limit the number of headers received
The number of headers are limited before the decoding but pseudo headers and
cookie headers consume extra slots. In practice, this lowers the maximum number
of headers that can be received.
To workaround this issue, the limit is doubled during the frame decoding to be
sure to have enough extra slots. And the number of headers is tested against the
configured limit after the HTX message was created to be able to report an
error. Unfortunatly no parsing error are reported because the QUIC multiplexer
is not able to do so for now.
The same is performed on trailers to be consistent with H2.
BUG/MEDIUM: mux-h2: Check the number of headers in HEADERS frame after decoding
There is no explicit test on the number of headers when a HEADERS frame is
received. It is implicitely limited by the size of the header list. But it
is twice the configured limit to be sure to decode the frame.
So now, a check is performed after the HTX message was created. This way, we
are sure to not exceed the configured limit after the decoding stage. If
there are too many headers, a parsing error is reported.
Note the same is performed on the trailers.
This patch should patially address the issue #2685. It should be backported
to all stable versions.
BUG/MEDIUM: mux-h2: Increase max number of headers when encoding HEADERS frames
When a HEADERS frame is encoded to be sent, the maximum number of headers
allowed in the frame is lower than on receiving path. This can lead to
report a sending error while the message was accepted. It could be
confusing.
In addition, the start-line is splitted into pseudo-headers and consummes
this way some header slots, increasing the difference between HEADERS frames
encoding and decoding. It is even more noticeable because when a HEADERS
frame is decoded, a margin is used to be able to handle splitted cookie
headers. Concretly, on decoding path, a limit of twice the maxumum number of
headers allowed in a message (tune.http.maxhdr * 2) is used. On encoding
path, the exact limit is used. It is not consistent.
Note that when a frame is decoded, we must use a larger limit because the
pseudo headers are reassembled in the start-line and must count for one. But
also because, most of time, the cookies are splitted into several headers
and are reassembled too.
To fix the issue, the same ratio is applied on sending path. A limit must be
defined because an dynamic allocation is not acceptable. Twice of the
configured limit should be good enough to support headers manipulation.
This patch should be backported to all stable versions.
MINOR: quic: TX part modifications to support BBR.
Very few modifications: call ->on_transmit() and ->drs_on_transmit() congestion
control algorithm (quic_cc) callbacks from qc_send_ppkts() just after having
sents some packets.
qc_notify_cc_of_newly_acked_pkts() aim is to notify the congestion algorithm
of all the packet acknowledgements. It must call quic_cc_drs_update_rate_sample()
to update the delivery rate sampling information. It must also call
quic_cc_drs_on_ack_recv() to update the state of the delivery rate sampling part
used by BBR.
Finally, ->on_ack_rcvd() is called with the total number of bytes delivered
by the sender from the newly acknowledged packets with <bytes_delivered> as
parameter to do so. <pkt_delivered> store the per-packet number of bytes
delivered by the newly sent acknowledged packet (the packet with the highest
packet number). <bytes_lost> is also used and has been set by
qc_packet_loss_lookup() before calling qc_notify_cc_of_newly_acked_pkts().
MINOR: quic: quic_loss modifications to support BBR
qc_packet_loss_lookup() aim is to detect the packet losses. This is this function
which must called ->on_pkt_lost() BBR specific callback. It also set
<bytes_lost> passed parameter to the total number of bytes detected as lost upon
an ACK frame receipt for its caller.
Modify qc_release_lost_pkts() to call ->congestion_event() with the send time
from the newest packet detected as lost.
Modify qc_release_lost_pkts() to call ->slow_start() callback only if define
by the congestion control algorithm. This is not the case for BBR.
Add several callbacks to quic_cc_algo struct which are only called by BBR.
->get_drs() may be used to retrieve the delivery rate sampling information
from an congestion algorithm struct (quic_cc).
->on_transmit() must be called before sending any packet a QUIC sender.
->on_ack_rcvd() must be called after having received an ACK.
->on_pkt_lost() must be called after having detected a packet loss.
->congestion_event() must be called after any congestion event detection
Modify quic_cc.c to call ->event only if defined. This is not the case
for BBR.
Here is an extract from the Abstract part to sum up the the capabilities of BBR:
BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent
measurements of a transport connection's delivery rate, round-trip time, and
packet loss rate to build an explicit model of the network path. BBR then uses
this model to control both how fast it sends data and the maximum volume of data
it allows in flight in the network at any time. Relative to loss-based congestion
control algorithms such as Reno [RFC5681] or CUBIC [RFC9438], BBR offers
substantially higher throughput for bottlenecks with shallow buffers or random
losses, and substantially lower queueing delays for bottlenecks with deep buffers
(avoiding "bufferbloat"). BBR can be implemented in any transport protocol that
supports packet-delivery acknowledgment. Thus far, open source implementations
are available for TCP [RFC9293] and QUIC [RFC9000].
In haproxy, this implementation is considered as still experimental. It depends
on the newly implemented pacing feature.
BBR was asked in GH #2516 by @KazuyaKanemura, @osevan and @kennyZ96.
This patch implements an algorithm which may be used by congestion algorithms
for QUIC to estimate the current delivery rate of a sender. It is at least used
by BBR and could be used by others congestion algorithms as cubic.
This algorithm was specified by an RFC draft here:
https://datatracker.ietf.org/doc/html/draft-cheng-iccrg-delivery-rate-estimation
before being merged into BBR v3 here:
https://datatracker.ietf.org/doc/html/draft-cardwell-ccwg-bbr#section-4.5.2.2
Implement the Kathleen Nichols' algorithm used by several congestion control
algorithm implementation (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track
the maximum value of a data type during a fixe time interval.
In this implementation, counters which are periodically reset are used in place
of timestamps.
Only the max part has been implemented.
(see lib/minmax.c implemenation for Linux kernel).
MINOR: quic: Add the congestion window initial value to QUIC path
Add ->initial_wnd new member to quic_cc_path struct to keep the initial value
of the congestion window. This member is initialized as soon as a QUIC connection
is allocated. This modification is required for BBR congestion control algorithm.
-Ws on VTest is not working correctly for an unknown reason, the polling
of the NOTIFY_SOCKET seems to timeout, and VTest never receives the
READY message.
This patch disables the reg-tests using -Ws on OS X.
REGTESTS: switch to -Ws for master-worker reg-tests
The -W mode implemented in VTest is not reliable anymore, because VTest
waits for the pidfile to be created. But with the new master-worker
mode, this file is created long before haproxy is ready. This can lead
to the test being started too soon, and failing from time to time.
The -Ws option allows to wait for haproxy to deliver a message to VTest
once it is ready.
OPTION: map/hlua: make core.set_map() lookup more efficient
0844bed7d3 ("MEDIUM: map/acl: Improve pat_ref_set() efficiency (for
"set-map", "add-acl" action perfs)") improved lookup efficiency for
set-map http action, but the core.set_map() lua method which is built
on the same construct was overlooked. Let's also benefit from this optim
as it easily applies.
Willy Tarreau [Wed, 20 Nov 2024 14:51:35 +0000 (15:51 +0100)]
DOC: config: indent the list of environment variables
In the doc our lists are indented but for any reason this one was not,
making it harder to visually delimit. Let's just indent it. No need to
backport this, it's totally cosmetic and would need adaptations since
it was recently touched.
There are some variables, which are set by HAProxy process (HAPROXY_*). Some
of them are handy to check or to redefine in the configuration, in order to
create conditional blocks and make the configuration more flexible. But it
wasn't clear in the documentation, which variables are really safe and usefull
to redefine and which ones could be only read via "show env" output.
Latest changes in master-worker architecture makes the existed description even
more confusing.
So let's sort all HAPROXY_* variables to four categories and let's also mark
explicitly, which ones are set in which process, when haproxy is started in
master-worker mode.
In addition, update examples in chapter "2.4. Conditional blocks". This might
bring more ideas for users how HAPROXY_* variables could be used in the
conditional blocks.
Before this patch HAPROXY_LOCALPEER variable could be set in init_early(),
in init_args() and in cfg_parse_global(). In master-worker mode, if localpeer
keyword set in the global section, HAPROXY_LOCALPEER in the worker
environment is set to this keyword's value, but in the master environment it
still keeps the default, a localhost name. This is confusing.
To fix it, let's set HAPROXY_LOCALPEER only once, when a worker or process in a
standalone mode has finished to parse its configuration. And let's set this
variable only for the worker process or for the process in a standalone mode,
because the master doesn't need it.
HAPROXY_LOCALPEER takes the value saved in localpeer global variable, which is
always set by default in init_early() to the local hostname. Then, localpeer
could be reset in init_args (-L option) and in cfg_parse_global() (while
parsing "localpeer" keyword).
Willy Tarreau [Wed, 20 Nov 2024 13:44:28 +0000 (14:44 +0100)]
BUILD: makefile: make ERR apply to build options as well
Once in a while we find some makefiles ignoring some outdated arguments
and just emit a warning. What's annoying is that if users (say, distro
packagers), have purposely added ERR=1 to their build scripts to make
sure to fail on any warning, these ones will be ignored and the build
can continue with invalid or missing options.
William rightfully suggested that ERR=1 should also catch make's warnings
so this patch implements this, by creating a new "complain" variable that
points either to "error" or "warning" depending on $(ERR), and that is
used to send the messages using $(call $(complain),...). This does the
job right at little effort (tested from GNU make 3.82 to 4.3).
Note that for this purpose the ERR declaration was upped in the makefile
so that it appears before the new errors.mk file is included.
MEDIUM: mworker: remove USE_SYSTEMD requirement for -Ws
Since sd_notify() is now implemented in src/systemd.c, there is no need
anymore to build its support conditionnally with USE_SYSTEMD.
This patch add supports for -Ws for every build and removes the
USE_SYSTEMD build option. It also remove every reference to USE_SYSTEMD
in the documentation and the CI.
This also allows to run the reg-tests in -Ws with the new VTest support.
Amaury Denoyelle [Wed, 20 Nov 2024 10:09:35 +0000 (11:09 +0100)]
BUG/MINOR: cfgparse-quic: fix renaming of max-window-size
A patch has recently tried to rename QUIC max-window-size global
parameter to default-max-window-size to better reflect its usage.
However, only the documentation was edited but not cfgparse-quic.c.
Fix this by updating cfgparse-quic.c with the new default- naming.
MINOR: http-fetch: Add an option to 'query" to get the QS with the '?'
As mentionned by Thayne McCombs in #2728, it could be handy to have a sample
fetch function to retrieve the query string with the question mark
character.
Indeed, for now, "query" sample fetch function already extract the query
string from the path, but the question mark character is not
included. Instead of adding a new sample fetch function with a too similar
name, an optional argument is added to "query". If "with_qm" is passed as
argument, the question mark will be included in the query string, but only
if it is not empty.
Thanks to this patch, the following rule:
http-request redirect location /destination?%[query] if { -m found query } some_condition
http-request redirect location /destination if some_condition
can now be expressed this way:
http-request redirect location /destination%[query(with_qm)] if some_condition
BUG/MINOR: http-ana: Adjust the server status before the L7 retries
The server status must be adjusted, if necessary, at each retry. It is
properly performed when "obersve layer4" directive is set. But for the layer
7, only the last attempt was considered.
When the L7 retries were implemented, all retries were added before the
server status adjutement. So only the last attempt was considered. To fix
the issue, we must adjut the server status first, and then try to perform a
L7 retry.
This patch should fix the issue #2679. It must be backported to all stable
versions.
Willy Tarreau [Wed, 20 Nov 2024 07:44:39 +0000 (08:44 +0100)]
DOC: configuration: explain quotes and spaces in conditional blocks
Conditional blocks inherit the same tokenizer and argument parser as
the rest of the configuration, but are also silently concatenated
around groups of spaces and tabs. This can lead to subtle failures
for configs containing spaces around commas and parenthesis, where
a string comparison might silently fail for example. Let's better
document this particular case.
Thanks to Valentine for analysing and reporting the problem.
Willy Tarreau [Wed, 20 Nov 2024 07:42:02 +0000 (08:42 +0100)]
DOC: configuration: explain the rules regarding spaces in arguments
Spaces around commas or parenthesis in expressions are generally part
of the value due to the long history of supporting unquoted arguments.
But this tends to come as a surprise to new users and sometimes creates
subtly invalid configurations. Let's add some text covering this.
Willy Tarreau [Tue, 19 Nov 2024 14:56:59 +0000 (15:56 +0100)]
MINOR: tasklet: support an optional set of wakeup flags to tasklet_wakeup_on()
tasklet_wakeup_on() and its derivates (tasklet_wakeup_after() and
tasklet_wakeup()) do not support passing a wakeup cause like
task_wakeup(). This is essentially due to an API limitation cause by
the fact that for a very long time the only reason for waking up was
to process pending I/O. But with the growing complexity of mux tasks,
it is becoming important to be able to skip certain heavy processing
when not strictly needed.
One possibility is to permit the caller of tasklet_wakeup() to pass
flags like task_wakeup(). Instead of going with a complex naming scheme,
let's simply make the flags optional and be zero when not specified. This
means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the
third one is the optional flags to be passed to the callee. Eligible flags
are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*)
which are cleared when the tasklet is executed. This way the handler
will find them in its <state> argument and will be able to distinguish
various causes for the call.
Willy Tarreau [Tue, 19 Nov 2024 14:33:39 +0000 (15:33 +0100)]
MINOR: tasklet: make the low-level tasklet API take a flag
Everything in the tasklet layer supports flags, except that they are
just not implemented in the wakeup functions, while they are in the
task_wakeup functions. Initially it was not considered useful to pass
wakeup causes because these were essentially I/O, but with the growing
number of I/O handlers having to deal with various types of operations
(typically cheap I/O notifications on subscribe vs heavy parsing on
application-level wakeups), it would be nice to start to make this
distinction possible.
This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after()
to pass a set of flags that continues to be set as zero. For now this
changes nothing, but new functions will come.
Willy Tarreau [Tue, 19 Nov 2024 15:27:46 +0000 (16:27 +0100)]
MINOR: sched: add TASK_F_WANTS_TIME to make the scheduler update the call date
Currently tasks being profiled have th_ctx->sched_call_date set to the
current nanosecond in monotonic time. But there's no other way to have
this, despite the scheduler being capable of it. Let's just declare a
new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time
just before calling the handler. This way, a task that needs nanosecond
resolution on the call date will be able to be called with an up-to-date
date without having to abuse now_mono_time() if not needed. In addition,
if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0),
the date is set to the most recently known now_ns, which is guaranteed
to be atomic and is only updated once per poll loop.
This date can be more conveniently retrieved using task_mono_time().
This can be useful, e.g. for pacing. The code was slightly adjusted so
as to merge the common parts between the profiling case and this one.
Willy Tarreau [Tue, 19 Nov 2024 17:14:05 +0000 (18:14 +0100)]
MINOR: tinfo/clock: turn sched_call_date to 64-bits
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
Willy Tarreau [Tue, 19 Nov 2024 15:42:40 +0000 (16:42 +0100)]
MINOR: stream: don't update s->lat_time when the wakeup date is not set
In 2.7 was added a stream wakeup latency calculation with commit 6a28a30efa ("MINOR: tasks: do not keep cpu and latency times in struct
task"). However, due to the transformation of the previous code, it
kept unconditionally updating s->lat_time even of the sched_wake_date
was zero. In other words, s->lat_time is constantly updated for the
huge majority of calls that are made without profiling. Let's just
check the sched_wake_date status before doing so.
Willy Tarreau [Tue, 19 Nov 2024 15:26:04 +0000 (16:26 +0100)]
DOC: sched: document the missing TASK_F_UEVT* flags
These are user-defined one-shot events that are application-specific
and reset upon wakeup and were not documented. No backport is needed
since these were added to 3.1.
Contrary to what the doc states, it is not expected (nor relevant) to
use yield-dependent methods such as core.yield() or core.(m)sleep() from
contexts that don't support yielding. Such contexts include body, init,
fetches and converters.
Thus the doc got it wrong since the beginning, because such methods were
never supported from the above contexts, yet it was listed in the list
of compatible contexts (probably the result of a copy-paste), which is
error-prone because it could either cause a Lua runtime error to be
thrown, or be ignored in some other cases.
Every reg-test now runs without any warning, so let's acivate -dW by
default so the new ones will inheritate the option.
This patch reverts 9d511b3c ("REGTESTS: enable -dW on almost all tests
to fail on warnings") and adds -dW in the default HAPROXY_ARGS of
scripts/run-regtests.sh instead.
Most of the invalid or unknow field in the stats-file parser are ignored
silently, which is not the case of the frontend/backend mismatch on a
guid, which is kind of strange.
Since this is ""documented"" to be ignored in the
reg-tests/stats/sample-stats-file file, let's also ignore this kind of
line. This will allow to run the associated reg-test with -dW.
Define a new QUIC congestion algorithm token 'cubic-pacing' for
quic-cc-algo bind keyword. This is identical to default cubic
implementation, except that pacing is used for STREAM frames emission.
This algorithm supports an extra argument to specify a burst size. This
is stored into a new bind_conf member named quic_pacing_burst which can
be reuse to initialize quic path.
Pacing support is still considered experimental. As such, 'cubic-pacing'
can only be used with expose-experimental-directives set.
Modify quic-cc-algo for better extensability of optional parameters
parsing. This will be useful to support a new parameter for maximum
allowed pacing burst size.
Take this opportunity to refine quic-cc-algo documentation. Optional
parameters are now presented as a list which would be soon extended.
Amaury Denoyelle [Tue, 19 Nov 2024 09:12:27 +0000 (10:12 +0100)]
MINOR: quic: use dynamic cc_algo on bind_conf
A QUIC congestion algorithm can be specified on the bind line via
keyword quic-cc-algo. As such, bind_conf structure has a member
quic_cc_algo.
Previously, if quic-cc-algo was set, bind_conf member was initialized to
one of the globally defined CC algo structure. This patch changes
bind_conf quic_cc_algo initialization to point to a dynamically
allocated copy of CC algo structure.
With this change, it will be possible to tweak individually each CC algo
of a bind line. This will be used to activate pacing on top of the
congestion algorithm.
As bind_conf member is dynamically allocated now, its member is now
freed via free_proxy() to prevent any leak.
Amaury Denoyelle [Mon, 18 Nov 2024 13:55:41 +0000 (14:55 +0100)]
MAJOR: mux-quic: support pacing emission
Support pacing emission for STREAM frames at the QUIC MUX layer. This is
implemented by adding a quic_pacer engine into QCC structure.
The main changes have been written into qcc_io_send(). It now
differentiates cases when some frames have been rejected by transport
layer. This can occur as previously due to congestion or FD buffer full,
which requires subscribing on transport layer. The new case is when
emission has been interrupted due to pacing timing. In this case, QUIC
MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1.
On tasklet execution, if TASK_F_USR1 is set, all standard processing for
emission and reception is skipped. Instead, a new function
qcc_purge_sending() is called. Its purpose is to retry emission with the
saved STREAM frames list. Either all remaining frames can now be send,
subscribe is done on transport error or tasklet must be rescheduled for
pacing purging.
In the meantime, if tasklet is rescheduled due to other conditions,
TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM
frames. In this case, pacing expiration must be check before calling
qcc_send_frames() to ensure emission is now allowed.
Amaury Denoyelle [Wed, 16 Oct 2024 15:56:15 +0000 (17:56 +0200)]
MINOR: mux-quic: encapsulate QCC tasklet wakeup
QUIC MUX will be responsible to drive emission with pacing. This will be
implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To
prepare this, encapsulate each I/O tasklet wakeup into a new function
qcc_wakeup().
This commit is purely refactoring prior to pacing implementation into
QUIC MUX.
Amaury Denoyelle [Mon, 18 Nov 2024 08:53:39 +0000 (09:53 +0100)]
MINOR: mux-quic: define a tx STREAM frame list member
For STREAM emission, MUX QUIC previously used a local list defined under
qcc_io_send(). This was suitable as either all frames were sent, or
emission must be interrupted due to transport congestion or fatal error.
In the latter case, the list was emptied anyway and a new frame list was
built on future qcc_io_send() invokation.
For pacing, MUX QUIC may have to save the frame list if pacing should be
applied across emission. This is necessary to avoid to unnecessarily
rebuilt stream frame list between each paced emission. To support this,
STREAM list is now stored as a member of QCC structure.
Ensure frame list is always deleted, even on QCC release, using newly
defined utility function qcc_tx_frms_free().
Amaury Denoyelle [Mon, 18 Nov 2024 14:31:12 +0000 (15:31 +0100)]
MINOR: quic/pacing: add burst support
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.
A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.
quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
Amaury Denoyelle [Thu, 24 Oct 2024 14:06:48 +0000 (16:06 +0200)]
MINOR: quic/pacing: support pacing emission on quic_conn layer
Pacing will be implemented for STREAM frames emission. As such,
qc_send_mux() API has been extended to add an argument to a quic_pacer
engine.
If non NULL, engine will be used to pace emission. In short, no more
than one datagram will be emitted for each qc_send_mux() invokation.
Pacer is then notified about the emission and a timer for a future
emission is calculated. qc_send_mux() will return PACING error value, to
inform QUIC MUX layer that it will be responsible to retry emission
after some delay.
Amaury Denoyelle [Mon, 18 Nov 2024 13:57:39 +0000 (14:57 +0100)]
MINOR: quic/pacing: implement quic_pacer engine
Extend quic_pacer engine to support pacing emission. Several functions
are defined.
* quic_pacing_sent_done() to notify engine about an emission of one or
several datagrams
* quic_pacing_expired() to check if emission should be delayed or can be
conducted immediately
Amaury Denoyelle [Fri, 25 Oct 2024 14:31:26 +0000 (16:31 +0200)]
MINOR: quic: extend qc_send_mux() return type with a dedicated enum
This commit is part of a adjustment on QUIC transport send API to
support pacing. Here, qc_send_mux() return type has been changed to use
a new enum quic_tx_err.
This is useful to explain different failure causes of emission. For now,
only two values have been defined : NONE and FATAL. When pacing will be
implemented, a new value would be added to specify that emission was
interrupted on pacing. This won't be a fatal error as this allows to
retry emission but not immediately.
Amaury Denoyelle [Fri, 11 Oct 2024 15:57:57 +0000 (17:57 +0200)]
MINOR: quic: support a max number of built packet per send iteration
Extend QUIC transport emission function to support a maximum datagram
argument. The purpose is to ensure that qc_send() won't emit more than
the specified value, unless it is 0 which is considered as unlimited.
In qc_prep_pkts(), a counter of built datagram has been added to support
this. The packet building loop is interrupted if it reaches a specified
maximum value. Also, its return value has been changed to the number of
prepared datagrams. This is reused by qc_send() to interrupt its work if
a specified max datagram argument value is reached over one or several
iteration of prepared/sent datagrams.
This change is necessary to support pacing emission. Note that ideally,
the total length in bytes of emitted datagrams should be taken into
account instead of the raw number of datagrams. However, for a first
implementation, it was deemed easier to implement it with the latter.
Amaury Denoyelle [Wed, 16 Oct 2024 15:50:30 +0000 (17:50 +0200)]
MINOR: quic: simplify qc_prep_pkts() exit path
To prepare pacing support, qc_prep_pkts() exit path have been rewritten
to be easily modified. This is purely refactoring which should not have
any functional change :
* a dedicated error path has been added
* ensure qc_txb_store() is always called to finalize datagram on normal
exit path if first_pkt is not NULL. Needed to support breaking from
packet building loop in a easier way.
Amaury Denoyelle [Tue, 19 Nov 2024 14:11:20 +0000 (15:11 +0100)]
DOC: quic: rename max-window-size as with default prefix
Rename 'tune.quic.frontend.max-window-size' with the prefix 'default-'.
This highlights the fact that it is not a hard limit, as it can be
overriden if specifying an optional window size via quic-cc-algo on a
bind line.
No need to backport as this keyword was added on the current dev
version.
BUG/MINOR: http_ana: Report -1 for %Tr for invalid response only
The server response time is erroneously reported as -1 when it is
intercepted by HAProxy.
As stated in the documentation, the server response time is reported as -1
when the last response header was never seen. It happens when a server
timeout is triggered before the server managed to process the request. It
also happens if the response is invalid. This may be reported by the mux
during the response parsing, but also by the HTTP analyzers. However, in
this last case, the response time must only be reported as -1 on 502.
This patch must be backported to all stable versions. It should fix the
issue #2384.
When the request is too large to fit in a buffer a 414 or a 431 error
message is returned depending on the error state of the request parser. A
414 is returned if the URI is too long, otherwise a 431 is returned.
MINOR: http: Add support for HTTP 414/431 status codes
414-Uri-Too-Long and 431-Request-Header-Fields-Too-Large are now part of
supported status codes that can be define as error files. The hash table
defined in http_get_status_idx() was updated accordingly.
MINOR: stream: Add an option to "show sess" command to dump the captured URI
"show sess" command now supports a list of options that can be set after all
other possible arguments (<id>, all...). For now, "show-uri" is the only
supported option. With this options, the captured URI, if non-null, is added
to the dump of a stream, complete or now. The URI may be anonymized if
necessary.
MINOR: agent-check: Be able to set absolute weight via an agent
Historically, an agent-check program is only able to set a proportial weight
to the initial server's weight. However, it could be handy to also set an
absolute value. It is the purpose of this patch.
Instead of changing the current way to set a server's weight, a new
agent-check command is introduced. The string "weight:", followed by an
positive interger or a positive interger percentage, can now be used. If the
value ends with the '%' sign, then the new weight will be proportional to
the initially weight of the server. Otherwise, the value is considered as an
absolute weight and must be between 0 and 256.
MINOR: http-ana: Add support for "set-cookie-fmt" option to redirect rules
It is now possible to use a log-format string to define the "Set-Cookie"
header value of a response generated by a redirect rule. There is no special
check on the result format and it is not possible during the configuration
parsing. It is proably not a big deal because already existing "set-cookie"
and "clear-cookie" options don't perform any check.
MINOR: http-ana: Add option to keep query-string on a localtion-based redirect
On prefix-based redirect, there is an option to drop the query-string of the
location. Here it is the opposite. an option is added to preserve the
query-string of the original URI for a localtion-based redirect.
By setting "keep-query" option, for a location-based redirect only, the
query-string of the original URI is appended to the location. If there is no
query-string, nothing is added (no empty '?'). If there is already a
non-empty query-string on the localtion, the original one is appended with
'&' separator.
MINOR: config: show HAPROXY_BRANCH in "show env" output
Before this patch HAPROXY_BRANCH was unset just after configuration parsing.
Let's keep it, as it could be used in conditional blocks and some
configuration directives and it's handy to check its runtime value via "show
env".
In master-worker mode, this variable is set to the same value for both
processes.
MINOR: cli: make "show env" accessible via master CLI without enabling debug
Before this patch, we have need to put the master CLI in debug mode to be able
to issue 'show env' command for the master process. Output of this command is
handy even for the master process context, as it allows to control its
environment variables, which could be used/modified in the 'global' section.
So, let's provide in 'show env' command structure the level ACCESS_MASTER.
This allows to see and to access this command in master CLI without putting it
in debug mode.
BUG/MINOR: mworker-prog: don't warn about deprecated section with expose-deprecated-directives
As master parses now expose-deprecated-directives option, let's emit warning
about deprecated 'progam' section only in case, if this option wasn't set in
the 'global' section. This allows to people, who don't prefer to remove the
'program' section immediately to continue to start the process in zero-warning
mode.
Adjust the warning message accordingly and mcli_start_progs.vtc test. As
expose-deprecated-directives option is a 'global' section keyword, this section
must always precede any 'program' section, if users still continue to keep
'program' section.
This doesn't need to be backported, as related to the latest changes in
the master-worker architecture.
MINOR: cfgparse-global: parse options to allow non std keywords in discovery mode
'Program' section is considered as deprecated now, see the commit 581c8a27d98c
("MEDIUM: mworker: depreciate the 'program' section"). So, the 'program'
section parser emits a warning every time since this commit, if its section is
presented. This makes impossible to launch the process in zero-warning mode.
After master-worker refactoring only the master process parses the 'program'
section. So, at first, in order to be able to start in zero-warning mode, we
need to parse in master process option, which allows deprecated keywords. Thus,
let's set in this commit KWF_DISCOVERY flag to
cfg_parse_global_non_std_directives parser, which parses
'expose-deprecated-directives' and 'expose-deprecated-directives' options.
Willy Tarreau [Tue, 19 Nov 2024 09:56:45 +0000 (10:56 +0100)]
MINOR: ring: support unit suffixes in the size
The ring size used to take only numbers and silently ignore letters (due
to atol()), resulting it tiny buffers when trying to collect traces and
using e.g. "size 10g". Let's make use of parse_size_err() to properly
parse units.