OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR
Doing some benchmarks on 3.0, we encountered a small loss in requests per
second on small objects compared to 2.8. After bisecting the issue, it
appeared that it was introduced when the mux-to-mux zero-copy data forwarding
was implemented in 2.9-dev8. Extra subscribes for receives at the end of the
message were responsible for the loss.
A basic configuration, sending H2 requests to an H1 server returning
responses without payload, is enough to observe the issue. With the following
command, we can observe a huge increase in epoll_ctl calls on 2.9/3.x:
h2load -c 100 -m 10 -n 100000 http://...
On 2.8 we have around 3200 calls to epoll_ctl against more than 20k on 3.1.
The fix seems obvious. After a receive, there is no reason to state that a
mux has more data to deliver if the EOI/EOS/ERROR flag was set on the
stream-endpoint descriptor. With this change, the extra calls to epoll_ctl
disappear. However, it is a sensitive part, so it is important to keep an eye
on it and not to backport it.
Thanks to Willy and Emeric for having spotted the issue.
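The change essentially boils down to this (a sketch only; the helpers exist
in stconn.h, but the exact placement in the receive path is an assumption):

  /* after a receive: if the endpoint reported EOI/EOS/ERROR, do not
   * pretend more data is pending, otherwise the stconn re-subscribes
   * for receives and triggers the extra epoll_ctl calls seen above.
   */
  if (sc_ep_test(sc, SE_FL_EOI | SE_FL_EOS | SE_FL_ERROR))
      se_have_no_more_data(sc->sedesc);
  else
      se_have_more_data(sc->sedesc);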
OPTIM: channel: speed up co_getline()'s search of the end of line
Previously, co_getline() was essentially used for occasional parsing
in the peers banner or Lua, so it could afford to read one character at
a time. However, it is now also used on the TCP log path, where it can
consume up to 40% CPU as mentioned in GH issue #2731. Let's speed it
up by using memchr() to look for the LF and copying the data at once
using memcpy().
Previously it would take 2.44s to consume 1 GB of log on a single
thread of a Core i7-8650U, now it takes 1.56s (-36%).
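The core of the new lookup looks like this (a simplified sketch that assumes
a contiguous buffer; the real code must also deal with wrapping):

  /* find the LF in one pass, then copy the whole block at once */
  const char *head = co_head(chn);            /* start of output data */
  size_t avail = co_data(chn);
  const char *lf = memchr(head, '\n', avail);
  size_t len = lf ? (size_t)(lf - head) + 1 : avail;

  if (len > size)
      len = size;                             /* clamp to caller's buffer */
  memcpy(blk, head, len);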
MINOR: tools: do not attempt to use backtrace() on linux without glibc
The function is provided by glibc. Nothing prevents us from using our
own implementation outside of glibc there (tested on aarch64 with musl).
We still do not enable it by default as we don't yet know if all archs
work well, but it's sufficient to pass USE_BACKTRACE=1 when building with
musl to verify it's OK.
BUILD: tools: only include execinfo.h for the real backtrace() function
No need to include this possibly non-existing file when using our own
backtrace() implementation; it's only needed for the libc-provided one.
Because of this, it's currently not possible to build with backtrace
enabled on musl.
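The include then becomes conditional, along these lines (the guard macro
name is illustrative, not the real one):

  #if defined(USE_BACKTRACE) && defined(HA_LIBC_BACKTRACE) /* illustrative */
  #include <execinfo.h>  /* only needed for the libc-provided backtrace() */
  #endif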
MINOR: server: make srv_shutdown_sessions() call pendconn_redistribute()
When shutting down server sessions, the queue was not considered, which
is a problem if some element reached the queue at the moment the server
was going down, because there would be no more requests to kick them out
of it. Let's always make sure we scan the queue to kick these streams
out of it so that they can possibly find a more suitable server. This
may make a difference in the time it takes to shut down a server on the
CLI when lots of streams are in the queue.
It might be interesting to backport this to 3.0 but probably not much
further.
BUG/MINOR: queue: make sure that maintenance redispatches server queue
Turning a server to maintenance currently doesn't redispatch the server
queue unless there's an explicit "option redispatch" and no "option
persist", while the former has never really been the purpose of this
test. Better refine this so that forced maintenance also causes the
queue to be flushed, and possibly redispatched unless the proxy has
"option persist". This way, when turning a server to maintenance, the
queue is immediately flushed and streams can decide what to do.
This can be backported, though there's no need to go far since it was
never directly reported and was only noticed as part of debugging some
rare "shutdown sessions" strangeness, which it might contribute to.
BUG/MINOR: server: make sure the HMAINT state is part of MAINT
In 1.8 when adding "set server fqdn" with commit b418c1228c ("MINOR:
server: cli: Add server FQDNs to server-state file and stats socket."),
the HMAINT flag was not made part of the MAINT ones, so technically
speaking when changing the FQDN, the server is not completely considered
as in maintenance mode.
In its defense, the code location around that was completely messy, with
the aggregator flag being hidden between other values and purposely but
discreetly ignoring one of the flags, so the comments were updated to
make the intent clearer (particularly regarding CMAINT, which looked like
it was also forgotten while it was on purpose).
BUG/MEDIUM: stream: make stream_shutdown() async-safe
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work well, because there exist code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution consists in deferring the shutdown to the stream itself and
just waking it up for that.
The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least two types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey this information. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().
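Concretely, a caller on any thread only has to do something like this (a
sketch; the flag selection mirrors the description above):

  /* ask the stream to shut itself down from its own thread context;
   * its handler will then run stream_shutdown_self() accordingly.
   */
  task_wakeup(s->task, TASK_WOKEN_OTHER |
              (why == SF_ERR_DOWN ? TASK_F_UEVT1 : TASK_F_UEVT2));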
This needs to be carefully backported to all branches affected by the
dequeuing issue, i.e. those containing any of 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()") and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
MINOR: task: define two new one-shot events for use with WOKEN_OTHER or MSG
TASK_WOKEN_MSG only says "someone sent you a message" but doesn't convey
any info about the message. TASK_WOKEN_OTHER says "you're woken for another
reason" but doesn't tell which one. Most often they're used as-is by the
task handlers to report very specific situations.
For some important control notifications, having the ability to modulate
the message a little bit is useful, so let's define two user event types
UEVT1 and UEVT2 to be used in conjunction with TASK_WOKEN_MSG or _OTHER
so that the application can know that a specific condition was explicitly
requested. It will be used this way:
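  /* sketch: wake the target task with a qualified one-shot event */
  task_wakeup(s->task, TASK_WOKEN_MSG | TASK_F_UEVT1);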
Since events are cumulative, keep in mind not to consider a 3rd value
as the combination of EVT1+EVT2; these really mean that the two events
appeared (though in unspecified order).
Revert "BUG/MINOR: server: shut down streams under thread isolation"
Thread isolation does not work well for this: there exist code paths
which already hold the server's lock and result in a deadlock. Let's
revert that and address it better without isolation.
Now that the proxy "log-steps" keyword is implemented and usable since the
previous commit ("MEDIUM: log: consider log-steps proxy setting for existing
log origins"), let's add some tests for it in reg-tests/log/log_profile.vtc.
MEDIUM: log: consider log-steps proxy setting for existing log origins
During TCP/HTTP transaction processing, haproxy may produce logs at
different steps of the processing (accept, connect, request,
response, close). But the behavior is hardly configurable, because
haproxy will only emit a single log per transaction, and by default
it will try to produce the log once all log aliases or fetches used
in the logformat can be satisfied, which means the log is often
emitted during connection teardown, unless "option logasap" is used.
We were often asked for a way to emit multiple logs for a single
transaction, for instance a log at accept time, then at the request,
response and close steps; see GH #401 for more context.
Thanks to the "log-steps" keyword introduced by commit "MINOR: log:
introduce "log-steps" proxy keyword", it is now possible to explicitly
configure when logs should be generated by haproxy when processing a
transaction. This commit adds the required checks so that the log-steps
proxy option is properly considered for existing logs generated by
haproxy. If "log-steps" is not specified on the proxy, the old behavior
is preserved.
Note: a slight CPU overhead should only be visible when the "log-steps"
keyword is used, due to the implementation relying on an eb32 lookup
instead of a basic bitfield check as described in "MINOR: proxy: add
log_steps struct member". However, the default behavior shouldn't be
affected.
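The runtime check is conceptually this simple (a sketch; the helper and
field names are assumptions based on this series):

  /* should this proxy emit a log for the given origin? */
  static int log_step_enabled(struct proxy *px, enum log_orig_id id)
  {
      if (eb_is_empty(&px->conf.log_steps))
          return 1; /* no "log-steps" set: keep the historical behavior */
      return eb32_lookup(&px->conf.log_steps, id) != NULL;
  }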
When combining log-steps with log-profiles, the user has the ability to
explicitly control how and when haproxy should generate logs during
request handling.
For now it is only available for proxies with the frontend capability,
because log-steps are only evaluated under sess_log() or strm_log(), which
essentially focus on the frontend side when it comes to log settings, so
it's better to keep it this way for consistency, at least for now.
MINOR: log: introduce "log-steps" proxy keyword
For now the setting does nothing (it is not considered during runtime);
it will be implemented and documented in upcoming commits.
MINOR: proxy: add log_steps struct member
Add a proxy->conf.log_steps eb32 root tree which will be used to store the
log origin identifiers that should result in haproxy emitting a log, as
configured by the user through the upcoming "log-steps" proxy keyword.
It was chosen to use an eb32 tree instead of a simple bitfield because,
despite the slight overhead, it is more future-proof given that we already
implemented the prerequisites for seamless custom log origin registration,
which will also be usable from the "log-steps" proxy keyword.
MINOR: log: support extra log origins for '%OG' alias
Following the previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to a string from the origin ID.
For that, it is required to add an eb32 tree node to the log_origin struct
in order to enable quick integer lookups during runtime. A slow name lookup
using the list is acceptable for config parsing, but not during runtime,
when log_orig_to_str() is expected to be used. Also, to prevent duplicated
info, get rid of the ->id field and use ->tree.key instead.
MINOR: log: explicitly handle extra log origins as error when relevant
Thanks to the previous commit, we can now check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check it in a few places
to handle the log message differently, because if the flag is set then the
caller expects the log to be explicitly handled as an error.
E.g. in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.
Add 'struct log_orig', which wraps 'enum log_orig_id' with optional flags
(no flags defined for now).
Add a log_orig() helper function that takes an id and flags as parameters
and returns a log_orig struct initialized with the input arguments.
Update functions taking a log origin as parameter so they explicitly take
either the log orig id or the log orig wrapper as argument, depending on
the level of context expected by the function.
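Put together, the wrapper looks roughly like this (a sketch based on the
description above, not the literal code):

  struct log_orig {
      enum log_orig_id id; /* the log origin identifier */
      uint16_t flags;      /* optional LOG_ORIG_FL_* flags (none yet) */
  };

  static inline struct log_orig log_orig(enum log_orig_id id, uint16_t flags)
  {
      struct log_orig orig = { .id = id, .flags = flags };
      return orig;
  }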
MINOR: log: handle extra log origins in _process_send_log_override()
Thanks to the previous commit, it is now possible to register additional
log origins that may be used from the log-profile section as 'on' steps.
As such, let's make the _process_send_log_override() function aware of them
by trying a lookup in the tree of extra logging steps in the default
switch-case catchall. If the log origin id matches the id of an extra
logging step, we use the associated log format instead of the "any" log
format.
Add a way to register additional log origins using log_orig_register();
these may be used as log profile steps from log-profile sections. For now
this does nothing, as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.
When specifying an extra logging step for "on <step>" under a log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and only a slight one if they are.
Don't forget to update the documentation when new log origins are added
(both the %OG log alias and the "on <step>" log-profile keyword are
concerned).
BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.
Backporting: older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners; lpx should then simply be
passed directly to resume_listener.
BUG/MEDIUM: cli: Be sure to catch immediate client abort
A client abort while nothing was sent is properly handled, except when it
happens immediately after the connection was accepted. In that case, the
read0 event is caught before the CLI applet is created, the shutdown is not
handled, and the applet is never woken up again. The stream then remains
blocked and no timeout is armed.
The bug was due to the fact that when the applet I/O handler was called for
the first time, the applet context was initialized and nothing more was
performed. A shutdown, if any, would be handled on the next call. In that
case, it was too late.
Now, after the init step, we loop to evaluate the first command. There is no
command here, but the shutdown will be tested.
This patch should fix the issue #2727. It must be backported to 3.0.
MEDIUM: mailers: warn about deprecated legacy mailers
As mentioned in the 2.8 announcement on the mailing list [1] and on the
wiki [2], the use of legacy mailers is now deprecated and will no longer be
supported starting with version 3.3. The use of a Lua script (AKA Lua
mailers) is now encouraged (and fully supported since 2.8) for this purpose,
as it offers more flexibility (e.g. alerts can be customized) and is more
future-proof. Configurations relying on legacy mailers will now raise a
warning.
Users willing to keep their existing mailers config in a working state
should simply add the following line to their global section:
# mailers.lua file as provided in the git repository
# adjust path as needed
lua-load examples/lua/mailers.lua
Add the missing wait for Slg4 introduced in f8299bc ("MINOR: log: "drop"
support for log-profile steps"), and the missing barrier increase due to
the use of barrier sync, which could have made the regtest
timing-sensitive and thus less reliable.
Also, the "error" check in Slg4 wasn't even considered because it is
emitted by frontend 4, not frontend 2.
BUG/MINOR: proxy: also make the cli and resolvers use the global name
As detected by ASAN on the CI, two places still using strdup() on the
proxy names were left by commit b325453c3 ("MINOR: proxy: use the global
file names for conf->file").
BUG/MINOR: server: shut down streams under thread isolation
Since the beginning of thread support, the shutdown of streams attached
to a server was run under the server's lock, but that's not sufficient.
It indeed turns out that shutting down streams (either from the CLI using
"shutdown sessions server XXX" or due to "on-error shutdown-sessions")
iterates over all the streams to shut them down, but stream_shutdown()
has no way to protect its actions against concurrent actions from the
stream itself on another thread, and streams offer no such provisions
anyway.
The impact is some rare but possible crashes when shutting down streams
from the CLI in competition with high server traffic. The probability
is low enough to mark it minor, though it was observed in the field.
At least since 2.4 the streams are arranged in per-thread lists, so it
likely would be possible using the event subsystem to delegate these
events to dedicated per-thread tasks which would address the problem.
But server streams don't get killed often enough to justify such extra
complexity, so better just run the loop under thread isolation.
It also shows that the internal API could probably be improved to
support a lighter thread exclusion instead of full isolation: various
places want to only exclude one thread and here it could work. But
again there's no point doing this for now.
This patch should be backported to all stable branches. It's important
to carefully check that this srv_shutdown_streams() function is never
itself called under isolation in older versions (though at first glance
it looks OK).
MEDIUM: cfgparse: warn about deprecated use of duplicate server names
As discussed below, there are too many problems and limitations caused
by still supporting duplicate server names. That's already particularly
complicated and dissuasive to use since it requires these servers to
have explicit IDs to be accepted. Let's now warn on any duplicate, even
with explicit IDs, and remind users that this will become forbidden in 3.3.
OPTIM: cfgparse: speed up duplicate server detection
Surprisingly, the duplicate server name detection has never made use
of the names tree, so lookups were still in O(N^2). It took 1 second
to validate 50k servers spread into 25 backends at 2k per backend.
By simply using the tree (and since the current server is already in
the tree), we just have to walk using ebpt_prev_dup() to visit previous
servers with the same name. We can then detect which ones conflict
without having an ID set, and report an error. The config check time is
now a quarter of the previous one for 2k servers per backend, and more
importantly it will make it simpler to check for any duplicates later.
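The walk is a simple backward iteration over the duplicates of the current
node (a sketch; the conflict handling itself is elided):

  /* the current server is already indexed; visit earlier servers
   * registered under the same name.
   */
  struct ebpt_node *node = &newsrv->conf.name;

  while ((node = ebpt_prev_dup(node)) != NULL) {
      struct server *other = container_of(node, struct server, conf.name);
      /* 'other' bears the same name: a conflict unless IDs are set */
  }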
MEDIUM: cfgparse: drop duplicate named defaults sections after use
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.
However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.
What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.
This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.
Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
MINOR: proxy: add a list of orphaned defaults sections
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, that are implicitly
inherited by a proxy and hence have a non-zero refcount while also having
a name. This makes it possible to remove them from the name index while
still keeping their memory around for the lifetime of the process, and to
clean it up at the end.
BUG/MINOR: cfgparse: detect another uncaught case of duplicate defaults
The following sequence was not properly caught:
defaults def
backend back from def
defaults def
But this one was:
defaults def
defaults def
backend back from def
Let's check when defaults are declared that they're not already
referenced.
Better not backport this. While it will catch broken configs (possibly
some with backends pasted after the wrong defaults), these might still
work by accident. It may be reported as a diag warning though.
CLEANUP: cfgparse: factor proxy vs log-forward collisions
This simplifies the check added in 1a38684fbc ("MEDIUM: cfgparse:
detect collisions between defaults and log-forward") by factoring it
with the other existing one.
The tests in that code are ugly because a first block tests pure
proxies, a second one proxies or defaults, and inside that one we
have special cases for defaults. Let's just move the tests to the
"any proxy type" block.
MINOR: proxy: use the global file names for conf->file
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().
As a bonus side effect, a 300k backend config saved 14 MB of RAM.
CLEANUP: stick-table: make the file location point to a global file name
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.
In order to stop this, let's first add a simple tool that registers
short strings inside a global list, these strings happening to be file
names. The strings are either duplicated and stored upon failure to find
them, or simply reused from this storage. Since file names are not
expected to disappear before the end of the process, for now we don't
even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed, it's a good
habit to get used to doing it.
The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.
For now they're not used yet and are released on deinit().
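Typical usage then looks like this (a sketch; the exact signatures are
assumed from the description above):

  /* store a deduplicated file name whose storage lives until deinit() */
  curproxy->conf.file = copy_file_name(file);

  /* later: reset the pointer without free(), replacing ha_free() */
  drop_file_name(&curproxy->conf.file);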
Released version 3.1-dev8 with the following main changes :
- DOC: configuration: place the HAPROXY_HTTP_LOG_FMT example on the correct line
- MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state
- BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only
- REGTESTS: h1/h2: Update script testing H1/H2 protocol upgrades
- BUG/MEDIUM: clock: detect and cover jumps during execution
- BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg()
- BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg}
- BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
- MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format
- BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state
- BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established
- REGTESTS: fix random failures with wrong_ip_port_logging.vtc under load
- BUG/MINOR: pattern: do not leave a leading comma on "set" error messages
- REGTESTS: shorten a bit the delay for the h1/h2 upgrade test
- MINOR: server: allow init-state for dynamic servers
- DOC: server: document what to check for when adding new server keywords
- MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option
- BUG/MINOR: polling: fix time reporting when using busy polling
- BUG/MINOR: clock: make time jump corrections a bit more accurate
- BUG/MINOR: clock: validate that now_offset still applies to the current date
- BUG/MEDIUM: queue: implement a flag to check for the dequeuing
- OPTIM: sample: don't check casts for samples of same type
- OPTIM: vars: remove the unneeded lock in vars_prune_*
- OPTIM: vars: inline vars_prune() to avoid many calls
- MINOR: vars: remove the emptiness tests in callers before pruning
- IMPORT: import cebtree (compact elastic binary trees)
- OPTIM: vars: use a cebtree instead of a list for variable names
- OPTIM: vars: use multiple name heads in the vars struct
- BUG/MINOR: peers: local entries updates may not be advertised after resync
- DOC: config: Explicitly list relaxing rules for accept-invalid-http-* options
- MINOR: proxy: Rename accept-invalid-http-* options
- DOC: configuration: Remove dangerous directives from the proxy matrix
- BUG/MEDIUM: sc_strm/applet: Wake applet after a successfull synchronous send
- BUG/MEDIUM: cache/stats: Wait to have the request before sending the response
- BUG/MEDIUM: promex: Wait to have the request before sending the response
- MINOR: clock: test all clock_gettime() return values
- MEDIUM: clock: collect the monotonic time in clock_local_update_date()
- MEDIUM: clock: opportunistically use CLOCK_MONOTONIC for the internal time
- MEDIUM: clock: use the monotonic clock for idle time calculation
- MEDIUM: clock: don't compute before_poll when using monotonic clock
- BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection
- BUG/MINOR: fix missing "'option httpslog' overrides previous 'option tcplog clf'..." detection
- BUG/MINOR: cfgparse-listen: fix option httpslog override warning message
- BUG/MINOR: cfgparse: detect incorrect overlap of same backend names
- MEDIUM: cfgparse: warn about proxies having the same names
- DOC: management: add init-state to add server keywords
- BUG/MINOR: mux-quic: report glitches to session
- BUILD: cebtree: silence a bogus gcc warning on impossible code paths
- MEDIUM: cfgparse: warn about colliding names between defaults and proxies
- MEDIUM: cfgparse: detect collisions between defaults and log-forward
MEDIUM: cfgparse: detect collisions between defaults and log-forward
Sadly, when log-forward sections were introduced, great care was taken to
avoid collisions with regular proxies, but defaults sections were missed
(they need to be explicitly checked for). So now we have to emit a warning
for 3.1 instead of rejecting such collisions.
MEDIUM: cfgparse: warn about colliding names between defaults and proxies
In order to complete the checks added in 303a66573d ("MEDIUM: cfgparse:
warn about proxies having the same names"), we also need to warn about
regular proxies having the same name as defaults sections as well as
defaults sections having the same name as proxies, since defaults
sections are inherently proxies, albeit stored in a separate list for
now.
BUILD: cebtree: silence a bogus gcc warning on impossible code paths
gcc-12 and above report a wrong warning about a negative length being
passed to memcmp() on an impossible code path when built at -O0. The
pattern is the same at a few places, basically:
int foo(int op, const void *a, const void *b, size_t size, size_t arg)
{
if (op == 1) // arg is a strict multiple of size
return memcmp(a, b, arg - size);
return 0;
}
...
int bar()
{
return foo(0, a, b, sizeof(something), 0);
}
It *might* be possible to invent dummy values for the "len" argument
above in the real code, but that significantly complicates it and, as
usual, can easily result in introducing undesired bugs.
Here we take a different approach consisting in shutting the
-Wstringop-overread warning on gcc>=12 at -O0 since that's the only
condition that triggers it. The issue was reported to and confirmed by
the gcc team here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114622
No backport needed, but this should be upstreamed into cebtree after
checking that all involved macros are available.
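The workaround essentially amounts to this (a sketch; the exact guard used
in the imported code may differ):

  #if defined(__GNUC__) && (__GNUC__ >= 12) && !defined(__OPTIMIZE__)
  /* gcc >= 12 at -O0 emits a bogus -Wstringop-overread on this code */
  #pragma GCC diagnostic ignored "-Wstringop-overread"
  #endif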
BUG/MINOR: mux-quic: report glitches to session
A glitch counter was implemented for QUIC/HTTP3. The counter is stored in
the QCC MUX connection instance. However, it is never reported at the
session level, which is necessary if the glitch counter is tracked via a
stick-table.
To fix this, use session_add_glitch_ctr() in the various QUIC MUX functions
which may increment the glitch counter.
MEDIUM: cfgparse: warn about proxies having the same names
As discussed below, there are too many problems and uncaught bugs
in the parser when trying to support proxies having similar names
but different types. There's specific code to detect the presence
of stick-tables in a pair of such proxies for example. It's even
possible that certain combinations of backend+listen that were not
previously detected have some nasty side effects.
According to the proposal in the discussion, this is now deprecated
in 3.1 (thus we emit a warning) and will become forbidden in 3.3.
A backport might be useful, but reporting a diag_warning only, not a
classical warning, so as not to break setups running in zero-warning
mode.
It was verified with a config involving all 9 combinations of
(frontend, backend, listen) followed by one of the same three that all
collisions are now properly blocked, and that only backend followed by
frontend is kept and emits a warning.
BUG/MINOR: cfgparse: detect incorrect overlap of same backend names
As reported below, it's possible to declare a backend then a "listen"
proxy with the same name, because for the latter we only check the
frontend capability (the first one to be tested):
backend b
listen b
bind :8888
Let's check the two capabilities in this case and not just the frontend.
Better not backport this, as there's a risk of breakage of existing
setups that work by accident. It might make sense to report them as
diag warnings though.
BUG/MINOR: cfgparse-listen: fix option httpslog override warning message
The "option httpslog" override warning message used to be reported as
"option httplog", probably as a result of a copy-paste without adjusting
the context. Let's fix that to prevent emitting confusing warning messages.
The issue exists since 98b930d ("MINOR: ssl: Define a default https log
format"), thus it should be backported up to 2.6.
BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection
In commit fd48b28315 ("MINOR: Implements new log format of option tcplog clf"),
"option tcplog clf" detection was correctly added for "option tcplog" and
"option httplog", but the "log-format" case was overlooked. Thus, a config
combining "option tcplog clf" with a subsequent "log-format" directive would
report an erroneous warning message.
MEDIUM: clock: don't compute before_poll when using monotonic clock
There's no point keeping both clocks up to date; if the monotonic clock
is ticking, let's just refrain from updating the wall clock one before
polling since we won't use it. We still do it after polling, however, as
we need a wall clock time to communicate with the outside world.
This saves one gettimeofday() call per loop and two timeval comparisons.
MEDIUM: clock: use the monotonic clock for idle time calculation
By just keeping a copy of the last known value before entering
polling, we can apply the same algorithm as we're currently using,
except that it's now applied to the monotonic clock instead of the
wall clock, when the former is detected to be ticking. This improves
idle time calculation accuracy by making it independent of the
wall clock.
MEDIUM: clock: opportunistically use CLOCK_MONOTONIC for the internal time
We already collect CLOCK_MONOTONIC when it's available when leaving the
poller, but it's only used for profiling. The functions that return it
set the value to zero when it's not available, so we can use that to
detect whether it works or not. The idea is that if the monotonic time is
non-zero, it is ticking and usable, so we use it for now_ns; otherwise
we use the corrected date. We continue to apply the now_offset to the
returned value because it helps force an early time wrap-around.
Proceeding like this presents two benefits:
- on systems supporting this, the time is much more robust against
time changes
- when it works, it saves us from having to go through the time
correction code, which is usually cheap, but better avoided anyway.
Note that idle time calculation continues to rely on the wall-clock
time.
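Schematically, the selection works like this (a sketch; now_mono_time()
returning zero means the monotonic clock is unusable, and the fallback
helper name is hypothetical):

  uint64_t mono = now_mono_time();    /* 0 if CLOCK_MONOTONIC is unusable */

  if (mono)
      now_ns = mono + now_offset;     /* ticking: immune to wall-clock jumps */
  else
      now_ns = corrected_wall_date(); /* hypothetical: usual correction path */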
MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest point
to the poller, which is also used when busy-polling, and the value is
stored into the thread's curr_mono_time, which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.
Doing so will make sure that the local time always knows the monotonic
time when it is available.
MINOR: clock: test all clock_gettime() return values
Till now we were only using clock_gettime() for profiling, so if it
failed it was no big deal. We intend to use it as the main clock
as well now, so we need to detect its absence or failure more reliably
and gracefully fall back to other options. Without the test we would
return whatever was present on the stack, which is neither clean nor easy
to detect.
BUG/MEDIUM: promex: Wait to have the request before sending the response
It is similar to the previous fix about the stats applet ("BUG/MEDIUM:
cache/stats: Wait to have the request before sending the response").
However, for promex, there is no crash and no obvious issue; it depends
on the filter. Indeed, the request is used by promex regardless of whether
it was considered forwarded or not. So if it is modified by the filter,
the modifications are simply ignored.
Same bug, same fix. We now wait for the request to be forwarded before
processing it and producing the response.
BUG/MEDIUM: cache/stats: Wait to have the request before sending the response
It seems obvious. In a classical workflow, the request header analysis is
finished when these applets are woken up for the first time. So they don't
check that the request is really available before starting to process it
and sending the response. But with a filter, it is possible to stop the
request analysis after the applet's creation.
If this happens for the stats applet, it leads to a crash because we
retrieve the request start-line without checking that it is available. For
the cache applet, the response is just sent immediately, which is a problem
if compression is enabled. In that case too, this may lead to a crash
because the compression may be enabled but not initialized.
For a true server, there is no issue because the connection cannot be
established. The server is chosen only after the request analysis. The issue
with applets is that once created, an applet is quickly switched to the
established state. So it is probably a point that must be carefully reviewed
and probably reworked.
In the meantime, as a fix, the cache and stats applets now just take
care to have the request before sending the response. This will do the
trick.
The patch must be backported as far as 2.6. On 2.6, it must be adapted.
BUG/MEDIUM: sc_strm/applet: Wake applet after a successfull synchronous send
On a synchronous send from the stream to an applet, if some data were sent,
we must take care to wake the applet up. This is important because if
everything was sent at this stage, there is no other chance to wake the
applet up, mainly because the SE_FL_WAIT_DATA flag is set on the applet's
sedesc in sc_update_tx() at the end of process_stream(). This flag prevents
any wakeup of the applet for a send event.
It is not necessary for a mux because the mux stream is called when a
synchronous send from the stream is performed, so it is responsible for
waking the mux connection up if necessary.
DOC: configuration: Remove dangerous directives from the proxy matrix
For now, that only concerns accept-invalid-http-{request/response} and
accept-unsafe-violations-in-http-{request/response}. But the idea is to make
dangerous directives hard to find. It is one more way to discourage anyone
from using them. And, incidentally, it keeps the matrix aligned on 80
columns.
With these options, it is possible to accept some invalid messages that may
be considered unsafe and may result in vulnerabilities. The naming was not
explicit enough on this point. These options must really be considered
dangerous and only be used as a temporary workaround. Unfortunately, when
they are used, it is probably because some legacy and unsupported
applications are in place. Never mind. The documentation warns about the use
of these options. Now the name of the options itself is a warning.
So now, "accept-invalid-http-request" and "accept-invalid-http-response"
options are deprecated and replaced by
"accept-unsafe-violations-in-http-request" and
"accept-unsafe-violations-in-http-response" options.
DOC: config: Explicitly list relaxing rules for accept-invalid-http-* options
From time to time, new exceptions are added to the HTTP parsing (most of
the time in H1) to avoid rejecting some invalid messages sent by legacy
applications. But the documentation of the accept-invalid-http-request and
accept-invalid-http-response options was not terribly clear. So now there
is an explicit list of relaxing rules for both options.
BUG/MINOR: peers: local entries updates may not be advertised after resync
Since commit 864ac3117 ("OPTIM: stick-tables: check the stksess without
taking the read lock"), when entries for a local table are learned from
another peer upon resync, and this is the only peer haproxy speaks to,
local updates on such entries are not advertised to the peer anymore,
until they eventually expire and can be recreated upon local updates.
This is due to the fact that ts->seen is always set to 0 when creating a
new entry, and also when touch_remote is performed on the entry.
Indeed, while 864ac3117 attempts to avoid useless updates, it didn't
consider entries learned from a remote peer. Such entries are exclusively
learned in peer_treat_updatemsg(): once the entry is created (or updated)
with new data, touch_remote is used to commit the change. However, unlike
touch_local, entries committed using touch_remote will not be advertised
to the peer from which the entry was just learned (otherwise we would
enter a looping situation). Due to the above patch, once an entry is
learned from the (unique) remote peer, 'seen' will be stuck at 0, so it
will never be advertised for its whole lifetime.
Instead, when entries are learned from a peer, we should consider that
the peer that taught us the entry has seen it.
To do this, let's set seen=1 in peer_treat_updatemsg() after calling
touch_remote(). This way, if we happen to perform updates on this entry,
it will be properly advertised to relevant peers. This patch should not
affect the performance gain documented in 864ac3117, given that the test
scenario didn't involve entries learned from remote peers, but solely
locally created entries advertised to remote peers upon updates.
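The fix is essentially a one-liner in peer_treat_updatemsg() (a sketch;
the surrounding details are elided):

  /* commit the learned entry; this won't advertise it back to the teacher */
  stktable_touch_remote(st->table, ts, 1);
  /* the teaching peer obviously knows the entry: mark it as seen so
   * that future local updates get advertised again
   */
  HA_ATOMIC_STORE(&ts->seen, 1);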
OPTIM: vars: use multiple name heads in the vars struct
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.
We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
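Root selection then is a cheap mask on the existing hash (a sketch; the
field name and root count are illustrative):

  /* four roots, indexed on the low bits of the 64-bit name hash */
  struct ceb_node **root = &vars->name_root[name_hash & 3];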
OPTIM: vars: use a cebtree instead of a list for variable names
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.
But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:
  for i in {0..100}; do
      printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
      for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
      echo;
  done
The performance on a 4-core Skylake at 4.4 GHz reaches 85k RPS, with a perf
profile dominated by this variable name lookup.
The performance degrades a bit earlier than with the list, however. What
can be seen is that the performance maintains a plateau till 25 vars,
then starts degrading a little bit for the tree, while it remains stable
till 28 vars for the list. Both curves cross at 42 vars, and the list
continues to degrade along a hyperbola while the tree resists better. The
biggest loss is at around 32 variables, where the list stays 10% higher.
Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.
In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
IMPORT: import cebtree (compact elastic binary trees)
This is an import of the compact elastic binary trees at commit a9cd84a
("OPTIM: descent: better prefetch less and for writes when deleting").
These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).
The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
MINOR: vars: remove the emptiness tests in callers before pruning
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
OPTIM: vars: remove the unneeded lock in vars_prune_*
vars_prune() and vars_prune_all() take the variable lock while purging
all variables from a head. However this is not needed:
- proc scope variables are only purged during deinit, hence no lock
is needed ;
- all other scopes are attached to entities bound to a single thread
so no lock is needed either.
Removing the lock saves about 0.5% CPU on variables-intensive setups,
but above all it simplifies the code, so let's do it.
OPTIM: sample: don't check casts for samples of same type
Originally, when converters were created, they were mostly for casting
types. Nowadays we have many arithmetic converters to perform operations
on integers, and a number of converters operating on strings. Both of
these categories most often do not need any cast since the input and
output types are the same, which is visible as the cast function being
c_none. However, profiling shows that when heavily using arithmetic
converters, it's possible to spend up to ~7% of the time in
sample_process_cnv(), a good part of which is only in accessing the
sample_casts[] array. Simply avoiding this lookup when input and output
types are equal saves about 2% CPU on such setups making intensive use
of converters.
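The gist of the change is a short-circuit like this (a sketch of the
relevant test in sample_process_cnv(), not the literal code):

  /* same input/output type: skip the cast table lookup entirely */
  if (smp->data.type != conv_expr->conv->in_type) {
      sample_cast_fct cast = sample_casts[smp->data.type]
                                         [conv_expr->conv->in_type];
      if (!cast || !cast(smp))
          return 0;
  }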
BUG/MEDIUM: queue: implement a flag to check for the dequeuing
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.
As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeuing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially expected to be extremely
rare (and still is, except that now it can become sticky). Previously
what happened is that such queued requests would just time out, and
since that was very rare, nobody would notice.
The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time, but it does more than that: it
also prevents a thread from dequeuing while another one is in the
process of queuing. We need a different criterion.
What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.
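In pseudo-code, the new serialization looks like this (a sketch; the flag
name and the dequeue body are illustrative):

  /* only one thread may run the dequeuing loop at a time */
  if (!HA_ATOMIC_XCHG(&srv->dequeuing, 1)) {
      /* we own the dequeue: grab the queue's lock and dequeue pending
       * requests; pendconn_add() takes the same lock, so no addition
       * can be missed.
       */
      dequeue_pending_requests(srv); /* hypothetical body */
      HA_ATOMIC_STORE(&srv->dequeuing, 0);
  }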
It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.
Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
that allowed us to spot the problem.
This patch should be backported wherever the patch above was backported.
BUG/MINOR: clock: validate that now_offset still applies to the current date
We want to make sure that now_offset is still valid for the current
date: another thread could very well have updated it after detecting a
backwards jump, while at the very same moment the time got fixed again;
we would then retrieve that fixed time and add it to the new offset,
which results in a larger jump. Normally, for this to happen, it would
mean that before_poll was also affected by the jump, detected earlier,
and bounded within 2 seconds, resulting in at most 2 seconds of
perturbation.
Here we try to detect this situation and fall back to re-adjusting the
offset instead.
It's more of a strengthening of what's done by commit e8b1ad4c2b
("BUG/MEDIUM: clock: also update the date offset on time jumps") than a
pure fix, in that the issue was not directly observed but is visibly
possible from reading the code, so this should be backported along with
the patch above. This is related to issue GH #2704.
Note that this could be simplified in terms of operations by migrating
the deadlines to nanoseconds, but this was the path to least intrusive
changes.
BUG/MINOR: clock: make time jump corrections a bit more accurate
Since commit e8b1ad4c2b ("BUG/MEDIUM: clock: also update the date offset
on time jumps") we try to update the now_offset based on the last known
valid date. But if it's off compared to the global_now_ns date shared
by other threads, we'll get the time off a little bit. When this happens,
we should consider the most recent of these dates, so that if the global
date was already known to be more recent, we use it and stick to it. This
avoids setting too large an offset, which could in turn provoke a larger
jump on another thread.
This is related to issue GH #2704.
This can be backported to other branches having the patch above.
BUG/MINOR: polling: fix time reporting when using busy polling
Since commit beb859abce ("MINOR: polling: add an option to support
busy polling") the time and status passed to clock_update_local_date()
were incorrect. Indeed, what is considered is the before_poll date
related to the configured timeout which does not correspond to what
is passed to the poller. That's not correct because before_poll+the
syscall's timeout will be crossed by the current date 100 ms after
the start of the poller. In practice it didn't happen when the poller
was limited to 1s timeout but at one minute it happens all the time.
That's particularly visible when running a multi-threaded setup with
busy polling and only half of the threads working (bind ... thread even).
In this case, the fixup code of clock_update_local_date() is executed
for each round of busy polling. The issue was made really visible
starting with recent commit e8b1ad4c2b ("BUG/MEDIUM: clock: also
update the date offset on time jumps") because upon a jump, the
shared offset is reset, while it should not be in this specific
case.
What needs to be done instead is to pass the configured timeout of
the poller (and not of the syscall), and always pass "interrupted"
set so as to claim we got an event (which is sort of true as it just
means the poller returned instantly). In this case we can still
detect backwards/forward jumps and will use a correct boundary
for the maximum date that covers the whole loop.
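In other words, the busy-polling path now reports something like this (a
sketch; clock_update_local_date() takes the timeout and an "interrupted"
flag):

  /* claim an event occurred (the poller returned instantly) and use the
   * poller's configured timeout as the bound for jump detection
   */
  clock_update_local_date(max_wait, 1);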
This can be backported to all versions since the issue was introduced
with busy-polling in 1.9-dev8.
MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option
Since 2.6, a parsing error is reported when the chunked encoding is
found twice. As stated in RFC 9112, a sender must not apply the chunked
transfer coding more than once to a message body; it means only one chunked
coding must be found. In addition, empty values are also rejected because
this is forbidden by RFC 9110.
However, in both cases, it may be useful to relax the rules for trusted
legacy servers when the accept-invalid-http-response option is set,
especially because this was accepted on 2.4 and older. In addition, the T-E
header is now sanitized before sending it. That is not a problem because it
is a hop-by-hop header.
Note that it remains invalid on the client side because there is no good
reason to relax the parsing on this side. We can argue a server is trusted,
so we can decide to support some legacy behavior. That is not true on the
client side, and it is highly suspicious if a client sends an invalid T-E
header.
Note also that we continue to reject unsupported T-E values (all codings
except "chunked"). Because the "TE" header is sanitized and cannot contain
any value other than "Trailers", there is absolutely no reason for a server
to use something else.
This patch should fix the issue #2677. It could probably be backported as
far as 2.6 if necessary.
DOC: server: document what to check for when adding new server keywords
It's too easy to overlook the dynamic servers when adding new server
keywords, and the fields on each keyword line are totally obscure. This
commit adds a title to each column of the table and explains what is
expected and what to check for when adding a keyword.
MINOR: server: allow init-state for dynamic servers
Commit 50322df introduced the init-state keyword, but it didn't enable
it for dynamic servers. However, this feature is perfectly desirable
for them too, where someone would like a server brought back to life
through "set server be1/srv1 state ready" to be taken out of maintenance
in the down state until the next health check succeeds.
From reading the code, it seems that it's only a matter of allowing this
keyword for dynamic servers, as the current code path calls
srv_adm_set_ready(), which incidentally triggers a call to
_srv_update_status_adm().
REGTESTS: shorten a bit the delay for the h1/h2 upgrade test
Commit d6c4ed9a96 ("REGTESTS: h1/h2: Update script testing H1/H2
protocol upgrades") introduced a 0.5 second delay, which is higher
than those of most other tests (usually 0.05 or 0.2) and triggers
timeouts on my side. Let's just shorten it to 0.2 since its goal
is only to send data separately.
Note: maybe a barrier approach would be possible, though not
studied.
BUG/MINOR: pattern: do not leave a leading comma on "set" error messages
Commit 4f2493f355 ("BUG/MINOR: pattern: pat_ref_set: fix UAF reported by
coverity") dropped the condition to concatenate error messages and as
such introduced a leading comma in front of all of them. Then commit
911f4d93d4 ("BUG/MINOR: pattern: pat_ref_set: return 0 if err was found")
changed the behavior to stop at the first error anyway, so all the
mechanics dedicated to the concatenation of error messages are no longer
needed and we can simply return the error as-is, without inserting any
comma.
This should be backported where the patches above are backported.
REGTESTS: fix random failures with wrong_ip_port_logging.vtc under load
This test has an expect rule for syslog that looks for [cC]D, to
indicate a client abort or timeout during the data phase. The purpose
was to say that when it fails it must be this, but the very low timeout
(1ms) still makes it prone to succeeding if the machine is highly loaded.
This has become more visible since commit e8b1ad4c2b ("BUG/MEDIUM: clock:
also update the date offset on time jumps") because the clock drift
adjustments are more systematic. Since this commit, running 50 such tests
with a parallelism of twice the number of CPUs is sufficient to yield
errors due to some lines appearing as succeeding:
make reg-tests -- --j $((($(nproc)+1)*2)) --vtestparams -n50 reg-tests/log/wrong_ip_port_logging.vtc
It was observed that pauses of up to 300ms occurred in epoll_wait() in
such circumstances, which were properly fixed by the time drift detection.
Another approach would consist in increasing the permitted margin during
which we don't fix the clock drift, but that would not be logical since the
base time had really been awaited for.
This should be backported to all stable releases since the commit above
will trigger the issue more often.
BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established
When a 200-OK response is replied to a CONNECT request, or on a
101-Switching-protocol response, a tunnel is considered established between
the client and the server. However, we must not declare the response as
bodyless. Of course, there is no payload, but tunneled data are expected.
Because of this bug, the zero-copy forwarding was disabled on the server
side.
BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state
When the mux is woken up on I/O events, if the zero-copy forwarding is
enabled, receives are blocked. In this case, the SC is woken up to be able
to perform 0-copy forwarding to the other side. This works well, except for
the H1C in CLOSING state.
Indeed, in that case, in h1_process(), the SC is not woken up because only
RUNNING H1 connections are considered. As consequence, the mux will ignore
connection closure. The H1 connection remains blocked, waiting for the
shutdown timeout. If no timeout is configured, the H1 connection is never
closed leading to a leak.
This patch should fix leak reported by Damien Claisse in the issue #2697. It
should be backported as far as 2.8.
MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format
The new "dump ssl cert" CLI command allows dumping a certificate stored
in HAProxy memory. Until now it was only possible to dump the
description of the certificate using "show ssl cert", but with this new
command you can dump the PEM content to the filesystem.
This command is only available on an admin stats socket.
BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
Since c5959fd ("MEDIUM: pattern: merge same pattern"), a UAF (leading to
a crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr, which the second default section relies on, is mistakenly freed.
This issue was discovered while trying to reproduce GH #2708. It was
particularly tricky to reproduce given the config and sequence required
to make the UAF happen. Fortunately, Github user @asmnek not only
provided useful information, but since he was able to consistently
trigger the crash in his environment, he was able to nail the crash down
to the use of a pattern file involved with 2 named default sections. Big
thanks to him.
To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on the "do_free" variable to know if the expression should be
freed or not (which proved to be insufficient in our case), let's switch
to simple refcounting logic. This way, no matter who owns the expression,
the last one attempting to free it will be responsible for actually doing
so. The refcount is implemented using a 32-bit value which fills a previous
4-byte hole in the structure:
  int                        mflags;           /*    80     4 */

  /* XXX 4 bytes hole, try to pack */

  long unsigned int          lock;             /*    88     8 */

(output from pahole)
Even though it was not reproduced on 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theoretically affects all
stable versions (up to c5959fd), thus it should be backported to all of
them.
BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg}
Using valgrind when running map_beg or map_str, the following error is
reported:
==242644== Conditional jump or move depends on uninitialised value(s)
==242644== at 0x2E4AB1: pat_match_str (pattern.c:457)
==242644== by 0x2E81ED: pattern_exec_match (pattern.c:2560)
==242644== by 0x343176: sample_conv_map (map.c:211)
==242644== by 0x27522F: sample_process_cnv (sample.c:1330)
==242644== by 0x2752DB: sample_process (sample.c:1373)
==242644== by 0x319917: action_store (vars.c:814)
==242644== by 0x24D451: http_req_get_intercept_rule (http_ana.c:2697)
In fact, the error is legit, because in pat_match_{beg,str} we
dereference the buffer at len+1 to check if a value was previously set,
and then decide to force a NULL byte if it wasn't set.
But the approach is no longer compatible with current architecture:
data past str.data is not guaranteed to be initialized in the buffer.
Thus we cannot dereference the value, else we expose us to uninitialized
read errors. Moreover, the check is useless, because we systematically
set the ending byte to 0 when the conditions are met.
Finally, restoring the older value after the lookup is not relevant:
either the sample is marked as const, in which case it was already
duplicated, or the sample is not const and we forcefully add a
terminating NULL byte outside of the actual string bytes (since we're
past str.data). As we didn't alter the effective string data, and data
past str.data cannot be dereferenced anyway since it isn't guaranteed to
be initialized, there is no point in restoring previous uninitialized data.
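The resulting logic boils down to this (SMP_F_CONST and the string layout
follow the sample API; the surrounding context is simplified):
  /* Only write the terminating byte when the sample is writable; const
   * samples were duplicated beforehand, so they are never touched. */
  if (!(smp->flags & SMP_F_CONST))
      smp->data.u.str.area[smp->data.u.str.data] = '\0';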
It could be backported in all stable versions. But since this was only
detected by valgrind and isn't known to cause issues in existing
deployments, it's probably better to wait a bit before backporting it
to avoid any breakage, although the fix should theoretically be harmless.
BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg()
This is a complementary patch to a68affeaa ("BUG/MINOR: pattern: a sample
marked as const could be written"). Indeed the same logic from
pat_match_str() is used there, but we lack the check to ensure that the
sample is not const before writing data to it.
BUG/MEDIUM: clock: detect and cover jumps during execution
After commit e8b1ad4c2 ("BUG/MEDIUM: clock: also update the date offset
on time jumps"), @firexinghe mentioned that the issue was still present
in their case. In fact it depends on the load, which affects the
probability that the time changes between two poll() calls vs that it
changes during poll(). The time correction code used to only deal with
the latter. But under load if it changes between two poll() calls, what
happens then is that before_poll is off, and after returning from poll(),
the date is within bounds defined by before_poll, so no correction is
applied.
After many tests, it turns out that the most reliable solution without
using CLOCK_MONOTONIC is to prevent before_poll from being earlier than
the previous after_poll (trivial), and to cover forward jumps, we need
to enforce a margin. Given that the watchdog kills a looping task within
2 seconds and that no sane setup triggers it, it seems that 2 seconds
remains a safe enough margin. This means that in the worst case, some
forward jumps of up to 2 seconds will not be corrected, leading to an
apparent fast time and low rates. But this is supposed to be an exceptional
event anyway (typically an admin or crontab running ntpdate).
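In pseudo-C, the bounding can be sketched like this (variable names are
simplified, not the exact clock code):
  /* Backward jumps: before_poll may never precede the previous after_poll. */
  if (before_poll < prev_after_poll)
      before_poll = prev_after_poll;

  /* Forward jumps: allow the poller's max wait time plus a 2s margin;
   * beyond that, treat it as a time jump and resync the offset. */
  if (after_poll - before_poll > max_wait_ns + 2000000000ULL)
      resync_time_offset();   /* illustrative correction helper */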
For future versions, given that we now opportunistically call
now_mono_time() before and after poll(), which returns zero if not
supported, we could imagine relying on it for the thread's local
time whenever it's non-null.
"http-messaging/protocol_upgrade.vtc" script was updated to test upgrades
for requests with a payload. It should fail when the request is sent to a H2
server. When sent to a H1 server, it should succeed, except if the server
replies before the end of the request.
BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only
Since 1d2d77b27 ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade
requests with a body"), it is no longer possible to perform a protocol
upgrade for requests with a payload. The main reason was to be able to
support protocol upgrade for H1 client requesting a H2 server. In that case,
the upgrade request is converted to a CONNECT request. So, it is not
possible to convey a payload in that case.
But it is a problem for anyone wanting to perform upgrades on an H1 server
using requests with a payload. It is uncommon but valid. So, now, it is the
H2 multiplexer's responsibility to reject upgrade requests on the server
side if there is a payload. An INTERNAL_ERROR is returned for the H2S in
that case. On the H1 side, the upgrade is now allowed, but only if the
server waits for the end of the request to return the 101-Switching-protocol
response. Indeed, it is quite hard to synchronise the frontend side and the
backend side in that case. Asking servers to fully consume the request
payload before returning the response seems reasonable.
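On the H2 side the check can be sketched as follows (the error code is the
real one; the condition helpers are illustrative):
  /* Reject an upgrade request that still carries a payload: once it is
   * converted to a CONNECT, the remaining body cannot be conveyed. */
  if (req_is_protocol_upgrade(htx) && req_has_payload(htx))
      h2s_error(h2s, H2_ERR_INTERNAL_ERROR);  /* resets the H2 stream */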
This patch should fix the issue #2684. It could be backported after a period
of observation, as far as 2.4 if possible. But only if it is not too
hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both
side are in DONE state".
MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state
For now, this case is already handled for all requests except for those
waiting for a tunnel establishment (CONNECT and protocol upgrades). It is
not an issue because only bodyless requests are supported in these cases. So
the request is always finished at the end of headers and therefore before
the response.
However, to relax conditions for full H1 protocol upgrades (H1 client and
server), this case will be necessary. Indeed, the idea is to be able to
perform protocol upgrades for requests with a payload. Today, the "Upgrade:"
header is removed before sending the request to the server. But to support
this case, this patch is required to properly finish the transaction when
the server does not perform the upgrade.
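The added condition may be sketched like this (field names follow the H1
mux, the surrounding context is simplified):
  /* Both the request and the response reached the DONE state: report
   * end-of-input on the stream-endpoint descriptor. */
  if (h1s->req.state == H1_MSG_DONE && h1s->res.state == H1_MSG_DONE)
      se_fl_set(h1s->sd, SE_FL_EOI);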
DOC: configuration: place the HAPROXY_HTTP_LOG_FMT example on the correct line
When HAPROXY_HTTP_LOG_FMT was added by commit 537b9e7f36 ("MINOR: config:
add environment variables for default log format"), the example was placed
by accident after the clf log format instead of the HTTP log format,
causing a bit of confusion.
Released version 3.1-dev7 with the following main changes :
- MINOR: config: Created env variables for http and tcp clf formats
- MINOR: mux-quic: add buf_in_flight to QCC debug infos
- MINOR: mux-quic: correct qcc_bufwnd_full() documentation
- MINOR: tools: add helpers to backup/clean/restore env
- MINOR: mworker: restore initial env before wait mode
- BUG/MINOR: haproxy: free init_env in deinit only if allocated
- BUILD: tools: environ is not defined in OS X and BSD
- DEV: coccinelle: add a test to detect unchecked malloc()
- DEV: coccinelle: add a test to detect unchecked calloc()
- CI: QUIC Interop AWS-LC: enable ngtcp2 client
- CI: fix missing comma introduced in 956839c0f68a7722acc586ecd91ffefad2ccb303
- CI: QUIC Interop: do not run bandwidth measurement tests
- CI: QUIC Interop: use different artifact names for uploading logs
- BUILD: quic: 32bits build broken by wrong integer conversions for printf()
- CLEANUP: ssl: cleanup the clienthello capture
- MEDIUM: ssl: capture the supported_versions extension from Client Hello
- MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch
- MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
- MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch
- MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status
- BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding
- BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready
- BUG/MINOR: proxy: Match on 429 status when trying to perform a L7 retry
- CLEANUP: haproxy: fix typos in code comment
- CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
- MINOR: tools: Implement ipaddrcpy().
- MINOR: quic: Implement quic_tls_derive_token_secret().
- MINOR: quic: Token for future connections implementation.
- BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
- MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
- MINOR: quic: Implement qc_ssl_eary_data_accepted().
- MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
- BUG/MEDIUM: quic: always validate sender address on 0-RTT
- BUILD: quic: fix build errors on FreeBSD since recent GSO changes
- MINOR: tools: extend str2sa_range to add an alt parameter
- MINOR: server: add a alt_proto field for server
- MEDIUM: sock: use protocol when creating socket
- MEDIUM: protocol: add MPTCP per address support
- BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)
- MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
- MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit
- BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown
- BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli
- BUG/MINOR: quic: unexploited retransmission cases for Initial pktns.
- BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered
- MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places
- BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
- BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
- BUG/MINOR: Crash on 0-RTT RX packet after dropping Initial pktns
- BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()
- CLEANUP: assorted typo fixes in the code and comments
- DEV: patchbot: count the number of backported/non-backported patches
- DEV: patchbot: add direct links to show only specific categories
- DEV: patchbot: detect commit IDs starting with 7 chars
- BUG/MEDIUM: clock: also update the date offset on time jumps
- MEDIUM: server: add init-state
MEDIUM: server: add init-state
Allow the user to set the "initial state" of a server.
Context:
Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.
This introduces the "init-state {fully-up|up|down|fully-down}"
configuration parameter on the server line.
- when set to 'fully-up', the server is considered immediately available
and can turn to the DOWN state when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
available and will initiate a health check that can turn it to the DOWN
state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
will initiate a health check that can turn it to the UP state immediately
if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
and can turn to the UP state when ALL health checks succeed.
The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.
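For example (a hedged sketch; addresses and names are illustrative):
  backend be_app
      # starts DOWN; the first successful check can turn it UP immediately:
      server srv1 192.0.2.10:8080 check init-state down
      # starts DOWN; turns UP only once ALL (rise) checks have succeeded:
      server srv2 192.0.2.11:8080 check init-state fully-down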
BUG/MEDIUM: clock: also update the date offset on time jumps
In GH issue #2704, @swimlessbird and @xanoxes reported problems handling
time jumps. Indeed, since 2.7 with commit 4eaf85f5d9 ("MINOR: clock: do
not update the global date too often") we refrain from updating the global
offset in case it didn't change. But there's a catch: in case of a large
time jump, if the poller was interrupted, the local time remains the same
and we return immediately from there without updating the offset. It then
becomes incorrect regarding the "date" value, and upon subsequent call to
the poller, there's no way to detect a jump anymore so we apply the old,
incorrect offset and the date becomes wrong. Worse, going back to the
original time (then in the past), global_now_ns remains higher than the
local time and neither gets updated anymore.
What is missing in practice is to immediately update the offset when
detecting a time jump. In an ideal world, the offset would be updated
upon every call, that's what was being done prior to the commit above but
it's extremely CPU intensive on large systems. However we can perfectly
afford to update the offset every time we detect a time jump, as it's
not as common.
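Conceptually (variable names simplified, not the exact clock code):
  /* Upon detecting a jump, resynchronize the global offset right away
   * instead of returning with the stale one: */
  if (time_jump_detected) {
      new_ofs = global_now_ns - local_now_ns;   /* illustrative names */
      HA_ATOMIC_STORE(&now_offset, new_ofs);
  }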
This needs to be backported as far as 2.8. Thanks to both participants
above for providing very helpful details.
DEV: patchbot: count the number of backported/non-backported patches
It's useful to instantly see how many patches of each category have
already been backported and are still pending, let's count them and
report them at the top of the page.
BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()
A regression was introduced in the commit 76fa71f7a ("BUG/MEDIUM: mux-pt:
Never fully close the connection on shutdown") because of a typo on the
connection flags. The CO_FL_SOCK_WR_SH flag must be tested to prevent a call
to conn_sock_shutw(), not CO_FL_SOCK_RD_SH.
Concretely, most of the time it is harmless because the shutdown for writes
is always performed before any shutdown for reads, except in the case
described by the commit above. But it is not clear whether it has an impact
or not.
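In other words, the corrected test looks like this (context simplified):
  /* Shut down for writes only if not already done on the socket: */
  if (!(conn->flags & CO_FL_SOCK_WR_SH))
      conn_sock_shutw(conn, clean);   /* 'clean' as computed by the caller */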
This patch must be backported with the commit above, so as far as 2.9.
BUG/MINOR: Crash on 0-RTT RX packet after dropping Initial pktns
This bug arrived with this naive commit:
BUG/MINOR: quic: Too short datagram during 0-RTT handshakes (aws-lc only)
which omitted to consider the case where the Initial packet number space
could be discarded before receiving 0-RTT packets.
To fix this, append/insert the 0-RTT (early-data) packet number space
into the encryption level list depending on whether or not the Initial
packet number space is still present.
This issue was revealed when using aws-lc as TLS stack in GH #2701 issue.
Thank you to @Tristan971 for having reported this issue.
Must be backported where the commit mentioned above is supposed to be
backported: as far as 2.9.
BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
That's the equivalent of the mux-h2 one, except that here there's no
real risk to loop since normally we cannot feed data that bypass the
closed state check (e.g. no zero-copy forward). But it still remains
dirty to leave an empty mbuf with MFULL and MROOM set, so better clear
them as well.
BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
There exists an extremely tricky code path that was revealed in 3.0 by
the glitches feature, though it might theoretically have existed before.
TL;DR: a mux mbuf may remain flagged full after a GOAWAY was successfully
sent and its remaining contents were discarded without clearing
H2_CF_MUX_MFULL and H2_CF_DEM_MROOM; h2_send() then loops endlessly, until
the watchdog takes care of it.
What can happen is the following: Some data are received, h2_io_cb() is
called. h2_recv() is called to receive the incoming data. Then
h2_process() is called and in turn calls h2_process_demux() to process
input data. At some point, a glitch limit is reached and h2c_error() is
called to close the connection. The input frame was incomplete, so some
data are left in the demux buffer. Then h2_send() is called, which in
turn calls h2_process_mux(), which manages to queue the GOAWAY frame,
turning the state to H2_CS_ERROR2. The frame is sent, and h2_process()
calls h2_send() a last time (doing nothing) and leaves. The streams
are all woken up to notify about the error.
Multiple backend streams were waiting to be scheduled and are woken up
in turn, before their parents are notified, and communicate with the
h2 mux in zero-copy-forward mode: they request a buffer via h2_nego_ff(),
fill it, and commit it with h2_done_ff(). At some point the mux's output
buffer is full, and gets flagged H2_CF_MUX_MFULL.
The io_cb is called again to process more incoming data. h2_send() isn't
called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv()
may or may not do anything (doesn't matter). h2_process() is called since
some data remain in the demux buf. It goes till the end, where it finds
st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where
the mbuf is empty and MFULL is still present.
Then it calls h2_send(), which doesn't call h2_process_mux() due to
MFULL, doesn't enter the for() loop since all buffers are empty, then
keeps sent=0, which doesn't allow the MFULL flag to be cleared, and since
"done" was not reset, it loops forever there.
Note that the glitches make the issue more reproducible but theoretically
it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes
it not happen with the data produced on the parsing side is that we
process a single buffer of input at once, and there's no way to amplify
this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK,
WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is
cleared upon every exit from h2_process() once the error was sent, it is
not possible to accumulate response data across multiple calls. And the
regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not
produce any data there either.
h2_nego_ff() should probably check for H2_CS_ERROR before accepting to
deliver a buffer, but this needs to be carefully studied. In the meantime
the real problem is that the MFULL flag was kept when clearing the buffer,
making the two inconsistent.
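The fix therefore drops the flags together with the buffer contents (the
flag names are the real ones, the surrounding context is simplified):
  /* When the mbuf is cleared, the full/room indications must go with it: */
  h2c->flags &= ~(H2_CF_MUX_MFULL | H2_CF_DEM_MROOM);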
Since it doesn't seem possible to trigger this sequence without the
zero-copy-forward mechanism, this fix needs to be backported as far as
2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM
and MUX_MFULL at more places" which will strengthen the consistency
between these checks.
Many thanks to Annika Wickert for her detailed report that allowed us to
diagnose this problem. CVE-2024-45506 was assigned to this problem.