MINOR: sink: merge sink_forward_io_handler() with sink_forward_oc_io_handler()
Now that sink_forward_oc_io_handler() and sink_forward_io_handler() were
unified again thanks to the previous commit, let's take a chance to merge
code that is common to both functions in order to ease code maintenance.
Let's add _sink_forward_io_handler() internal function which takes the
applet and a message handler as argument: sink_forward_io_handler() and
sink_forward_oc_io_handler() leverage this internal function by passing
the correct message handler for the desired format.
MINOR: sink: Remove useless test on SE_FL_SHR/SHW flags
Re-apply dcd917d972 ("MINOR: applet: Remove uselelss test on SE_FL_SHR/SHW
flags") for sink_forward_oc_io_handler() function as it was probably
overlooked given that sink_forward_oc_io_handler() and
sink_forward_io_handler() follow the same logic.
MINOR: sink: unify and sink_forward_io_handler() and sink_forward_oc_io_handler()
In a739dc2 ("MEDIUM: sink: Use the sedesc to report and detect end of
processing"), we added a drain after close in sink_forward_oc_io_handler()
by the use of "goto out".
However, since we perform a close, there is no reason to drain data from
the socket. Moreover, before the patch there was no drain and nothing
mentioned the fact that that the drain was added on purpose. Lastly,
sink_forward_io_handler() and sink_forward_oc_io_handler() functions are
strictly identical when in comes to processing logic, and the drain was
only added in sink_forward_oc_io_handler() and not in
sink_forward_io_handler().
As such, it's pretty safe to assume that the drain is not needed here
and was added as accident. So in this patch we remove it in an attempt
to unify sink_forward_io_handler() and sink_forward_oc_io_handler()
functions like it was already the case before.
BUG/MEDIUM: sink: properly init applet under sft lock
Since 09d69eacf8 ("MEDIUM: sink: start applets asynchronously") the applet
is no longer initialized under the sft lock while it was the case before.
At first it doesn't seem to be an issue, but if we look closer at
sink_forward_session_init(), we can see that sft->appctx is assigned
while it can be accessed at the same time from sink_init_forward().
Let's restore the old guarantees by performing the .init under the sft
lock.
MEDIUM: spoe/tcpcheck: Reintroduce SPOP check as a customized tcp-check
To be able to retrieve accurrate errors when a SPOP health-check is
performed, a customized tcp-check is used. Indeed, it is not possible to
rely on the SPOP multiplexer for now because the check is performed at the
mux connection layer and the error, if any, cannot be retrieved by the
health-check. A L4 success or error is reported.
To fix this issue and restore the previous behavior, a customized tcp-check
is created. The connection is forced to use the PT multiplexer. An hardcoded
message is sent and a customer handler is used to decode the SPOA response.
This way, it is possible to parse the response and return an accurrate
status code.
MINOR: spoe: Add a function to validate a version is supported
spoe_check_vsn() function can now be used to check if a version, converted
to an integer, via spoe_str_to_vsn() for instance, is supported. To do so,
the list of all supported version is now exported.
Add two initcall callback with BUG_ON_HOT() to newro and cubic modules to
ensure there is no buffer overflow when accessing the private data of
these congestion control algorithm state structures. This is to ensure
that further modifications about these data structures will not
lead to surprises. At this time there is no possible buffer overflow.
DOC: config: Explicitly state the SPOE streams have a usable parent stream
It is explicitly mentionned in the configuration manual that the parent of a
SPOE stream is the filtered stream. It means variables of the filtered
stream are usable from the SPOE stream.
DOC: config: Add info about variable scopes referencing the parent stream
It is now possible for a stream to have a parent and it is also possible to
retrieve variables defined in the parent stream context. To do so, some
extra scopes were introduced. The section 2.8. was updated accordingly.
DOC: config: Add a dedicated section about variables
The variables in the HAProxy configuration are now described in a dedicated
section. Instead of repeating the same description everywhere a variable
name can be used, the section 2.8. is now referenced.
b068e758f MINOR: quic: simplify rescheduling for handshake
This commit introduced a bad side effect. Haproxy always replied by an ACK-only
datagram when it received the first client Initial packet. Then it handled
the CRYPTO data insided. And finally, it sent its own CRYPTO data. This broke
the packet coalescing rule whose aim is to optimally build and send as more
as QUIC packets by datagram.
To fix this, simply partially reverts this commit, to make the low level I/O
task return again if some CRYPTO were received. This will delay the
acknowledgement which will be sent with the CRYPTO data from the same
datagram again.
MEDIUM: spoe: Set the parent stream for SPOE streams
When a SPOE applet is created to send a message to an agent, the parent of
the associated stream is set to the one filtered. And the relationship
between the streams is removed when the applet is released or when the
processing on main stream is finished.
In the mean time, it is possible to get variables of the parent stream from
the SPOE one. It is not a huge change but this will be amazingly useful. For
instance, it is now possible to be sticky on a server using a critera of the
main streem. Here is an example using the client source address:
stick-table type ip size 200k expire 30m
stick on var(ptxn.client_src)
server srv1 ...
server srv2 ...
server srv3 ...
server srv4 ...
Of course, the feature is not limited to stick-tables. Everywhere variables
are used, it is now possible to get the value set on the parent stream from
the SPOE stream.
MEDIUM: vars: Be able to retrieve variable of the parent stream, if any
It is now possible to retrieved the value of a variable using the parent
stream or the parent session instead of the current one. It remains
forbidden to set or unset this value. The sample fetch used to store the
result is a local copy. So it may be safely altered by a converter without
changing the value of the original variable.
Note that for now, the parent of a stream is never set. So this part is not
really used. This will change with the SPOE.
MINOR: vars: Use a variable description to get variables of a specific scope
Now a variable description is retrieved when a variable is parsed, we can
use it to get the variable value. It is mandatory to be able to know the
parent stream, if any, must be used, instead of the current one.
MEDIUM: vars: Be able to parse parent scopes for variables
Add session/stream scopes related to the parent. To do so, "psess", "ptxn",
"preq" or "pres" must be used instead of tranditionnal scopes (without the
first "p"). the "proc" scope is not concerned by this change because it is
not linked to a stream. When such scopes are used, a specific flags is added
on the variable description during the variable parsing.
For now, theses scopes are parsed and the variable description is updated
accordingly. But at the end, any operation on the variable value fails.
MINOR: vars: Use a description to set/unset a variable instead of its hash and scope
Now a variable description is retrieved when a variable is parsed, we can
use it to set or unset the variable value. It is mandatory to be able to
know the parent stream, if any, must be used, instead of the current one.
MINOR: vars: Fill a description instead of hash and scope when a name is parsed
A variable description is now used to parse a variable and extract its name
and its scope. It is mandatory to be able to add some flags on the variable
when it is evaluated (set or get). Among other things, this will be used to
know the parent stream, if any, must be used, instead of the current one.
MINOR: stream: Add a pointer to set the parent stream
A pointer to a parent stream was added in the stream structure. For now,
this pointer is never set, but the idea is to have an access to a stream
environment from another one from the moment there is a parent/child
relationship betwee these streams.
Concretely, for now, there is nothing to formalize this relationship.
BUG/MINOR: cli: Atomically inc the global request counter between CLI commands
The global request counter is used to set the stream id (s->uniq_id). It is
incremented at different places. And it must be atomically incremented
because it is a global value. However, in the analyer dealing with CLI
command response, this was not the case. It is now fixed.
This patch must be backported to all stable versions.
BUG/MINOR: server: Don't warn fallback IP is used during init-addr resolution
When a fallback IP address is provided in the list of methods to use to
resolve the server address, a warning is emitted if previous methods
failed. The aim is to inform this address will be used for the
server. However, it is valid use-case. It is the expected behavior. There is
no reason to emit a warning. Having a message during HAProxy startup to
inform the fallback IP address will be used is probably a good idea. But it
should be a notice not a warning. Otherwise, checking the configuration
validity will always failed, just like starting HAProxy in zero-warning
mode while the option was set on purpose.
This patch should fix the issue #2627. It must be backported to all stable
versions.
BUG/MINOR: stick-table: fix crash for src_inc_gpc() without stkcounter
Since 2.5, an array of GPC is provided to replace legacy gpc0/gpc1.
src_inc_gpc is a sample fetch which is used to increment counters in
this array.
A crash occurs if src_inc_gpc is used without any previous track-sc
rule. This is caused by an error in smp_fetch_sc_inc_gpc(). When
temporary stick counter is created via smp_create_src_stkctr(), table
pointer arg value used is not correct : it points to the counter ID
instead of the table argument. To fix this, use the proper sample fetch
second arg.
This can be reproduced with the following config :
acl mark src_inc_gpc(0,<table>) -m bool
tcp-request connection accept if mark
Nathan Wehrman suggested this add-on to try to better explain the
interactions between http-keep-alive and other timeouts, and the
impacts on protocols (HTTP/1, HTTP/2 etc).
MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list
This commit continues to clean up cfg_parse_global() and to prepare the
refactoring of master-worker mode. Master, after forking a worker, enters in
its wait polling loop to catch signals and to provide master CLI. So, some
poller types could be disabled for master process it as well.
MINOR: cfgparse-global: move mode's keywords in cfg_kw_list
This commit cleans up cfg_parse_global() and prepares the config parser for
master-worker mode refactoring, where daemon and master-worker fork() calls
will happen very early in init().
So, the config in such case should be read twice:
- at first: only some keywords in the global section for the mode discovery
and everything, which is related to master process by opportunity;
- at second: except the master process, all other keywords would be parsed;
Let's check the second time a global counter of "ha_warning" messages, if
zero-warning is set. And let's do this just before forking. At this moment we
are sure, that we've already done all init operations, where we could emit
"ha_warning", and we still have stderr fd opened.
Even with the second check, we could lost some late and rare warnings
about failing to drop supplementary groups and about re-enabling core dumps.
Notes about this are added into 'zero-warning' keyword description.
DOC: configuration: issuers-chain-path is compatible with OCSP
Since patch f3dfd95a ("MEDIUM: ocsp: fix ocsp when the chain is loaded
from 'issuers-chain-path'") the OCSP features are compatible with
'issuers-chain-path'.
REGTESTS: ssl: test the issuers-chain-path keyword
Add a reg-test which test the completion of the issuers-chain-path
keyword
Note that it could be interesting to have the loading of a .ocsp
combined with this, but our pki for OCSP tests lacks
the SubjectKeyIdentifier extensions.
MINOR: ssl: change issuers-chain for show_cert_detail()
Since data->chain is now completed when loading the files, we don't need
to use ssl_get0_issuer_chain() anywhere else in the code.
data->chain will always be completed once the files are loaded, but we
can't know from show_cert_detail() from what chain file it was completed.
That's why the extra_chain pointer was added to dump the chain file.
MEDIUM: ocsp: fix ocsp when the chain is loaded from 'issuers-chain-path'
This fixes OCSP, when issuer chain is in a separate PEM file. This is a
case of issuers-chain-path keyword, which points to folder that contains only
PEM with RootCA and IntermediateCA.
Before this patch, the chain from 'issuers-chain-path' was applied
directly to the SSL_CTX without being applied to the data->chain
structure. This would work for SSL traffic, but every tests done with
data->chain would fail, OCSP included, because the chain would be NULL.
This patch moves the loading of the chain from
ssl_sock_load_cert_chain(), which is the function that applies the chain
to the SSL_CTX, to ssl_sock_load_pem_into_ckch() which is the function
that loads the files into the ckch_data structure.
Fixes issue #2635 but it changes thing on the CLI, so that's not
backportable.
Most of the time all sink applets (which are responsible for relaying
messages from the ring to the tcp servers endpoints) would end up being
assigned to the first available thread (tid:0), resulting in excessive CPU
usage on a single thread when multiple sink servers were defined (no
matter if they were defined over multiple "ring" sections) and significant
message load was pushed through them over the ring API.
This patch is similar to 34e4085f ("MEDIUM: peers: Balance applets across
threads") but for sinks. We use a slightly different approach, which is to
elect a random thread instead of picking the one with leasts applets. This
proves to be already sufficient to alleviate the issue.
In the case we want to have a better load distribution we should consider
breaking existing connections to reestablish them on a new thread when we
find out that they start monopolizing a cpu thread (ie: after a certain
amount of messages for instance). Also check tcpchecks migrating model for
inspiration.
This patch depends on the previous one ("MEDIUM: sink: start applets
asynchronously").
Since d9c1d33fa1 ("MEDIUM: applet: Add support for async appctx startup
on a thread subset"), it is now possible to delay appctx's init: for that
it is required that the .init callback is defined on the applet.
When the applet will be processed on the first run, applet API will
automatically finish the applet initialization. Thus we explicitly
call appctx_wakeup() on the applet to schedule it for initial run
instead of calling appctx_init() ourselves.
This is done in prevision of the next patch in order to be able to
schedule the applet on a different thread from the one executing
sink_forward_session_create() function.
Note: 'out_free_appctx' label was removed since it is no longer used.
BUG/MAJOR: mux-h2: force a hard error upon short read with pending error
A risk of truncated packet was addressed in 2.9 by commit 19fb19976f
("BUG/MEDIUM: mux-h2: Only Report H2C error on read error if demux
buffer is empty") by ignoring CO_FL_ERROR after a recv() call as long
as some data remained present in the buffer. However it has a side
effect due to the fact that some frame processors only deal with full
frames, for example, HEADERS. The side effect is that an incomplete
frame will not be processed and will remain in the buffer, preventing
the error from being taken into account, so the I/O handler wakes up
the H2 parser to handle the error, and that one just subscribes for
more data, and this loops forever wasting CPU cycles.
Note that this only happens with errors at the SSL layer exclusively,
otherwise we'd have a read0 pending that would properly be detected:
The condition to report the error in h2_recv() needs to be refined, so
that connection errors are taken into account either when the buffer is
empty, or when there's an incomplete frame, since we're certain it will
never be completed. We're certain to enter that function because
H2_CF_DEM_SHORT_READ implies too short a frame, and earlier there's a
protocol check to validate that no frame size is larger than bufsize,
hence a H2_CF_DEM_SHORT_READ implies there's some room left in the
buffer and we're allowed to try to receive.
The condition to reproduce the bug seems super hard to meet but was
observed once by Patrick Hemmer who had the reflex to capture lots of
information that allowed to explain the problem. In order to reproduce
it, the SSL code had to be significantly modified to alter received
contents at very empiric places, but that was sufficient to reproduce
it and confirm that the current patch works as expected.
The bug was tagged MAJOR because when it triggers there's no other
solution to get rid of it but to restart the process. However given how
hard it is to trigger on a lab, it does not seem very likely to occur
in field.
BUG/MEDIUM: ssl_sock: fix deadlock in ssl_sock_load_ocsp() on error path
We could run under heavy load in containers or on premises and some automatic
tool in parallel could use CLI to check OCSP updates statuses or to upload new
OCSP responses. So, calloc() to store OCSP update callback arguments may fail
and ocsp_tree_lock need to be unlocked, when exiting due to this failure.
This needs to be backported in all stable versions until v2.4.0 included.
It's usefull to keep runtime limits (fd and RAM) in postmortem and show them in
debug_parse_cli_show_dev(). Runtime limits are fed in feed_post_mortem_late(),
as we are sure that at this moment that all configuration was parsed and
all applied limits were alredy adjusted.
This is a preparation patch to extend postmortem in order to store runtime
limits. No need to perform getrlimit() in feed_post_mortem(), as we do this
in the very beginning of main() and we store initial fd limits in global
'rlim_fd_cur_at_boot' and 'rlim_fd_max_at_boot' variables.
It is more handy to use LIM2A in debug_parse_cli_show_dev(), as it allows to
show a custom string ("unlimited"), if a given limit value equals to 0.
normalize_rlim() handler is needed to convert properly RLIM_INFINITY to zero,
with the respect of type sizes, as rlim_t is always 4 bytes on 32bit and
64bit arch.
MINOR: debug: keep runtime capabilities in post_mortem
Let's extend postmortem to keep process runtime capabilities. This information
is gathered in feed_post_mortem_late(), as it is called just before
run_poll_loop() and we are sure at this moment, that all configuration
settings were successfully applied.
Let's extend post_mortem to store runtime process uid and gid.
This information is fed in feed_post_mortem_late(), just before calling
run_poll_loop(). Like this we are sure that all configuration settings were
successfully applied.
Process runtime information could be very useful in post_mortem, but we have to
collect it just before calling run_poll_loop(). Like this we are sure, that
we've successfully applied all configuration parameters and what we've
collected are the latest runtime settings.
The most appropraite place to collect such information is
feed_post_mortem_late(). It's called in each thread, but puts thread info in
the post_mortem only when it's in the last thread context. As it's called
under mutex lock, other threads at this moment have to wait until
feed_post_mortem_late() and another initialization functions from
per_thread_init_list will finish. The number of threads could be large. So, to
avoid spending a lot of time under the lock, let's exit immediately from
feed_post_mortem_late(), if it wasn't called in the last thread.
BUG/MEDIUM: debug/cli: fix "show threads" crashing with low thread counts
The "show threads" command introduced early in the 2.0 dev cycle uses
appctx->st1 to store its context (the number of the next thread to dump).
It goes back to an era where contexts were shared between the various
applets and the CLI's command handlers.
In fact it was already not good by then because st1 could possibly have
APPCTX_CLI_ST1_PAYLOAD (2) in it, that would make the dmup start at
thread 2, though it was extremely unlikely.
When contexts were finally cleaned up and moved to their own storage,
this one was overlooked, maybe due to using st1 instead of st2 like
most others. So it continues to rely on st1, and more recently some
new flags were appended, one of which is APPCTX_CLI_ST1_LASTCMD (16)
and is always there. This results in "show threads" to believe it must
start do dump from thread 16, and if this thread is not present, it can
simply crash the process. A tiny reproducer is:
global
nbthread 1
stats socket /tmp/sock1 level admin mode 666
$ socat /tmp/sock1 - <<< "show threads"
The fix for modern versions simply consists in assigning a context to
this command from the applet storage. We're using a single int, no need
for a struct, an int* will do it. That's valid till 2.6.
Prior to 2.6, better switch to appctx->ctx.cli.i0 or i1 which are all
properly initialized before the command is executed.
This must be backported to all stable versions.
Thanks to Andjelko Horvat for the report and the reproducer.
BUG/MINOR: do not close uninit FD in quic_test_socketops()
On startup, quic_test_socketops() is called to ensure that chosen
configuration option are compatible with UDP system stack. A dummy FD is
allocated to invoke various setsockopt() settings.
If no tests are required, FD is not allocated. In this case, close()
should not be close. This is mostly for better coding as this does not
cause any real issue for users.
MINOR: server: better mt_list usage for node migration (prev_deleted handling)
Now that mt_list v2 api was merged into haproxy's codebase in 4e65fc6 ("
MAJOR: import: update mt_list to support exponential back-off (try #2)"),
let's fix a hack in cli_parse_delete_server() which abused from mt_list
api to migrate an element from one list to another: there used to be a
tiny race there between the pop and the append operations, race that was
compensated by the fact that it was performed under full thread isolation.
However that was a bad example of the mt_list API which could have
resulted in actual bug if the code was duplicated elsewhere without thread
isolation. To fix this, we now make use of the
MT_LIST_FOR_EACH_ENTRY_LOCKED() macro which allows us to simply migrate
the current element to another list since the element is appended into
another one while still in busy state and then unlinked from the original
list.
MINOR: fd: don't scan the full fdtab on all threads
During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, it starts to represent
quite a number of loops, and usually the fdtab array doesn't entirely
fit in the CPU's L3 cache, causing extra memory accesses.
It's particularly visible when issuing debugging commands to the CLI
because usually the first one fails while the CPU is at 100% for half
a second (which also is socat's timeout). A quick test with this:
And the following script started in another window:
while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done
shows that it takes 1.58s for the socat instance that succeeds on an
Ampere Altra with 80 cores, this requires to change the timeout (defaults
to half a second) otherwise it returns nothing. In addition it also means
that during reloads, some CPU spikes will be noticed.
Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.
In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.
A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.
This patch takes a different approach. It consists in keeping a thread-local
max ("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is not correct once all threads have started,
but it will remain valid during boot since the same value is used during
startup and is cloned for each thread, and no scheduling happens anywhere
during this period, so that all threads are aware of the highest FD they've
seen registered, even if it had been done in some init code, and this without
having to deal with a shared variable.
Here on the test platform, the script gets its response in 10ms vs 1580
before.
BUILD: mux-spop: fix build failure on gcc 4-10 and clang
A label at end of block was added in mux_spop.c in function
spop_conn_update_timeout() by commit 7e1bb7283b ("MEDIUM: mux-spop:
Introduce the SPOP multiplexer"). This is normally not permitted,
so gcc-4 to 10 and clang whine about it:
CC src/mux_spop.o
src/mux_spop.c: In function 'spop_conn_update_timeout':
src/mux_spop.c:899:2: error: label at end of compound statement
899 | leave:
| ^~~~~
Let's just add a return there to make the compiler happy. No backport
is needed.
DOC: spoe: Update SPOE documentation to reflect recent refactoring
The SPOE was refactored. Several parameters were deprecated. Fragmentation
and async capabilities support were removed. The default log-format was
updated too.
So, the SPOE documentation was updated accordingly.
MEDIUM: spoe: Make the SPOE applet use its own buffers
The SPOE applet is rewritten to use its own buffers. It is not a huge change
because, once started, the only responsibility of the SPOE applet is to
transfer the ACK frame to the SPOE filter. So it means it does not send any
data to the opposite endpoint, the NOTIFY frame was already transferred
during the applet creation. And it does only receive one full frame. Once
received, it can exit.
MEDIUM: spoe: Forward SPOE context error to the SPOE applet
Errors triggered by a SPOE filter intance, mainly the processing timeout, are
now forwarded to the SPOE applet. This way, an error can be reported to the SPOP
mux stream to abort it early.
Note that, for now, no abort reaon is set because the SPOP connection is not
closed. Only the SPOP stream is aborted. But thanks to this patch, the SPOE
applet can be released immediately, instead of waiting for the ACK frame or an
error on the mux side.
MEDIUM: mux-spop: Announce the pipeling support if possible
Reintroduce the pipelining support. Everyting was alredy in place to be able
to multiplex the streams on a SPOP connection. Here, the pipelining support
is annonced and checked in the agent replies. A hard-coded limit to 20
streams is set if the pipelining is supported on both sides. Otherwise, it
is disabled and only one stream at a time is allowed.
Some conformance checks on received frames are added with this patch. Idea
is to detect invalid frames and ignore unknown ones if possible. All checks
are performed on the frame metatdata, mainly on the stream and the frame
identifiers.
MEDIUM: mux-spop/spoe: Save negociated max-frame-size value in the mux
The SPOE applet is just a pass-through now. It is no longer reponsible to
check the frame size. On the other hand, the SPOP multiplexer negociate the
maximum frame size with the agent. So, it seems logical to store this
negociated value in the mux and no longer in the applet context.
MEDIUM: spoe: Directly receive ACK frame in the SPOE context buffer
Just like the previous patch, here we avoid a buffer copy between the SPOE
applet and the SPOE filter for the ACK reply. The buffer from the SPOE
context is used to retrieve the ACK reply from the channel response buffer.
MEDIUM: spoe: Directly xfer NOTIFY frame when SPOE applet is created
Instead of using a buffer from the SPOE filter to store the NOTIFY frame, to
copy it in a trash buffer in the SPOE applet to add meta-data and then tranfer
it to the channel, the original buffer is directly transfered to the channel
during the SPOE applet creation.
The SPOE applet is thus simplied, the I/O handler is now only responsible to
retrieve the ACK reply.
MINOR: stats-html: Display reuse ratio for spop connections
Now SPOP connections can be reused, it could be pretty useful to know the
reuse rate. The corresponding backend and server counters are already
incremented, but not displayed on the stats HTML page. Thanks to this patch,
it is now possible to get it, just like for HTTP proxies.
MAJOR: mux-spop: Make the SPOP connections reusable
Thanks to this patch, SPOP connections can now be inserted in idle
connections list of the server or the session. There is no multiplexing by
SPOP connecitons can be reused. It is the same mechanics than for other
muxes. Noting really new. But it is a huge improvement.
MINOR: mux-spop: Use a dedicated function to update the SPOP connection timeout
Force the SPOP servers to use the SPOE engine identifier as pool connection
name. This way, idle SPOP connections, once implemented, of different engine
but using the same backend will not be mixed up.
MEDIUM: spoe: Force the reuse 'always' mode for SPOP backends
The reuse "always" mode is forced for SPOP backends. For now, SPOP
connections cannot be idle, but once implemented, thanks to this patch, it
will be possible to reuse SPOP connections.
MINOR: backend: Remove test on HTX streams to reuse idle connections on connect
In connect_server() function, there is a test to be able to reuse idle
connections for HTX streams only. Till now, only HTTP connections can be
idle. And this tests was added to be sure to now reuse idle connections for
legacy HTTP streams. But the legacy HTTP was removed in HAProxy-2.1. So we
can safely remove this test.
MEDIUM: spoe: Set a specific name for the connection pool of SPOP servers
With this patch, we force the connection pool name of SPOP server to the
SPOE engine identifier. This way, SPOP idle connections cannot be shared
between diffrente engines.
MINOR: spoe: Add internal sample fetch to retrieve the SPOE engine ID
The internal sample fetch "spoe.engine-id" is added. It may be used to
retrieve the current engine identifier, but only if the client endpoint is
an SPOE applet. For now, this sample is not documented. It will only be used
to set the connection pool name for a specific engine. This way, several
engine can use the same SPOP backend without sharing their idle connections.
The documentation will be added later, mainly because other SPOE sample
fetches will be added, and some changes are expected.
MAJOR: spoe: Rewrite SPOE applet to use the SPOP mux
It is the huge part of the series. The patch is not so huge, it removes
functions to produce or consume frames. The SPOE applet is pretty light
now. But since this patch, the SPOP multiplexer is now used. The SPOP mode
is now automatically ised for SPOP backends. So if there are bugs in the
SPOP multiplexer, they will be visible now.
MEDIUM: check/spoe: Use SPOP multiplexer to perform SPOP health-checks
The SPOP health-checks are now performed using the SPOP multiplexer. This
will be fixed later, but for now, it is considered as a L4 health-check and
no specific status code is reported. It means the corresponding vtest script
is marked as broken for now.
Functionnaly speaking, the same is performed. A connection is opened, a
HELLO frame is sent to the agent and we wait for the HELLO frame from the
agent in reply. But only L4OK, L4KO or L4TOUT will be reported.
It is no possible yet to use it. Idles connections and pipelining mode are
not supported for now. But it should be possible to open a SPOP connection,
perform the HELLO handshake, send a NOTIFY frame based on data produced by
the client side and receive the corresponding ACK frame to transfer its
content to the client side.
MINOR: spoe: Move all stuff regarding the filter/applet in the C file
Structures describing the SPOE applet context, the SPOE filter configuration
and context and the SPOE messages and groups are moved in the C file. In
spoe-t.h file, it remains the structure describing an SPOE agent and flags
used by both sides.
In addition, the SPOE frontend, created for a given SPOE engine, is moved
from the SPOE filter configuration to the SPOE agent structure.
MINOR: spoe: Dynamically alloc the message list per event of an agent
The inline array used to store, the configured messages per event in the
SPOE agent structure, is replaced by a dynamic array, allocated during the
configuration parsing. The main purpose of this change is to be able to move
all stuff regarding the SPOE filter and applet in the C file.
MINOR: spoe: Rename some flags and constant to use SPOP prefix
A SPOP multiplexer will be added. Many flags, constants and structures will
be remove from the applet scope. So the "SPOP" prefix is used instead of
"SPOE", to be consistent.
MINOR: stconn: Use a dedicated function to get the opposite sedesc
se_opposite() function is added to let an endpoint retrieve the opposite
endpoint descriptor. Muxes supportng the zero-copy forwarding can now use
it. The se_shutdown() function too. This will be use by the SPOP multiplexer
to be able to retrieve the SPOE agent configuration attached to the applet
on client side.
MEDIUM: applet: Add a .shut callback function for applets
Applets can now define a shutdown callback function, just like the
multiplexer. It is especially usefull to get the abort reason. This will be
pretty useful to get the status code from the SPOP stream to report it at
the SPOe filter level.
The SPOE was significantly lightened. It is now possible to refactor it to
use a dedicated multiplexer. The first step is to add a SPOP mode for
proxies. The corresponding multiplexer mode is also added.
For now, there is no SPOP multiplexer, so it is only declarative. But at the
end, the SPOP multiplexer will be automatically selected for servers inside
a SPOP backend.
MAJOR: spoe: Remove idle applets and pipelining support
Management of idle applets is removed. Consequently, the pipelining support
is also removed. It is a huge change but it should be transparent for the
agents, except regarding the performances. Of course, being able to reuse
already openned connections and being able to multiplex frames on a given
connection is a must have. These features will be restored later.
hello and idle timeout are not longer used. Because an applet is spawned to
process a NOTIFY frame and closed after receiving the ACK reply, the
processing timeout is the only one required. In addition, the parameters to
limit the SPOE applet creation are no longer used too.
All the SPOE debugging is removed. The code will be easier to rework this
way and the debugging will be mainly moved in the SPOP multiplexter via the
trace API.
MINOR: spoe: Use only a global engine-id per agent
Because the async mode was removed, it is no longer mandatory to announce a
different engine identifiers per thread for a given SPOE agent. This was
used to be sure requests and the corresponding responses are stuck on the
same thread.
So, now, a SPOE agent only announces one engine identifier on all
connections. No changes should be expected for agents.
The support for asynchronous mode, the ability to send messages on a
connection and receive the responses on any other connections, is removed.
It appears this feature was a bit overkill. And it is a problem for this
refactoring. This feature is removed and will not be restored at the end.
It is not a big deal for agent supporting the async mode because it is
usable if it is announced on both sides. HAProxy stops to announce it. This
should be transparent for agents.
It is the first patch of a long series to refactor the SPOE filter. The idea
is to rely on a dedicated multiplexer instead of hakcing HAProxy with a list
of applets processing a message queue.
First of all, optionnal features will be removed. Some will be restored at
the end, some others will just be removed. It is the case here. The frame
fragmentation support is removed. The only purpose of this feature is to be
able to support the streaming. Because it is out of the scope of this
refactoring, the fragmentation is removed.
BUG/MINOR: session: Eval L4/L5 rules defined in the default section
It is possible to define TCP/HTTP rules in a named default section to
inherit from it in a proxy. However, there is an issue with L4/L5 rules.
Only the lists of the current frontend are checked to know if an eval must
be performed. Nothing is done for an empty list. Of course, the lists of the
default proxy must also be checked to be sure to not ignored default L4/L5
rules. It is now fixed.
This patch should fix the issue #2637. It must be backported as far as 2.6.
This commit is the renaming counterpart of the previous one, this time
for quic_conn module. Several elements related to TID affinity update
from quic_conn has been renamed : public functions, but also flag
renamed to QUIC_FL_CONN_TID_REBIND and trace event to
QUIC_EV_CONN_BIND_TID.
This should be backported with the same instruction as the previous
commit.
Since the following patch, protocol API to update a connection TID
affinity has been extended.
commit 1a43b9f32c71267e3cb514aa70a13c75adb20742
MINOR: proto: extend connection thread rebind API
The single callback set_affinity has been splitted in 3 different
functions which are called at different stages during listener_accept(),
depending on accept queue push success or not. However, the naming was
rendered confusing by the usage of function prefix 1 and 2.
Rename proto callback related to TID affinity update and use the
following names :
This commit should probably be backported at least up to 3.0 with the
above patch. This is because the fix was recently backported and it
would allow to keep changes minimal between the two versions. It could
even be backported up to 2.8 if there is no major conflict.
BUG/MEDIUM: bwlim: Be sure to never set the analyze expiration date in past
Every time a bandwidth limitation is evaluated on a channel, the analyze
expiration date is renewed, mainly based on the internal bandwidth
limitation filter expiration date. However, when the filter is called while
there is no data to filter, we skip all limitation computations to jump at
the end of the function. At this stage, the analyze expiration date is
renewed before exiting. But here the internal expiration date may be expired
and not reset.
To sum up, it is possible to set the analyze expiration date of a channel in
the past. It is unexpected and this could lead to a loop in process_stream.
To fix the issue, we just now take care to reset the internal expiration
date, if needed, before exiting.
This patch should fix the issue #2634. It must be backported as far as 2.8.
MINOR: quic: add counters of sent bytes with and without GSO
Add a sent bytes counter for each quic_conn instance. A secondary field
which only account bytes sent via GSO which is useful to ensure if this
is activated.
For the moment, these counters are reported on "show quic" but not
aggregated on proxy quic module stats.
UDP GSO on Linux is not implemented in every network devices. For
example, this is not available for veth devices frequently used in
container environment. In such case, EIO is reported on send()
invocation.
It is impossible to test at startup for proper GSO support in this case
as a listener may be bound on multiple network interfaces. Furthermore,
network interfaces may change during haproxy lifetime.
As such, the only option is to react on send syscall error when GSO is
used. The purpose of this patch is to implement a fallback when
encountering such conditions. Emission can be retried immediately by
trying to send each prepared datagrams individually.
To support this, qc_send_ppkts() is able to iterate over each datagram
in a so-called non-GSO fallback mode. Between each emission, a datagram
header is rewritten in front of the buffer which allows the sending loop
to proceed until last datagram is emitted.
To complement this, quic_conn listener is flagged on first GSO send
error with value LI_F_UDP_GSO_NOTSUPP. This completely disables GSO for
all future emission with QUIC connections using this listener.
For the moment, non-GSO fallback mode is activated when EIO is reported
after GSO has been set. This is the error reported for the veth usage
described above.
Amaury Denoyelle [Thu, 30 May 2024 13:15:14 +0000 (15:15 +0200)]
MAJOR: quic: support GSO when encoding datagrams
QUIC datagrams are encoded during emission via the function
qc_prep_pkts(). By default, if GSO is not used, each datagram is
prefixed by a metadata header which specify its length and address of
its first quic_tx_packet instance.
If GSO is activated, metadata header won't be inserted for datagrams
following the first one sent in a single syscall. Length field will
contain the total size of these datagrams. This allows to support both
GSO and non-GSO prepared datagram in the same Tx buffer.
qc_send_ppkts() is invoked just after datagrams encoding. It iterates
over each metadata header in Tx buffer to sent each datagram
individually. If length field is bigger than network MTU, GSO usage is
assumed and qc_snd_buf() GSO parameter will be set.
Another important point to note regarding GSO implementation is that
during datagram encoding, packets from the same datagram instance are
attached together. However, if using GSO, consecutive packets from
different datagrams are also linked, but without
QUIC_FL_TX_PACKET_COALESCED flag. This allows to properly update
quic_conn status with all sent packets in qc_send_ppkts(). Packets from
different datagrams are then unlinked to treat them separately when
receiving corresponding ACK frames.
Amaury Denoyelle [Tue, 28 May 2024 13:04:45 +0000 (15:04 +0200)]
MINOR: quic: add GSO parameter on quic_sock send API
Add <gso_size> parameter to qc_snd_buf(). When non-null, this specifies
the value for socket option SOL_UDP/UDP_SEGMENT. This allows to send
several datagrams in a single call by splitting data multiple times at
<gso_size> boundary.
For now, <gso_size> remains set to 0 by caller, as such there should not
be any functional change.
Amaury Denoyelle [Thu, 30 May 2024 12:49:46 +0000 (14:49 +0200)]
MINOR: quic: define quic_cc_path MTU as constant
Future commits will implement GSO support to be able to emit multiple
datagrams in a single syscall invocation. This will be used every time
there is more data to sent than the UDP network MTU.
No change will be done for Tx buffer encoding, in particular when using
extra metadata datagram header. When GSO will be used, length field will
contain the total length of all datagrams to emit in a single GSO
syscall send. As such, QUIC send functions will detect that GSO is in
use if total length is greater than MTU.
This last assumption forces to ensure that MTU is constant. Indeed, in
case qc_send() is interrupted, Tx buffer will be left with prepared
datagrams. These datagrams will be emitted at the next qc_send()
invocation. If MTU would change during these two calls, it would be
impossible to know if GSO was used or not. To prevent this, mark <mtu>
field of quic_cc_path as constant.
Amaury Denoyelle [Thu, 30 May 2024 12:49:25 +0000 (14:49 +0200)]
MINOR: quic: activate UDP GSO for QUIC if supported
Add a startup test for GSO support in quic_test_socketopts() and
automatically activate it in qc_prep_pkts() when building datagrams as
big as MTU.
Also define a new config option tune.quic.disable-udp-gso. This is
useful to prevent warning on older platform or to debug an issue which
may be related to GSO.
Amaury Denoyelle [Wed, 26 Jun 2024 15:09:08 +0000 (17:09 +0200)]
MINOR: quic: extend detection of UDP API OS features
QUIC haproxy implementation relies on specific OS features to activate
some UDP optimization. One of these is the ability to bind multiple
sockets on the same address, which is necessary to have a dedicated
socket for each QUIC connections. This feature support is tested during
startup via an internal proto-quic function. It automatically deactivate
socket per connection if OS is not compatible.
The purpose of this patch is to render this QUIC feature detection code
more generic. Function is renamed quic_test_socketopts() and is still
invoked on startup. Its internal code has been refactored to be able to
implement other features support test in it.
Return value has also been changed and is now taken into account. In
case of ERR_FATAL, haproxy startup will be interrupted. This happens on
socket() syscall failure used to duplicate a QUIC listener FD.
This commit will become necessary to detect GSO support on startup.