It is not possible to use it yet. Idle connections and the pipelining mode are
not supported for now. But it should be possible to open a SPOP connection,
perform the HELLO handshake, send a NOTIFY frame based on data produced by
the client side and receive the corresponding ACK frame to transfer its
content to the client side.
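As a reminder, this exchange relies on a small set of SPOP frame types. Here is
a minimal sketch of them; the numeric values come from the SPOE protocol
documentation (doc/SPOE.txt), while the enum and constant names are
illustrative, not HAProxy's:

    /* SPOP frame types (values from doc/SPOE.txt; names illustrative) */
    enum spop_frame_type_sketch {
            SPOP_FRM_HAPROXY_HELLO  = 1,   /* opens the connection      */
            SPOP_FRM_HAPROXY_DISCON = 2,
            SPOP_FRM_NOTIFY         = 3,   /* messages to process       */
            SPOP_FRM_AGENT_HELLO    = 101, /* agent handshake reply     */
            SPOP_FRM_AGENT_DISCON   = 102,
            SPOP_FRM_ACK            = 103, /* actions to apply in reply */
    };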
MINOR: spoe: Move all stuff regarding the filter/applet in the C file
The structures describing the SPOE applet context, the SPOE filter
configuration and context, and the SPOE messages and groups are moved into the
C file. The spoe-t.h file now only retains the structure describing an SPOE
agent and the flags used by both sides.
In addition, the SPOE frontend, created for a given SPOE engine, is moved
from the SPOE filter configuration to the SPOE agent structure.
MINOR: spoe: Dynamically alloc the message list per event of an agent
The inline array used to store the configured messages per event in the
SPOE agent structure is replaced by a dynamic array, allocated during
configuration parsing. The main purpose of this change is to be able to move
everything regarding the SPOE filter and applet into the C file.
MINOR: spoe: Rename some flags and constants to use the SPOP prefix
A SPOP multiplexer will be added. Many flags, constants and structures will
be removed from the applet scope. So the "SPOP" prefix is now used instead of
"SPOE", to be consistent.
MINOR: stconn: Use a dedicated function to get the opposite sedesc
The se_opposite() function is added to let an endpoint retrieve the opposite
endpoint descriptor. Muxes supporting zero-copy forwarding can now use it,
and so can the se_shutdown() function. This will be used by the SPOP
multiplexer to retrieve the SPOE agent configuration attached to the applet
on the client side.
MEDIUM: applet: Add a .shut callback function for applets
Applets can now define a shutdown callback function, just like the
multiplexers. It is especially useful to get the abort reason. This will be
pretty useful to get the status code from the SPOP stream and report it at
the SPOE filter level.
The SPOE was significantly lightened. It is now possible to refactor it to
use a dedicated multiplexer. The first step is to add a SPOP mode for
proxies. The corresponding multiplexer mode is also added.
For now, there is no SPOP multiplexer, so it is only declarative. But in the
end, the SPOP multiplexer will be automatically selected for servers inside
a SPOP backend.
MAJOR: spoe: Remove idle applets and pipelining support
Management of idle applets is removed. Consequently, pipelining support is
also removed. It is a huge change but it should be transparent for the
agents, except regarding performance. Of course, being able to reuse
already opened connections and to multiplex frames on a given connection is
a must-have. These features will be restored later.
The hello and idle timeouts are no longer used. Because an applet is spawned
to process a NOTIFY frame and closed after receiving the ACK reply, the
processing timeout is the only one required. In addition, the parameters
limiting SPOE applet creation are no longer used either.
All the SPOE debugging is removed. The code will be easier to rework this
way, and the debugging will mainly be moved into the SPOP multiplexer via
the trace API.
MINOR: spoe: Use only a global engine-id per agent
Because the async mode was removed, it is no longer mandatory to announce a
different engine identifier per thread for a given SPOE agent. This was
used to make sure that requests and the corresponding responses were stuck
to the same thread.
So, now, a SPOE agent only announces one engine identifier on all
connections. No changes should be expected for agents.
The support for the asynchronous mode, i.e. the ability to send messages on
one connection and receive the responses on any other connection, is removed.
It appears this feature was a bit overkill, and it is a problem for this
refactoring. This feature is removed and will not be restored at the end.
It is not a big deal for agents supporting the async mode because it is only
usable if it is announced on both sides. HAProxy simply stops announcing it,
which should be transparent for agents.
This is the first patch of a long series to refactor the SPOE filter. The
idea is to rely on a dedicated multiplexer instead of hacking HAProxy with a
list of applets processing a message queue.
First of all, optional features will be removed. Some will be restored at
the end, some others will just be dropped. It is the case here. The frame
fragmentation support is removed. The only purpose of this feature is to
support streaming. Because that is out of the scope of this refactoring, the
fragmentation is removed.
BUG/MINOR: session: Eval L4/L5 rules defined in the default section
It is possible to define TCP/HTTP rules in a named defaults section and
inherit from it in a proxy. However, there is an issue with L4/L5 rules.
Only the lists of the current frontend are checked to know if an eval must
be performed; nothing is done for an empty list. Of course, the lists of the
default proxy must also be checked to be sure not to ignore default L4/L5
rules. It is now fixed.
This patch should fix the issue #2637. It must be backported as far as 2.6.
This commit is the renaming counterpart of the previous one, this time
for the quic_conn module. Several elements related to the TID affinity
update of quic_conn have been renamed: public functions, but also the flag,
renamed to QUIC_FL_CONN_TID_REBIND, and the trace event, renamed to
QUIC_EV_CONN_BIND_TID.
This should be backported with the same instructions as the previous
commit.
Since the following patch, the protocol API to update a connection's TID
affinity has been extended.
commit 1a43b9f32c71267e3cb514aa70a13c75adb20742
MINOR: proto: extend connection thread rebind API
The single callback set_affinity has been split into 3 different
functions which are called at different stages during listener_accept(),
depending on accept queue push success or not. However, the naming was
rendered confusing by the 1 and 2 numbering of the functions.
Rename the proto callbacks related to TID affinity update and use the
following names:
This commit should probably be backported at least up to 3.0 with the
above patch. This is because the fix was recently backported and it
would allow keeping changes minimal between the two versions. It could
even be backported up to 2.8 if there is no major conflict.
BUG/MEDIUM: bwlim: Be sure to never set the analyze expiration date in the past
Every time a bandwidth limitation is evaluated on a channel, the analyze
expiration date is renewed, mainly based on the internal bandwidth
limitation filter expiration date. However, when the filter is called while
there is no data to filter, we skip all limitation computations and jump to
the end of the function. At this stage, the analyze expiration date is
renewed before exiting. But here the internal expiration date may be expired
and not reset.
To sum up, it is possible to set the analyze expiration date of a channel in
the past. It is unexpected and this could lead to a loop in process_stream.
To fix the issue, we now take care to reset the internal expiration date, if
needed, before exiting.
This patch should fix the issue #2634. It must be backported as far as 2.8.
MINOR: quic: add counters of sent bytes with and without GSO
Add a sent bytes counter to each quic_conn instance, plus a secondary field
which only accounts for bytes sent via GSO. The latter is useful to verify
whether GSO is effectively used.
For the moment, these counters are reported on "show quic" but not
aggregated in the proxy quic module stats.
UDP GSO on Linux is not implemented by every network device. For
example, it is not available for veth devices, frequently used in
container environments. In such a case, EIO is reported on send()
invocation.
It is impossible to test for proper GSO support at startup in this case,
as a listener may be bound on multiple network interfaces. Furthermore,
network interfaces may change during haproxy's lifetime.
As such, the only option is to react on a send syscall error when GSO is
used. The purpose of this patch is to implement a fallback when
encountering such conditions. Emission can be retried immediately by
sending each prepared datagram individually.
To support this, qc_send_ppkts() is able to iterate over each datagram
in a so-called non-GSO fallback mode. Between each emission, a datagram
header is rewritten in front of the buffer, which allows the sending loop
to proceed until the last datagram is emitted.
To complement this, the quic_conn listener is flagged on the first GSO send
error with the value LI_F_UDP_GSO_NOTSUPP. This completely disables GSO for
all future emissions on QUIC connections using this listener.
For the moment, the non-GSO fallback mode is activated when EIO is reported
after GSO has been set. This is the error reported for the veth usage
described above.
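A sketch of the fallback logic described above (all helper and field names
here are hypothetical, for illustration only):

    /* On EIO while GSO is in use, disable GSO on this listener for good
     * and retry each prepared datagram individually. */
    ret = send_with_gso(fd, buf, total_len, gso_size);      /* hypothetical */
    if (ret < 0 && errno == EIO && gso_size) {
            li->flags |= LI_F_UDP_GSO_NOTSUPP;          /* no more GSO here */
            for (dgram = first_dgram(buf); dgram; dgram = next_dgram(dgram))
                    if (send_one(fd, dgram->data, dgram->len) < 0)
                            break;                      /* genuine error */
    }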
Amaury Denoyelle [Thu, 30 May 2024 13:15:14 +0000 (15:15 +0200)]
MAJOR: quic: support GSO when encoding datagrams
QUIC datagrams are encoded during emission via the qc_prep_pkts()
function. By default, if GSO is not used, each datagram is prefixed by a
metadata header which specifies its length and the address of its first
quic_tx_packet instance.
If GSO is activated, the metadata header is not inserted for datagrams
following the first one sent in a single syscall. The length field then
contains the total size of these datagrams. This makes it possible to
support both GSO and non-GSO prepared datagrams in the same Tx buffer.
qc_send_ppkts() is invoked just after datagram encoding. It iterates
over each metadata header in the Tx buffer to send each datagram
individually. If the length field is bigger than the network MTU, GSO usage
is assumed and the qc_snd_buf() GSO parameter is set.
Another important point to note regarding the GSO implementation is that
during datagram encoding, packets from the same datagram instance are
attached together. However, when using GSO, consecutive packets from
different datagrams are also linked, but without the
QUIC_FL_TX_PACKET_COALESCED flag. This makes it possible to properly update
the quic_conn status with all sent packets in qc_send_ppkts(). Packets from
different datagrams are then unlinked to treat them separately when
receiving the corresponding ACK frames.
Amaury Denoyelle [Tue, 28 May 2024 13:04:45 +0000 (15:04 +0200)]
MINOR: quic: add GSO parameter on quic_sock send API
Add a <gso_size> parameter to qc_snd_buf(). When non-null, it specifies
the value for the socket option SOL_UDP/UDP_SEGMENT. This makes it possible
to send several datagrams in a single call by splitting the data at each
<gso_size> boundary.
For now, <gso_size> remains set to 0 by callers, so there should not be any
functional change.
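For reference, here is a minimal, self-contained sketch of how UDP GSO is
typically requested per call on Linux (>= 4.18) via a control message; this
is only an illustration, not HAProxy's qc_snd_buf():

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef SOL_UDP
    #define SOL_UDP IPPROTO_UDP     /* same value (17) */
    #endif
    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103         /* from linux/udp.h */
    #endif

    /* Send <len> bytes, split by the kernel into datagrams of <gso_size>. */
    static ssize_t snd_buf_gso(int fd, const void *buf, size_t len,
                               uint16_t gso_size)
    {
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
            char ctrl[CMSG_SPACE(sizeof(uint16_t))] = { 0 };
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };
            struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

            cm->cmsg_level = SOL_UDP;
            cm->cmsg_type  = UDP_SEGMENT;
            cm->cmsg_len   = CMSG_LEN(sizeof(uint16_t));
            memcpy(CMSG_DATA(cm), &gso_size, sizeof(uint16_t));
            return sendmsg(fd, &msg, 0);
    }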
Amaury Denoyelle [Thu, 30 May 2024 12:49:46 +0000 (14:49 +0200)]
MINOR: quic: define quic_cc_path MTU as constant
Future commits will implement GSO support to be able to emit multiple
datagrams in a single syscall invocation. This will be used every time
there is more data to send than the UDP network MTU.
No change will be done to the Tx buffer encoding, in particular when using
the extra metadata datagram header. When GSO is used, the length field will
contain the total length of all datagrams to emit in a single GSO send
syscall. As such, QUIC send functions will detect that GSO is in use if the
total length is greater than the MTU.
This last assumption forces us to ensure that the MTU is constant. Indeed,
in case qc_send() is interrupted, the Tx buffer is left with prepared
datagrams. These datagrams will be emitted at the next qc_send()
invocation. If the MTU changed between these two calls, it would be
impossible to know whether GSO was used or not. To prevent this, mark the
<mtu> field of quic_cc_path as constant.
Amaury Denoyelle [Thu, 30 May 2024 12:49:25 +0000 (14:49 +0200)]
MINOR: quic: activate UDP GSO for QUIC if supported
Add a startup test for GSO support in quic_test_socketopts() and
automatically activate it in qc_prep_pkts() when building datagrams as
big as the MTU.
Also define a new config option, tune.quic.disable-udp-gso. This is
useful to prevent warnings on older platforms or to debug an issue which
may be related to GSO.
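A minimal configuration sketch using the new option (its global-section
placement and argument-less form are assumed here from the tune.* naming
convention):

    global
        tune.quic.disable-udp-gso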
Amaury Denoyelle [Wed, 26 Jun 2024 15:09:08 +0000 (17:09 +0200)]
MINOR: quic: extend detection of UDP API OS features
The QUIC haproxy implementation relies on specific OS features to activate
some UDP optimizations. One of these is the ability to bind multiple
sockets on the same address, which is necessary to have a dedicated
socket for each QUIC connection. This feature support is tested during
startup via an internal proto-quic function, which automatically deactivates
the socket-per-connection mode if the OS is not compatible.
The purpose of this patch is to render this QUIC feature detection code
more generic. The function is renamed quic_test_socketopts() and is still
invoked on startup. Its internal code has been refactored to be able to
implement other feature support tests in it.
The return value has also been changed and is now taken into account. In
case of ERR_FATAL, haproxy startup will be interrupted. This happens on
failure of the socket() syscall used to duplicate a QUIC listener FD.
This commit will be necessary to detect GSO support on startup.
Let's encapsulate the check of all process internal limits supported so far
in a separate function. This will help in cases when we simply need to check
whether even one limit is set in the configuration file. It's important, as
the default value for one limit (fd-hard-limit, for example) sometimes must
not affect the computation of the others.
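A sketch of what such a helper could look like (the field names are assumed
from the keywords discussed in this series, not copied from HAProxy's global
struct):

    /* Returns non-zero if at least one process-wide limit was set
     * explicitly in the configuration (field names assumed). */
    static int limits_any_set(void)
    {
            return global.maxconn || global.rlimit_nofile ||
                   global.fd_hard_limit || global.rlimit_memmax;
    }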
MINOR: haproxy: prepare to move limits-related code
This patch is done in order to prepare the move of the handlers that compute
and check process-related limits such as maxconn, maxsock and maxpipes.
These handlers are no longer static due to the future move.
We also add the handler declarations in limits.h in this patch, in order to
keep the next patch, dedicated to the code move, free of any additional
modifications.
Such a split also ensures that this patch can be compiled separately from
the next one, where the handlers are moved. This is important in case of
git-bisect.
As raise_rlim_nofile() was moved to the limits compilation unit, limits.h
now includes the system <sys/resource.h>. So, the definition of the rlimit
system type structure is no longer needed to compile the fd unit.
MINOR: limits: prepare to keep limits in one place
The code which gets, sets and checks initial and current fd limits and
process-related limits (maxconn, maxsock, ulimit-n, fd-hard-limit) is spread
across different functions in haproxy.c and fd.c. Let's group it together in
dedicated limits.c and limits.h files.
This patch is done in order to prepare the move of limits-related functions
from different places to the new 'limits' compilation unit. It helps to keep
the next patch clean, as it will only do the move without any additional
modifications.
Such a detailed split is needed in order to be sure not to accidentally
break the limits logic, and to be able to compile each commit separately in
case of git-bisect.
Released version 3.1-dev3 with the following main changes :
- BUG/MINOR: quic: Wrong datagram building when probing.
- BUG/MEDIUM: quic: fix possible exit from qc_check_dcid() without unlocking
- BUG/MINOR: promex: Remove Help prefix repeated twice for each metric
- DOC: configuration: add details about crt-store in bind "crt" keyword
- BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work with applet's buffers
- DOC: configuration: more details about the master-worker mode
- BUG/MEDIUM: server: fix race on server_atomic_sync()
- BUG/MINOR: jwt: don't try to load files with HMAC algorithm
- CLEANUP: quic: cleanup prototypes related to CIDs handling
- CLEANUP: quic: remove non-existing quic_cid_tree definition
- MINOR: quic: remove access to CID global tree outside of quic_cid module
- REORG: quic: remove quic_cid_trees reference from proto_quic
- MINOR: quic: add 2 BUG_ON() on datagram dispatch
- MINOR: quic: ensure quic_conn is never removed on thread affinity rebind
- MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
- DOC: configuration: update maxconn description
- MINOR: proto: extend connection thread rebind API
- BUG/MEDIUM: quic: prevent crash on accept queue full
- BUG/MEDIUM: peers: Fix crash when syncing learn state of a peer without appctx
- CI: add weekly QUIC Interop regression against LibreSSL
- DEV: flags/quic: decode quic_conn flags
- MINOR: quic: rename "ssl error" trace
- BUG/MEDIUM: init: fix fd_hard_limit default in compute_ideal_maxconn
- BUG/MINOR: jwt: fix variable initialisation
- MINOR: ssl/sample: ssl_c_san returns a comma separated list of SAN
- OPTIM: pool: improve needed_avg cache line access pattern
- MAJOR: import: update mt_list to support exponential back-off (try #2)
- CI: weekly QUIC Interop: try to fix private image
- BUG/MINOR: h1: Fail to parse empty transfer coding names
- BUG/MINOR: h1: Reject empty coding name as last transfer-encoding value
- BUG/MEDIUM: h1: Reject empty Transfer-encoding header
- BUG/MEDIUM: spoe: Be sure to create a SPOE applet if none on the current thread
- BUILD: listener: silence a build warning about unused value without threads
- DOC: architecture: remove the totally outdated architecture manual
- SCRIPTS: create-release: no more need to skip architecture.txt
DOC: architecture: remove the totally outdated architecture manual
We've discussed removing it many times and I thought it had been
removed long ago, but apparently not, as William proved to me. Let's get
rid of it now. It's totally outdated (last updated 18 years ago, when
laptop processors were still 32 bits) and mentions keywords and external
products that don't exist anymore. It's not even on docs.haproxy.org.
At some point, old stuff must really die.
BUILD: listener: silence a build warning about unused value without threads
A variable introduced in commit 1a43b9f32c ("MINOR: proto: extend
connection thread rebind API") is not used without threads and causes a
build warning. Let's just mark it maybe_unused.
Since the commit above is tagged for backporting, this one will need to
be backported along with it.
BUG/MEDIUM: spoe: Be sure to create a SPOE applet if none on the current thread
When a message is queued, waiting to be processed by a SPOE applet, some
heuristics decide whether a new applet must be created or not. There are
2 conditions to skip the applet creation:
1 - there are enough idle applets on the current thread, or,
2 - the processing rate on the current thread is high enough to handle
this new message
In the 2nd case, there is a flaw when the number of processed messages falls
to zero while the processing rate is still greater than zero. In that case,
we will skip the SPOE applet creation without checking that there is at
least one applet on the current thread.
So now, the conditions above to skip the SPOE applet creation are only
evaluated if there is at least one applet on the current thread.
This patch must be backported to all stable versions.
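A sketch of the corrected guard (all names here are hypothetical, for
illustration only):

    /* Only apply the creation-skipping heuristics when at least one
     * applet already runs on this thread. */
    if (agent->nb_applets_on_thread &&
        (enough_idle_applets(agent) || processing_rate_is_enough(agent)))
            return;       /* an existing applet will pick the message up */
    spoe_create_applet(agent);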
The Transfer-Encoding header lists the transfer codings that have been
applied to the content in order to form the message body. It is a list of
tokens, and as specified by RFC 9110, a token cannot be empty. When several
coding names are specified as a comma-separated value, this case is properly
handled and an error is triggered. However, an empty header value was just
skipped and no error was triggered. This could be an issue with some buggy
servers.
Now, empty Transfer-Encoding headers are rejected too.
BUG/MINOR: h1: Fail to parse empty transfer coding names
Empty transfer coding names, inside a comma-separated list, are already
rejected. But it is only by chance. Today, such a name is detected as an
unknown coding name (not "chunked", concretely). It is then handled by the
H1 multiplexer as an error and a 422-Unprocessable-Content response is
returned.
So, the error is properly detected in this case, but it is not accurate. A
400-Bad-Request response must be returned instead. So it is better to catch
the error during header parsing. That is the purpose of this patch.
Ilia Shipitsin [Tue, 9 Jul 2024 13:03:49 +0000 (15:03 +0200)]
CI: weekly QUIC Interop: try to fix private image
For some reason, the image built in the HAProxy workflow is "private": it
is successfully built, but fails to pull. Let's try an explicit docker login
for the run job as well.
Willy Tarreau [Thu, 30 May 2024 09:27:32 +0000 (11:27 +0200)]
MAJOR: import: update mt_list to support exponential back-off (try #2)
This is the second attempt at importing the updated mt_list code (commit
59459ea3). The previous one was attempted with commit c618ed5ff4 ("MAJOR:
import: update mt_list to support exponential back-off") but revealed
problems with QUIC connections and was reverted.
The problem that was faced was that elements deleted inside an iterator
were no longer reset, and if they were to be recycled in this form,
they could appear as busy to the next user. This was trivially reproduced
with this:
This was purely an API issue caused by the simplified usage of the macros
for the iterator. The original version had two backups (one full element
and one pointer) that the user had to take care of, while the new one only
uses one that is transparent for the user. But during removal, the element
still has to be unlocked if it's going to be reused.
It was also found that plenty of places just want to free the current
element, or delete it to do anything with it, hence don't need to reset
its pointers (e.g. event_hdl). Finally it appeared obvious that the
root cause of the problem was the unclear usage of the list iterators
themselves because one does not necessarily expect the element to be
presented locked when not needed, which makes the unlock easy to overlook
during reviews.
The updated version of the list presents explicit lock status in the
macro name (_LOCKED or _UNLOCKED suffixes). When using the _LOCKED
suffix, the caller is expected to unlock the element if it intends to
reuse it. At least the status is advertised. The _UNLOCKED variant,
instead, always unlocks it before starting the loop block. This means
it's not necessary to think about unlocking it, though it's obviously
not usable with everything. A few _UNLOCKED were used at obvious places
(i.e. where the element is deleted and freed without any prior check).
Interestingly, the tests performed last year on QUIC forwarding, which
resulted in limited traffic for the original version and a higher bit
rate for the new one, couldn't be reproduced because since then the QUIC
stack has gained in efficiency, and the 100 Gbps barrier is now reached
with or without the mt_list update. However the unit tests definitely
show a huge difference, particularly on EPYC platforms where the EBO
provides tremendous CPU savings.
Overall, the following changes are visible from the application code:
- mt_list_for_each_entry_safe() + 1 back elem + 1 back ptr
  => MT_LIST_FOR_EACH_ENTRY_LOCKED() or MT_LIST_FOR_EACH_ENTRY_UNLOCKED()
     + 1 back elem
- MT_LIST_DELETE_SAFE() is no longer needed in
  MT_LIST_FOR_EACH_ENTRY_UNLOCKED(); just manually set the iterator to NULL
  instead. For MT_LIST_FOR_EACH_ENTRY_LOCKED(), call mt_list_unlock_self()
  (if the element is going to be reused) then set the iterator to NULL.
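As a sketch of the two new iteration styles, based solely on the mapping
above (the argument layout — item, head, member, one backup element — is
assumed from this text, not copied from mt_list.h):

    struct mt_list back;
    struct item *it;    /* struct item has a 'struct mt_list list' member */

    /* _UNLOCKED: the element is presented unlocked; to delete it, free
     * it and just set the iterator to NULL. */
    MT_LIST_FOR_EACH_ENTRY_UNLOCKED(it, &queue, list, back) {
            free(it);
            it = NULL;              /* replaces MT_LIST_DELETE_SAFE() */
    }

    /* _LOCKED: the element is presented locked; unlock it explicitly
     * if it is going to be reused. */
    MT_LIST_FOR_EACH_ENTRY_LOCKED(it, &queue, list, back) {
            mt_list_unlock_self(&it->list);
            it = NULL;
    }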
OPTIM: pool: improve needed_avg cache line access pattern
On an AMD EPYC 3rd gen, 20% of the CPU is spent calculating the amount
of pools needed when using QUIC, because pool allocations/releases are
quite frequent and the inter-CCX communication is super slow. Still,
there's a way to save between 0.5 and 1% CPU by using fetch-add and
sub-fetch, which are converted to XADD so that the result is directly
fed into the swrate_add() argument without having to re-read the memory
area. That's what this patch does.
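A self-contained illustration of the access pattern (stand-in names, not
HAProxy's pool code):

    #include <stdatomic.h>

    static _Atomic unsigned int pool_used;

    static void swrate_add_sketch(unsigned int v) { (void)v; /* stand-in */ }

    void pool_get_one(void)
    {
            /* fetch-add compiles to a single XADD; its return value feeds
             * the averaging function directly, avoiding a second load of
             * <pool_used> from a possibly remote cache line. */
            unsigned int prev = atomic_fetch_add_explicit(&pool_used, 1,
                                                          memory_order_relaxed);
            swrate_add_sketch(prev + 1);
    }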
MINOR: ssl/sample: ssl_c_san returns a comma separated list of SAN
The ssl_c_san sample fetch returns the list of Subject Alt Names which was
presented by the client certificate.
The format is the same as that of the "openssl x509 -text" command: a
Description: Value list separated by commas.
The format is directly generated by the GENERAL_NAME_print() openssl
function.
BUG/MEDIUM: init: fix fd_hard_limit default in compute_ideal_maxconn
This commit fixes 41275a691 ("MEDIUM: init: set default for fd_hard_limit via
DEFAULT_MAXFD").
fd_hard_limit is taken into account implicitly via the 'ideal_maxconn' value
in all maxconn adjustments, when global.rlimit_memmax is set:
MIN(global.maxconn, capped by global.rlimit_memmax, ideal_maxconn);
It also caps the provided global.rlimit_nofile, if it couldn't be set as the
current process fd limit (see more details in the main() code).
So, let's set the default value for fd_hard_limit only when no other
haproxy-specific limit is provided, i.e. rlimit_memmax, maxconn,
rlimit_nofile. Otherwise we may break users' configs.
Please note that in master-worker mode, the master does not need the
DEFAULT_MAXFD (1048576) either, as we explicitly limit its maxconn to 100.
Must be backported in all stable versions until v2.6.0, including v2.6.0,
like the commit above.
Amaury Denoyelle [Thu, 20 Jun 2024 15:54:33 +0000 (17:54 +0200)]
MINOR: quic: rename "ssl error" trace
The SSL status is reported via a trace each time quic_conn_io_cb()
finishes. Change the trace label from "ssl error" to "ssl status". This
makes it easier to search for real errors without being distracted by this
trace.
BUG/MEDIUM: peers: Fix crash when syncing learn state of a peer without appctx
For a given peer, the synchronization of the learn state is no longer
performed in the peer appctx. It is delayed to be handled by the peers sync
task. It means that for a given peer, it is possible to have finished
learning and only handle it after the appctx release. So the synchronization
may happen on a peer without an appctx.
This was not tested, and an unconditional wakeup of the appctx could lead to
a crash because of a NULL dereference. It may be experienced by running the
reg-tests/peers/tls_basic_sync.vtc script in a loop. The fix is obvious: in
sync_peer_learn_state(), we must not wake up the appctx if it was already
released.
This patch should fix issue #2629. It must be backported to 3.0.
BUG/MEDIUM: quic: prevent crash on accept queue full
Handshakes for quic_conn instances run on a single, non-chosen thread. On
completion, listener_accept() is performed to select the least loaded
thread before initializing the connection instance. As such, the quic_conn
instance is migrated to that thread along with its upper connection.
In case the accept queue is full, listener_accept() falls back to the local
accept mode, which causes the connection to be assigned to the current
thread. However, this is not supported by QUIC as the quic_conn instance is
left on the previously selected thread. In most cases, this will cause a
BUG_ON() due to a task manipulation from an outside thread.
To fix this, handle quic_conn thread rebinding in multiple steps using the
new extended protocol API. Several operations have been moved from
qc_set_tid_affinity1() to the newly defined qc_set_tid_affinity2(), in
particular the CID TID update. This ensures that the quic_conn instance is
not prematurely accessed on the new thread until the accept queue push is
guaranteed to succeed.
qc_reset_tid_affinity() is also newly defined to reassign the newly
created tasks and tasklets to the current thread. This is necessary to
prevent the BUG_ON() crash described above.
This must be backported up to 2.8 after a period of observation. Note
that it depends on the previous patch:
MINOR: proto: extend connection thread rebind API
MINOR: listener: define callback for accept queue push
Extend the connection thread rebind API by replacing the single callback
set_affinity with three different ones. Each of them is used at a
different stage of the operation:
* set_affinity1 is used similarly to the previous set_affinity
* set_affinity2 is called directly from accept_queue_push_mp() when an
entry has been found in the accept ring. This operation cannot fail.
* reset_affinity is called after set_affinity1 in case of failure from
accept_queue_push_mp() due to no space left in the accept ring. This is
necessary for protocols which must reconfigure resources before falling
back on the current tid.
This patch does not introduce any functional change. However, it will be
required to fix crashes for QUIC connections when the accept queue ring is
full. As such, it must be backported with that fix.
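A sketch of the shape this gives the protocol layer (the member names come
from this commit; the prototypes are assumed, not copied from proto.h):

    struct connection;

    /* Thread rebind callbacks invoked at different stages of
     * listener_accept() (prototypes assumed for illustration). */
    struct proto_rebind_sketch {
            int  (*set_affinity1)(struct connection *conn, int new_tid);
            void (*set_affinity2)(struct connection *conn);  /* cannot fail */
            void (*reset_affinity)(struct connection *conn); /* rollback    */
    };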
Let's update the maxconn keyword description in order to make it clear which
setting has precedence over global.maxconn and SYSTEM_MAXCONN, if set.
MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
Let's provide a default value for fd_hard_limit if it's not set in the
configuration. With this patch we can also set a specific default via the
compile-time variable DEFAULT_MAXFD. Hopefully this will be helpful for
haproxy package maintainers.
make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000
If haproxy is compiled without DEFAULT_MAXFD defined, the default will be
set to 1048576.
This is done to avoid the process being killed by its watchdog when it was
started without any limitations in its configuration or on the command line
while the hard RLIMIT_NOFILE is extremely huge (~1000000000). In that case
we use compute_ideal_maxconn() to calculate maxconn and maxsock; maxsock
defines the size of the internal fdtab, which becomes very large as well.
When the process starts to simply loop over this fdtab (O(n)), this takes a
lot of time, so the watchdog does its job.
To avoid this, maxconn is now always reduced to some reasonable value,
either by an explicit global.fd-hard-limit from the configuration, or by its
default. The default may be changed at build time and then overridden by
global.fd-hard-limit at runtime. The explicit global.fd-hard-limit from the
configuration always has precedence over DEFAULT_MAXFD, if set.
Must be backported in all stable versions until v2.6.0, including v2.6.0.
MINOR: quic: ensure quic_conn is never removed on thread affinity rebind
On accept, a quic_conn instance is migrated from its original thread to a
new one. This operation is conducted in two steps, first on the original
thread, then on the new one. During the interval, the quic_conn is
artificially rendered inactive. It must never be accessed nor removed until
the migration is completed via qc_finalize_affinity_rebind(). This new
BUG_ON() will enforce that removal is never conducted until the migration is
completed.
QUIC datagram dispatch is an error-prone operation as it must always
ensure the correct thread is used before accessing the recipient
quic_conn instance. Strengthen this code part by adding two BUG_ON_HOT()
to ensure thread safety.
Amaury Denoyelle [Thu, 27 Jun 2024 16:50:18 +0000 (18:50 +0200)]
REORG: quic: remove quic_cid_trees reference from proto_quic
The previous commit removed access/manipulation of the QUIC CID global tree
outside of the quic_cid module. This ensures that proper locking is always
performed.
This commit finalizes this cleanup by marking the CID global tree as static,
local to the quic_cid source file. Initialization of this tree is removed
from proto_quic and is now performed by a dedicated initcall,
quic_alloc_global_cid_tree().
As a side change, the CID global tree documentation is completed, in
particular to explain the tree's artificial splitting and ODCID handling.
Overall, the code is now clearer and safer.
Amaury Denoyelle [Thu, 27 Jun 2024 16:08:54 +0000 (18:08 +0200)]
MINOR: quic: remove access to CID global tree outside of quic_cid module
haproxy generates a set of CIDs for each QUIC connection. The peer must
reuse them as DCIDs for its emitted packets. On datagram reception, the DCID
field serves as an identifier to dispatch them on the correct thread.
These CIDs are stored in a global CID tree. Access to this data
structure must always be protected with the CID lock. This commit is a
refactoring to regroup all CID tree accesses in the quic_cid module. Several
code parts are adjusted:
* quic_cid_insert() is extended to check for an insertion race condition.
This is useful on quic_conn instantiation. Code where such a race cannot
happen can use the unsafe _quic_cid_insert() instead.
* on RETIRE_CONNECTION_ID frame reception, the existing quic_cid_delete()
function is used.
* the tree lookup is removed from qc_check_dcid() and extracted into the new
quic_cmp_cid_conn() function. Ultimately, the latter should be removed
as the CID lookup could be conducted on the quic_conn's own tree without
locking.
BUG/MINOR: jwt: don't try to load files with HMAC algorithm
When trying to use an HMAC algorithm (HS256, HS384, HS512), the
sample_conv_jwt_verify_check() function of the converter tries to load a
file even though the parameter is only supposed to contain a secret, not a
path.
When using lua, the check function is called at runtime, so it even tries
to load the file at each call... This fixes the issue for HMAC algorithms,
but this is still a problem with the other algorithms, since we don't
have a way of pre-loading files before the call.
Another solution must be found to prevent disk IO with lua using the other
algorithms.
BUG/MEDIUM: server: fix race on server_atomic_sync()
The following patch fixed a race condition during server addr/port
updates: cd994407a9545a8d84e410dc0cc18c30966b70d8
BUG/MAJOR: server/addr: fix a race during server addr:svc_port updates
The new update mechanism is implemented via an update event. It uses
thread isolation to guarantee that no other thread is accessing the server
addr/port. Furthermore, to ensure the server instance is not deleted just
before the event handler runs, the server instance is looked up via its ID
in the proxy tree.
However, thread isolation is only entered after the server lookup. This
leaves a tiny race condition as the thread will be marked as harmless
and a concurrent thread can delete the server in the meantime. This
causes server_atomic_sync() to manipulate a deleted server instance to
reinsert it in the used_server_addr backend tree. This can cause a segfault
during this operation, or possibly on a future used_server_addr tree
access.
This issue was detected by Criteo. Several backtraces were retrieved,
each related to a server addr_node insert or delete operation, either in
srv_set_addr_desc() or in the add/delete dynamic server handlers.
To fix this, simply extend the thread isolation section to start before the
server lookup. This ensures that once retrieved, the server cannot be
deleted until its addr/port are updated. To ensure this issue won't
happen anymore, a new BUG_ON() is added in srv_set_addr_desc().
Also note that ebpt_delete() is now called every time in the delete handler,
as this is a safe idempotent operation.
To reproduce these crashes, a script was executed to add then remove
different servers every second. In parallel, the following CLI command
was issued repeatedly, without any delay, to force multiple updates of the
servers' port:
set server <srv> addr 0.0.0.0 port $((1024 + RANDOM % 1024))
This must be backported at least up to 3.0. If the above-mentioned patch
has been selected for previous versions, this commit must also be
backported to them.
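The essence of the fix is an ordering change, sketched below
(thread_isolate() and thread_release() are HAProxy's real isolation
primitives; the lookup and update helpers here are hypothetical):

    /* Before: the server could be deleted between lookup and isolation. */

    /* After: isolate first, then look up and update. */
    thread_isolate();
    srv = lookup_server_by_id(px, id);                /* hypothetical */
    if (srv)
            update_server_addr_port(srv, addr, port); /* hypothetical */
    thread_release();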
BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work with applet's buffers
In 3.0, the CLI applet was rewritten to use its own buffers. However, the
lua part, used to register CLI commands at runtime, was not updated
accordingly. It means the lua CLI commands still try to write into the
channel buffers. This is of course totally unexpected and not supported.
Because of this bug, the applet hangs instead of returning the command
result.
The registration of lua CLI commands relies on the lua TCP applets. So the
send and receive functions were fixed to use the applet's buffers when
required and still use the channel buffers otherwise. This way, other lua
TCP applets can still run in the legacy mode, without the applet's buffers.
BUG/MINOR: promex: Remove Help prefix repeated twice for each metric
When the support for modules was added, the function producing the #HELP
line of each metric was refactored. Since then, the prefix "#HELP
<metric-name>" is printed twice because a code block was not removed. It is
now fixed.
Willy Tarreau [Sun, 30 Jun 2024 04:23:30 +0000 (06:23 +0200)]
BUG/MEDIUM: quic: fix possible exit from qc_check_dcid() without unlocking
Locking of the CID tree was extended in qc_check_dcid() by the recent commit
05f59a5 ("BUG/MINOR: quic: fix race condition in qc_check_dcid()"), but
there was a direct return from the middle of the function which was not
covered by the unlock, resulting in the function keeping the lock on
success return.
Let's just remove this return and replace it with a variable to merge all
exit paths.
This must be backported wherever the fix above is backported.
BUG/MINOR: quic: Wrong datagram building when probing.
This issue was revealed by the chacha20 interop test which very often fails
with ngtcp2 as client. This was due to the fact that 2 application-level
packets could be coalesced into the same datagram, as revealed by such a
capture:
On the peer side these two packets are considered as a single one,
because there may be only one packet per datagram at the application
encryption level, and this is reported as a STREAM frame encoding error:
Note here that the sum of the two packet sizes (from the capture) is the
same as the packet length reported by ngtcp2: 187+38 = 225. It also seems
that wireshark tries to parse as many packets as possible in the same
datagram, regardless of the QUIC protocol rules.
Haproxy traces revealed that this could happen at least when probing the
peer. The aim of the recent low-level packet building modifications was to
build as many datagrams as possible into the same buffer. But it seems that
the probing packet case treatment was broken. That said, I have not
identified the impacted commit. This issue could be reproduced inside the
interop test environment (no possible git bisection).
To fix this, rely on the <probe> variable value to identify whether the last
packet built by qc_prep_pkts() was a probing one, then try to coalesce some
other packets into the same datagram only if this was not the case.
Of course the test on the <probe> value has to be done before setting it
for the next packet.
Willy Tarreau [Sat, 29 Jun 2024 09:28:41 +0000 (11:28 +0200)]
[RELEASE] Released version 3.1-dev2
Released version 3.1-dev2 with the following main changes :
- BUG/MINOR: log: fix broken '+bin' logformat node option
- DEBUG: hlua: distinguish burst timeout errors from exec timeout errors
- REGTESTS: ssl: fix some regtests 'feature cmd' start condition
- BUG/MEDIUM: ssl: AWS-LC + TLSv1.3 won't do ECDSA in RSA+ECDSA configuration
- MINOR: ssl: activate sigalgs feature for AWS-LC
- REGTESTS: ssl: activate new SSL reg-tests with AWS-LC
- BUG/MEDIUM: proxy: fix email-alert invalid free
- REORG: mailers: move free_email_alert() to mailers.c
- BUG/MINOR: proxy: fix email-alert leak on deinit() (2nd try)
- DOC: configuration: fix alphabetical order of bind options
- DOC: management: document ptr lookup for table commands
- BUG/MAJOR: quic: fix padding with short packets
- BUG/MAJOR: quic: do not loop on emission on closing/draining state
- MINOR: sample: date converter takes HTTP date and output an UNIX timestamp
- SCRIPTS: git-show-backports: do not truncate git-show output
- DOC: api/event_hdl: small updates, fix an example and add some precisions
- BUG/MINOR: h3: fix crash on STOP_SENDING receive after GOAWAY emission
- BUG/MINOR: mux-quic: fix crash on qcs SD alloc failure
- BUG/MINOR: h3: fix BUG_ON() crash on control stream alloc failure
- BUG/MINOR: quic: fix BUG_ON() on Tx pkt alloc failure
- DEV: flags/show-fd-to-flags: adapt to recent versions
- MINOR: capabilities: export capget and __user_cap_header_struct
- MINOR: capabilities: prepare support for version 3
- MINOR: capabilities: use _LINUX_CAPABILITY_VERSION_3
- MINOR: cli/debug: show dev: add cmdline and version
- MINOR: cli/debug: show dev: show capabilities
- MINOR: debug: print gdb hints when crashing
- BUILD: debug: also declare strlen() in __ABORT_NOW()
- BUILD: Missing inclusion header for ssize_t type
- BUG/MINOR: hlua: report proper context upon error in hlua_cli_io_handler_fct()
- MINOR: cfgparse/log: remove leftover dead code
- BUG/MEDIUM: stick-table: Decrement the ref count inside lock to kill a session
- MINOR: stick-table: Always decrement ref count before killing a session
- REORG: init: do MODE_CHECK_CONDITION logic first
- REORG: init: encapsulate CHECK_CONDITION logic in a func
- REORG: init: encapsulate 'reload' sockpair and master CLI listeners creation
- REORG: init: encapsulate code that reads cfg files
- BUG/MINOR: server: fix first server template name lookup UAF
- MINOR: activity: make the memory profiling hash size configurable at build time
- BUG/MEDIUM: server/dns: prevent DOWN/UP flap upon resolution timeout or error
- BUG/MEDIUM: h3: ensure the ":method" pseudo header is totally valid
- BUG/MEDIUM: h3: ensure the ":scheme" pseudo header is totally valid
- BUG/MEDIUM: quic: fix race-condition in quic_get_cid_tid()
- BUG/MINOR: quic: fix race condition in qc_check_dcid()
- BUG/MINOR: quic: fix race-condition on trace for CID retrieval
Amaury Denoyelle [Thu, 27 Jun 2024 16:52:23 +0000 (18:52 +0200)]
BUG/MINOR: quic: fix race-condition on trace for CID retrieval
quic_rx_pkt_retrieve_conn() is used when parsing a datagram received on
the listener socket. It returns the quic_conn instance corresponding to the
first packet's DCID, unless it is mapped to another thread.
As expected, global CID tree access is protected by a lock in the
function. However, there is a race condition due to the final trace
where the qc instance is dereferenced outside of the lock. Fix this by
adding a new trace under lock protection and removing the qc dereference at
the function's end.
Amaury Denoyelle [Thu, 27 Jun 2024 16:15:08 +0000 (18:15 +0200)]
BUG/MINOR: quic: fix race condition in qc_check_dcid()
qc_check_dcid() is a function which checks that a DCID is associated with
the expected quic_conn instance. This is used by the quic_conn socket
receive handler, as there is a tiny risk that a datagram for another
connection was received on this socket.
As with other operations on the global CID tree, a lock must be used to
protect against race conditions. However, as in the previous commit, the
lock was not held long enough, as the CID tree node is accessed outside of
the lock region. To fix this, extend the critical section until the CID
dereference is done.
The impact of this bug should be similar to the previous one. However, the
risk of a crash is even lower, as it should be extremely rare to receive
datagrams for other connections on a quic_conn socket. As such, most of the
time the first check condition of qc_check_dcid() is enough.
Amaury Denoyelle [Thu, 27 Jun 2024 15:15:55 +0000 (17:15 +0200)]
BUG/MEDIUM: quic: fix race-condition in quic_get_cid_tid()
haproxy generates CIDs for clients, which reuse them as DCIDs on their
packets. These CIDs are stored in the global tree quic_cid_trees. Each
operation on this tree must be done under lock protection.
quic_get_cid_tid() is a function which looks up a CID in the global tree and
returns the associated thread ID. This is used on datagram reception on the
listener socket, before redispatching the datagram to the correct thread.
This function uses a lock to protect quic_cid_trees access. However, the
lock region is too small, as the CID tree node is accessed outside of it.
Fix this by extending the lock protection over the CID dereference until the
thread ID is retrieved.
The impact of this bug is unknown, but it may possibly cause crashes.
However, it is probably rare, as most datagram reception is done on the
quic_conn socket, which does not use quic_get_cid_tid().
Amaury Denoyelle [Fri, 28 Jun 2024 08:50:19 +0000 (10:50 +0200)]
BUG/MEDIUM: h3: ensure the ":scheme" pseudo header is totally valid
Ensure the pseudo-header scheme is only constituted of valid characters
according to RFC 9110. If an invalid value is found, the request is
rejected and the stream is reset.
It's the same as for the previous commit "BUG/MEDIUM: h3: ensure the
":method" pseudo header is totally valid", except that this time it
applies to the ":scheme" pseudo-header.
Amaury Denoyelle [Fri, 28 Jun 2024 08:43:19 +0000 (10:43 +0200)]
BUG/MEDIUM: h3: ensure the ":method" pseudo header is totally valid
Ensure the pseudo-header method is only constituted of valid characters
according to RFC 9110. If an invalid value is found, the request is
rejected and the stream is reset.
Previously only characters forbidden in headers were rejected (NUL/CR/LF),
but this is insufficient for :method, where some other forbidden chars
might be used to trick a non-compliant backend server into seeing a
different path from the one seen by haproxy. Note that header injection
is not possible though.
This must be backported up to 2.6.
Many thanks to Yuki Mogi of FFRI Security Inc for the detailed report
that allowed us to quickly spot, confirm and fix the problem.
BUG/MEDIUM: server/dns: prevent DOWN/UP flap upon resolution timeout or error
This is a complementary patch to c16eba818 ("BUG/MEDIUM: server/dns:
preserve server's port upon resolution timeout or error").
Indeed, since c16eba818, the port is properly preserved, but unsetting the
server's address this way results in the server_atomic_sync() function
thinking that we're actually setting a new address, and not unsetting the
previous one, because the addr family is != AF_UNSPEC.
Upon DNS timeout, this could be observed:
[WARNING] (2588257) : Server http/s1 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (2588257) : Server http/s1 ('test1.localhost') is UP/READY (resolves again).
Notice that the server times out and then immediately resolves again. Of
course, in this case the server's address was properly set to 0, meaning
that the server will not receive any traffic, but it is confusing and
could result in haproxy temporarily thinking that the server is actually
available while it's not.
To properly fix the issue and restore the historical behavior, let's
explicitly set the inetaddr's family to AF_UNSPEC after fetching the
original server's address.
Willy Tarreau [Thu, 27 Jun 2024 15:53:18 +0000 (17:53 +0200)]
MINOR: activity: make the memory profiling hash size configurable at build time
The MEMPROF_HASH_BITS variable was set to 10 without a possibility to
change it (beyond patching the code). After seeing a few reports already
with "other" being listed, and a list with close to 1024 entries, it looks
like it's about time to either increase the hash size or at least make
it configurable for special cases. As a reminder, in order to remain
fast, the algorithm searches no more than 16 places after the hash, so
when a table is almost full, searches are long and new places are rare.
The present patch just makes it possible to redefine it by passing
"-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and
moves the definition to defaults.h to make it easier to find. Such
values should be more than sufficient for the vast majority of use cases.
Maybe in the future we'll change the default. At least this version
should be backported to ease rebuilds, say, till 2.8 or so.
BUG/MINOR: server: fix first server template name lookup UAF
This is a follow-up for 7223296 ("BUG/MINOR: server: fix first server
template not being indexed").
Indeed, in 7223296 we added a new call to _srv_parse_set_id_from_prefix()
for the first server before handling additional ones. But we actually
overlooked the fact that _srv_parse_set_id_from_prefix() was already
performed at the end of _srv_parse_tmpl_init() for the same server.
Since _srv_parse_set_id_from_prefix() frees srv->id, it results in UAF
when performing name lookups on the first server, because used_server_name
node key still uses the freed string pointer.
The early _srv_parse_set_id_from_prefix() call (added in 7223296) and
the original one perform the same task, except that the new one is
followed by name node insertion logic required for name lookups to work
properly. So let's simply get rid of the old one at the end of the
function.
The _srv_parse_set_id_from_prefix() call in the 'err:' label was also
removed, since it is now useless starting with 7223296 and would trigger the
same bug on error paths. Thanks to Amaury for noticing it.
This bug was discovered while trying to address GH issue #2620.
Thanks to @x-yuri for his detailed report (with working repro).
REORG: init: encapsulate code that reads cfg files
The haproxy master process should not read its configuration a second time
after performing reexec and passing to MODE_MWORKER_WAIT. So, to make
this part of the init() function more readable and to better mark the
point where configs have been read, let's encapsulate it in a separate
function.
REORG: init: encapsulate 'reload' sockpair and master CLI listeners creation
Let's encapsulate the logic of the 'reload' sockpair and master CLI
listeners creation, used by the master CLI, into a separate function, as we
need this only in master-worker runtime mode. This makes the code of init()
more readable.
REORG: init: encapsulate CHECK_CONDITION logic in a func
As the MODE_CHECK_CONDITION logic terminates the process anyway, no matter
if the test for the provided condition was successful or not, let's
encapsulate it in a separate function. This makes the code of init() more
readable.
In MODE_CHECK_CONDITION we only parse the check_condition string, provided
by '-cc', and then we evaluate it. The haproxy process terminates at the
end of the {if..else} block anyway, whether the test has failed or passed.
So, it is more appropriate to perform the MODE_CHECK_CONDITION test first
and then do all the other process runtime mode verifications.
MINOR: stick-table: Always decrement ref count before killing a session
The guarded functions to kill a sticky session, stksess_kill() and
stksess_kill_if_expired(), may or may not decrement and test its reference
counter before really killing it. This depends on a parameter. If it is set
to a non-zero value, the ref count is decremented and if it falls to zero,
the session is killed. Otherwise, if this parameter is equal to zero, the
session is killed regardless of the ref count value.
In the code, these functions are always called with a non-zero parameter and
the ref count is always decremented and tested. So, there is no reason to
still have a special case. Especially because it is not really easy to say
whether it is supported or not. Does it mean it is possible to kill a sticky
session while it is still referenced somewhere? Probably not. So, does it
mean it is possible to kill an unreferenced session? This case may be
problematic because the session is accessed outside of any lock and thus may
be released by another thread because it is unreferenced. Enlarging the
scope of the lock to avoid any issue is possible, but it would be a bit of a
shame to do so because there is no usage for now.
The best is to simplify the API and remove this case. Now, the
stksess_kill() and stksess_kill_if_expired() functions always decrement and
test the ref count before killing a sticky session.
BUG/MEDIUM: stick-table: Decrement the ref count inside lock to kill a session
When we try to kill a session, the shard must be locked before decrementing
the ref count on the session. Otherwise, the ref count can fall to 0 and a
purge task (stktable_trash_oldest or process_table_expire) may release the
session before we have the opportunity to acquire the lock on the shard to
effectively kill the session. This could lead to a double free.
Here is the scenario:

    Thread 1                                Thread 2

  stksess_kill(ts)
    if (ATOMIC_DEC(&ts->ref_cnt) != 0)
        return
    /* here the ref count is 0 */
                                         stktable_trash_oldest()
                                           LOCK(&sh_lock)
                                           if (!ATOMIC_LOAD(&ts->ref_cnt))
                                               __stksess_free(ts)
                                           UNLOCK(&sh_lock)
    /* here the session was released */
    LOCK(&sh_lock)
    __stksess_free(ts)   <--- double free
    UNLOCK(&sh_lock)
The bug was introduced in 2.9 by the commit 7968fe3889 ("MEDIUM:
stick-table: change the ref_cnt atomically"). The ref count must be
decremented inside the lock for the stksess_kill() and
stksess_kill_if_expired() functions.
This patch should fix the issue #2611. It must be backported as far as 2.9.
In 2.9, there is no sharding and the whole table is locked, so the patch
will have to be adapted.
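A self-contained sketch of the fixed ordering (generic C11 code, not
HAProxy's stick-table implementation):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    struct sess {
            atomic_uint ref_cnt;
            /* ... payload ... */
    };

    /* Take the shard lock *before* dropping the last reference, so a
     * concurrent purge task cannot free the session in between. */
    void sess_kill_sketch(struct sess *ts, pthread_mutex_t *sh_lock)
    {
            pthread_mutex_lock(sh_lock);
            if (atomic_fetch_sub(&ts->ref_cnt, 1) == 1)  /* last reference */
                    free(ts);                            /* safe under lock */
            pthread_mutex_unlock(sh_lock);
    }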
Remove a development leftover introduced by commit 15e9c7da6 ("MINOR: log:
add log-profile parsing logic").
Indeed, since the "log-profile" section keyword is registered via the
REGISTER_CONFIG_SECTION() macro, it is not relevant to declare it in
common_kw_list[] from cfgparse-global.c. All it does is confuse the user by
suggesting the use of "log-profile" inside a global section when trying to
find a best match in cfg_parse_global().
In file included from src/ncbuf.c:19:
src/ncbuf.c: In function 'ncb_write_off':
include/haproxy/bug.h:144:10: error: unknown type name 'ssize_t'
144 | extern ssize_t write(int, const void *, size_t); \
Willy Tarreau [Wed, 26 Jun 2024 06:02:09 +0000 (08:02 +0200)]
BUILD: debug: also declare strlen() in __ABORT_NOW()
The previous commit 8f204fa8ae ("MINOR: debug: print gdb hints when
crashing") broke on the CI where strlen() isn't known. Let's forward-declare
it in the __ABORT_NOW() functions, just like write(). No backport is needed.
Willy Tarreau [Fri, 21 Jun 2024 12:04:46 +0000 (14:04 +0200)]
MINOR: debug: print gdb hints when crashing
To make bug reporting easier for users, let's suggest what to do when
crashing. Typically when a BUG_ON() matches, only the current thread
is useful the vast majority of the time, while when the watchdog
triggers, all threads are interesting.
The messages are printed at the end, after the dump. We may adjust these
with wiki links in the future if more detailed instructions are relevant.
If haproxy is compiled with Linux capabilities support, let's show the
process capabilities before applying the configuration, and at runtime in
the 'show dev' command output. This may be useful for debugging purposes,
especially in cases when the process changes its UID and GID to
non-privileged ones, or when it was started under a non-privileged UID and
the needed capabilities are set by the admin on the haproxy binary.
MINOR: cli/debug: show dev: add cmdline and version
The 'show dev' command is very convenient to obtain haproxy debugging
information while the process runs in a container. Let's extend its output
with the version and the cmdline. The cmdline is useful as it shows the
absolute binary path and its arguments, because sometimes the person who is
debugging a failing container is not the one who created and deployed it.
argc and argv are stored in the exported global structure, because
feed_post_mortem() is added as a post-check function callback in the
post_check_list. So we can't simply change the signature of
feed_post_mortem() without breaking the other post-check callback APIs.
Parsers are not supposed to modify argv, so we can safely pass its pointer
to debug_parse_cli_show_dev() without copying all the argument strings
somewhere on the heap or on the stack.
MINOR: capabilities: use _LINUX_CAPABILITY_VERSION_3
The Linux kernel shows the warning below when _LINUX_CAPABILITY_VERSION_1 is
used in capset() and capget().
[1710243.523230] capability: warning: `haproxy' uses 32-bit capabilities (legacy support in use)
This triggers questions from users. The warning is shown by the kernel
because, since Linux 2.6.25, 64-bit capabilities support was introduced with
_LINUX_CAPABILITY_VERSION_2, in order to be able to continuously extend the
capabilities list with new ones.
We can't use _LINUX_CAPABILITY_VERSION_2, because this version triggers
another warning, according to linux/kernel/capability.c (see also more
details about it in the comments from the kernel sources and in man
capset(2)).
if (get_user(version, &header->version))
return -EFAULT;
switch (version) {
case _LINUX_CAPABILITY_VERSION_1:
warn_legacy_capability_use();
*tocopy = _LINUX_CAPABILITY_U32S_1;
break;
case _LINUX_CAPABILITY_VERSION_2:
warn_deprecated_v2();
fallthrough; /* v3 is otherwise equivalent to v2 */
case _LINUX_CAPABILITY_VERSION_3:
*tocopy = _LINUX_CAPABILITY_U32S_3;
break;
default:
...
So, to avoid any warnings, let's use _LINUX_CAPABILITY_VERSION_3, which,
according to the comments in linux/kernel/capability.c, has the same
functionality as _LINUX_CAPABILITY_VERSION_2 (i.e. an array of 2
__user_cap_data_struct with 32-bit integers for each capability set), but
came in Linux 2.6.26 with a header change, in order to protect legacy
source code.
For the moment, we don't authorize capabilities higher than CAP_SYS_ADMIN
(the 21st bit), so we always check the "low" 32 bits, i.e.
__user_cap_data_struct[0].
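For illustration, here is a self-contained sketch of querying capabilities
with the v3 interface through the raw syscall, examining only data[0] as
described above:

    #include <linux/capability.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Returns the low 32 effective capability bits, or 0 on error. */
    static unsigned int get_eff_caps_v3(void)
    {
            struct __user_cap_header_struct hdr = {
                    .version = _LINUX_CAPABILITY_VERSION_3,
                    .pid = 0,                    /* current process */
            };
            struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

            if (syscall(SYS_capget, &hdr, data) == -1)
                    return 0;
            return data[0].effective;            /* low 32 bits only */
    }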
MINOR: capabilities: prepare support for version 3
Commit e338d263a76a ("Add 64-bit capability support to the kernel")
introduced _LINUX_CAPABILITY_VERSION_1 and _LINUX_CAPABILITY_VERSION_2 in
the kernel, along with their corresponding magic numbers "1"
(_LINUX_CAPABILITY_U32S_1) and "2" (_LINUX_CAPABILITY_U32S_2).
Since this commit, capability sets are composed as arrays of
__user_cap_data_struct, with the length defined by the version's magic
number (e.g. struct __user_cap_data_struct kdata[_LINUX_CAPABILITY_U32S_1]).
These magic numbers also help the kernel figure out how much data
(in __user_cap_data_struct "units") it needs to copy_from/to_user in the
capset/capget syscalls.
In order to use _LINUX_CAPABILITY_VERSION_3 in the next commit (it has the
same functionality as version 2), let's follow the kernel code and allocate
memory to store 32 capabilities as an array of __user_cap_data_struct with a
length of 1 (_LINUX_CAPABILITY_U32S_1).
MINOR: capabilities: export capget and __user_cap_header_struct
To be able to show process capabilities before applying its configuration
and also at runtime in the 'show dev' command output, we need to export the
wrapper around the capget() syscall. It also seems more handy to place
__user_cap_header_struct in the .data section and declare it as globally
accessible, as we always fill it with the same values. This avoids
allocating and filling these 8 bytes on the stack frame each time the
capget() or capset() wrappers are called.
Willy Tarreau [Mon, 24 Jun 2024 09:05:08 +0000 (11:05 +0200)]
DEV: flags/show-fd-to-flags: adapt to recent versions
The script hadn't been updated since it was introduced, and the
hard-coded field 12 doesn't match anymore (it's 16 now). Let's just
use "grep -o cflg..." to extract the desired part more flexibly.
This can be backported at least to 3.0, probably further, but it
will need to be tested prior to that. Better not bring it too far;
it's only used when debugging.
Amaury Denoyelle [Thu, 20 Jun 2024 15:54:04 +0000 (17:54 +0200)]
BUG/MINOR: quic: fix BUG_ON() on Tx pkt alloc failure
On quic_tx_packet allocation failure, it is possible to trigger a BUG_ON()
crash on INITIAL packet building. This statement is responsible for
ensuring that INITIAL packets are padded to 1200 bytes as required. If
allocation of a packet at a higher encryption level fails, the PADDING frame
cannot be properly encoded, despite the INITIAL packet being properly built.
This crash happens due to the qc_txb_store() invocation after a
quic_tx_packet allocation failure, to validate already built packets.
However, this statement is unneeded, as qc_purge_tx_buf() is called just
after. Simply remove qc_txb_store() to fix this issue.
Amaury Denoyelle [Thu, 20 Jun 2024 15:51:35 +0000 (17:51 +0200)]
BUG/MINOR: h3: fix BUG_ON() crash on control stream alloc failure
The BUG_ON() from qcc_set_error() is triggered on HTTP/3 control stream
allocation failure. This is because both h3_finalize() and
qcc_init_stream_local() call qcc_set_error(), which is forbidden to
prevent error code erasure.
Fix this by removing the qcc_set_error() invocation from h3_finalize() on
allocation failure. Note that this function is still responsible for using
it on SETTINGS frame emission failure.