git.ipfire.org Git - thirdparty/haproxy.git/log

MEDIUM: listener: rework thread assignment to consider all groups

Till now threads were assigned in listener_accept() to other threads of
the same group only, using a single group mask. Now that we have all the
relevant info (array of listeners of the same shard), we can spread the
thr_idx to cover all assigned groups. The thread indexes now contain the
group number in their upper bits, and the indexes run over te whole list
of threads, all groups included.

One particular subtlety here is that switching to a thread from another
group also means switching the group, hence the listener. As such, when
changing the group we need to update the connection's owner to point to
the listener of the same shard that is bound to the target group.

MINOR: listener: make accept_queue index atomic

There has always been a race when checking the length of an accept queue
to determine which one is more loaded that another, because the head and
tail are read at two different moments. This is not required, we can merge
them as two 16 bit numbers inside a single 32-bit index that is always
accessed atomically. This way we read both values at once and always have
a consistent measurement.

MEDIUM: config: permit to start a bind on multiple groups at once

Now it's possible for a bind line to span multiple thread groups. When
this happens, the first one will become the reference and will be entirely
set up, and the subsequent ones will be duplicated from this reference,
so that they can be registered in distinct groups. The reference is
always setup and started first so it is always available when the other
ones are started.

The doc was updated to reflect this new possibility with its limitations
and impacts, and the differences with the "shards" option.

MINOR: proto: skip socket setup for duped FDs

It's not strictly necessary, but it's still better to avoid setting
up the same socket multiple times when it's being duplicated to a few
FDs. We don't change that for inherited ones however since they may
really need to be set up, so we only skip duplicated ones.

MEDIUM: proto: duplicate receivers marked RX_F_MUST_DUP

The different protocol's ->bind() function will now check the receiver's
RX_F_MUST_DUP flag to decide whether to bind a fresh new listener from
scratch or reuse an existing one and just duplicate it. It turns out
that the existing code already supports reusing FDs since that was done
as part of the FD passing and inheriting mechanism. Here it's not much
different, we pass the FD of the reference receiver, it gets duplicated
and becomes the new receiver's FD.

These FDs are also marked RX_F_INHERITED so that they are not exported
and avoid being touched directly (only the reference should be touched).

MINOR: receiver: add RX_F_MUST_DUP to indicate that an rx must be duped

The purpose of this new flag will be to mark that some listeners
duplicate their reference's FD instead of trying to setup a completely
new listener from scratch. This will be used when multiple groups want
to listen to the same socket, via multiple FDs.

MINOR: receiver: add a struct shard_info to store info about each shard

In order to create multiple receivers for one multi-group shard, we'll
need some more info about the shard. Here we store:
  - the number of groups (= number of receivers)
  - the number of threads (will be used for accept LB)
  - pointer to the reference rx (to get the FD and to find all threads)
  - pointers to the other members (to iterate over all threads)

For now since there's only one group per shard it remains simple. The
listener deletion code already takes care of removing the current
member from its shards list and moving others' reference to the last
one if it was their reference (so as to avoid o(n^2) updates during
ordered deletes).

Since the vast majority of setups will not use multi-group shards, we
try to save memory usage by only allocating the shard_info when it is
needed, so the principle here is that a receiver shard_info==NULL is
alone and doesn't share its socket with another group.

Various approaches were considered and tests show that the management
of the listeners during boot makes it easier to just attach to or
detach from a shard_info and automatically allocate it if it does not
exist, which is what is being done here.

For now the attach code is not called, but detach is already called
on delete.

MINOR: listener: support another thread dispatch mode: "fair"

This new algorithm for rebalancing incoming connections to multiple
threads is simpler and instead of considering the threads load, it will
only cycle through all of them, offering a fair share of the traffic to
each thread. It may be well suited for short-lived connections but is
also convenient for very large thread counts where it's not always certain
that the least loaded thread will always be found.

MINOR: quic_sock: index li->per_thr[] on local thread id, not global one

There's a li_per_thread array in each listener for use with QUIC
listeners. Since thread groups were introduced, this array can be
allocated too large because global.nbthread is allocated for each
listener, while only no more than MIN(nbthread,MAX_THREADS_PER_GROUP)
may be used by a single listener. This was because the global thread
ID is used as the index instead of the local ID (since a listener may
only be used by a single group). Let's just switch to local ID and
reduce the allocated size.

MINOR: quic: support migrating the listener as well

When migrating a quic_conn to another thread, we may need to also
switch the listener if the thread belongs to another group. When
this happens, the freshly created connection will already have the
target listener, so let's just pick it from the connection and use
it in qc_set_tid_affinity(). Note that it will be the caller's
responsibility to guarantee this.

MINOR: server/event_hdl: prepare for server event data wrapper

Adding the possibility to publish an event using a struct wrapper
around existing SERVER events to provide additional contextual info.

Using the specific struct wrapper is not required: it is supported
to cast event data as a regular server event data struct so
that we don't break the existing API.

However, casting event data with a more explicit data type allows
to fetch event-only relevant hints.

MEDIUM: server: split srv_update_status() in two functions

Considering that srv_update_status() is now synchronous again since
3ff577e1 ("MAJOR: server: make server state changes synchronous again"),
and that we can easily identify if the update is from an operational
or administrative context thanks to "MINOR: server: pass adm and op cause
to srv_update_status()".

And given that administrative and operational updates cannot be cumulated
(since srv_update_status() is called synchronously and independently for
admin updates and state/operational updates, and the function directly
consumes the changes).

We split srv_update_status() in 2 distinct parts:

Either <type> is 0, meaning the update is an operational update which
is handled by directly looking at cur_state and next_state to apply the
proper transition.
Also, the check to prevent operational state from being applied
if MAINT admin flag is set is no longer needed given that the calling
functions already ensure this (ie: srv_set_{running,stopping,stopped)

Or <type> is 1, meaning the update is an administrative update, where
cur_admin and next_admin are evaluated to apply the proper transition and
deduct the resulting server state (next_state is updated implicitly).

Once this is done, both operations share a common code path in
srv_update_status() to update proxy and servers stats if required.

Thanks to this change, the function's behavior is much more predictable,
it is not an all-in-one function anymore. Either we apply an operational
change, else it is an administrative change. That's it, we cannot mix
the 2 since both code paths are now properly separated.

MINOR: server: pass adm and op cause to srv_update_status()

Operational and administrative state change causes are not propagated
through srv_update_status(), instead they are directly consumed within
the function to provide additional info during the call when required.

Thus, there is no valid reason for keeping adm and op causes within
server struct. We are wasting space and keeping uneeded complexity.

We now exlicitly pass change type (operational or administrative) and
associated cause to srv_update_status() so that no extra storage is
needed since those values are only relevant from srv_update_status().

CLEANUP: server: fix srv_set_{running, stopping, stopped} function comment

Fixing function comments for the server state changing function since they
still refer to asynchonous propagation of server state which is no longer
in play.
Moreover, there were some mixups between running/stopping.

CLEANUP: server: remove unused variables in srv_update_status()

check and px local variable aliases are not very useful.
Let's remove them and use s->check and s->proxy instead.

MINOR: server: change srv_op_st_chg_cause storage type

This one is greatly inspired by "MINOR: server: change adm_st_chg_cause storage type".

While looking at current srv_op_st_chg_cause usage, it was clear that
the struct needed some cleanup since some leftovers from asynchronous server
state change updates were left behind and resulted in some useless code
duplication, and making the whole thing harder to maintain.

Two observations were made:

- by tracking down srv_set_{running, stopped, stopping} usage,
  we can see that the <reason> argument is always a fixed statically
  allocated string.
- check-related state change context (duration, status, code...) is
  not used anymore since srv_append_status() directly extracts the
  values from the server->check. This is pure legacy from when
  the state changes were applied asynchronously.

To prevent code duplication, useless string copies and make the reason/cause
more exportable, we store it as an enum now, and we provide
srv_op_st_chg_cause() function to fetch the related description string.
HEALTH and AGENT causes (check related) are now explicitly identified to
make consumers like srv_append_op_chg_cause() able to fetch checks info
from the server itself if they need to.

MINOR: server: srv_append_status refacto

srv_append_status() has become a swiss-knife function over time.
It is used from server code and also from checks code, with various
inputs and distincts code paths, making it very hard to guess the
actual behavior of the function (resulting string output).

To simplify the logic behind it, we're dividing it in multiple contextual
functions that take simple inputs and do explicit things, making them
more predictable and easier to maintain.

MINOR: server: change adm_st_chg_cause storage type

Even though it doesn't look like it at first glance, this is more like
a cleanup than an actual code improvement:

Given that srv->adm_st_chg_cause has been used to exclusively store
static strings ever since it was implemented, we make the choice to
store it as an enum instead of a fixed-size string within server
struct.

This will allow to save some space in server struct, and will make
it more easily exportable (ie: event handlers) because of the
reduced memory footprint during handling and the ability to later get
the corresponding human-readable message when it's explicitly needed.

MINOR: server: propagate lb changes through srv_lb_propagate()

Now that we have a generic srv_lb_propagate(s) function, let's
use it each time we explicitly wan't to set the status down as
well.

Indeed, it is tricky to try to handle "down" case explicitly,
instead we use srv_lb_propagate() which will call the proper
function that will handle the new server state.

This will allow some code cleanup and will prevent any logic
error.

This commit depends on:
- "MINOR: server: propagate server state change to lb through single function"

MINOR: server: propagate server state change to lb through single function

Use a dedicated helper function to propagate server state change to
lb algorithms, since it is performed at multiple places within
srv_update_status() function.

MINOR: server: central update for server counters on state change

Based on "BUG/MINOR: server: don't miss server stats update on server
state transitions", we're also taking advantage of the new centralized
logic to update down_trans server counter directly from there instead
of multiple places.

BUG/MINOR: server: don't use date when restoring last_change from state file

When restoring from a state file: the server "Status" reports weird values on
the html stats page:

"5s UP" becomes -> "? UP" after the restore

This is due to a bug in srv_state_srv_update(): when restoring the states
from a state file, we rely on date.tv_sec to compute the process-relative
server last_change timestamp.

This is wrong because everywhere else we use now.tv_sec when dealing
with last_change, for instance in srv_update_status().

date (which is Wall clock time) deviates from now (monotonic time) in the
long run.

They should not be mixed, and given that last_change is an internal time value,
we should rely on now.tv_sec instead.

last_change export through "show servers state" cli is safe since we export
a delta and not the raw time value in dump_servers_state():

srv_time_since_last_change = now.tv_sec - srv->last_change

--

While this bug affects all stable versions, it was revealed in 2.8 thanks
to 28360dc ("MEDIUM: clock: force internal time to wrap early after boot")
This is due to the fact that "now" immediately deviates from "date",
whereas in the past they had the same value when starting.

Thus prior to 2.8 the bug is trickier since it could take some time for
date and now to deviate sufficiently for the issue to arise, and instead
of reporting absurd values that are easy to spot it could just result in
last_change becoming inconsistent over time.

As such, the fix should be backported to all stable versions.
[for 2.2 the patch needs to be applied manually since
srv_state_srv_update() was named srv_update_state() and can be found in
server.c instead of server_state.c]

BUG/MINOR: server: don't miss server stats update on server state transitions

s->last_change and s->down_time updates were manually updated for each
effective server state change within srv_update_status().

This is rather error-prone, and as a result there were still some state
transitions that were not handled properly since at least 1.8.

ie:
- when transitionning from DRAIN to READY: downtime was updated
  (which is wrong since a server in DRAIN state should not be
   considered as DOWN)
- when transitionning from MAINT to READY: downtime was not updated
  (this can be easily seen in the html stats page)

To fix these all at once, and prevent similar bugs from being introduced,
we centralize the server last_change and down_time stats logic at the end
of srv_update_status():

If the server state changed during the call, then it means that
last_change must be updated, with a special case when changing from
STOPPED state which means the server was previously DOWN and thus
downtime should be updated.

This patch depends on:

- "MINOR: server: explicitly commit state change in srv_update_status()"

This could be backported to every stable versions.

BUG/MINOR: server: don't miss proxy stats update on server state transitions

backend "down" stats logic has been duplicated multiple times in
srv_update_status(), resulting in the logic now being error-prone.

For example, the following bugfix was needed to compensate for a
copy-paste introduced bug: d332f139
("BUG/MINOR: server: update last_change on maint->ready transitions too")

While the above patch works great, we actually forgot to update the
proxy downtime like it is done for other down->up transitions...
This is simply illustrating that the current design is error-prone,
it is very easy to miss something in this area.

To properly update the proxy downtime stats on the maint->ready transition,
to cleanup srv_update_status() and to prevent similar bugs from being
introduced in the future, proxy/backend stats update are now automatically
performed at the end of the server state change if needed.

Thus we can remove existing updates that were performed at various places
within the function, this simplifies things a bit.

This patch depends on:
- "MINOR: server: explicitly commit state change in srv_update_status()"

This could be backported to all stable versions.

Backport notes:

2.2:

Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)

by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)

MINOR: server: explicitly commit state change in srv_update_status()

As shown in 8f29829 ("BUG/MEDIUM: checks: a down server going to
maint remains definitely stucked on down state."), state changes
that don't result in explicit lb state change, require us to perform
an explicit server state commit to make sure the next state is
applied before returning from the function.

This is the case for server state changes that don't trigger lb logic
and only perform some logging.

This is quite error prone, we could easily forget a state change
combination that could result in next_state, next_admin or next_eweight
not being applied. (cur_state, cur_admin and cur_eweight would be left
with unexpected values)

To fix this, we explicitly call srv_lb_commit_status() at the end
of srv_update_status() to enforce the new values, even if they were
already applied. (when a state changes requires lb state update
an implicit commit is already performed)

Applying the state change multiple times is safe (since the next value
always points to the current value).

Backport notes:

2.2:

Replace
struct task *srv_cleanup_toremove_conns(struct task *task, void *context, unsigned int state)

by
struct task *srv_cleanup_toremove_connections(struct task *task, void *context, unsigned short state)

BUG/MINOR: server: incorrect report for tracking servers leaving drain

Report message for tracking servers completely leaving drain is wrong:

The check for "leaving drain .. via" never evaluates because the
condition !(s->next_admin & SRV_ADMF_FDRAIN) is always true in the
current block which is guarded by !(s->next_admin & SRV_ADMF_DRAIN).

For tracking servers that leave inherited drain mode, this results in the
following message being emitted:
"Server x/b is UP (leaving forced drain)"

Instead of:
"Server x/b is UP (leaving drain) via x/a"

To this fix: we check if FDRAIN is currently set, else it means that the
drain status is inherited from the tracked server (IDRAIN)

This regression was introduced with 64cc49cf ("MAJOR: servers: propagate server
status changes asynchronously."), thus it may be backported to every stable
versions.

DOC: lua: restore 80 char limitation

Restore 80 char limitation throughout the file for easier reading
on the cli, and fix some raw formatting issues without altering
html rendering.

MINOR: hlua/event_hdl: timestamp for events

'when' optional argument is provided to lua event handlers.

It is an integer representing the number of seconds elapsed since Epoch
and may be used in conjunction with lua `os.date()` function to provide
a custom format string.

MINOR: event_hdl: provide event->when for advanced handlers

For advanced async handlers only
(Registered using EVENT_HDL_ASYNC_TASK() macro):

event->when is provided as a struct timeval and fetched from 'date'
haproxy global variable.

Thanks to 'when', related event consumers will be able to timestamp
events, even if they don't work in real-time or near real-time.
Indeed, unlike sync or normal async handlers, advanced async handlers
could purposely delay the consumption of pending events, which means
that the date wouldn't be accurate if computed directly from within
the handler.

MINOR: event_hdl: dynamically allocated event data members

Add the ability to provide a cleanup function for event data passed
via the publishing function.

One use case could be the need to provide valid pointers in the safe
section of the data struct.
Cleanup function will be automatically called with data (or copy of data)
as argument when all handlers consumed the event, which provides an easy
way to release some memory or decrement refcounts to ressources that were
provided through the data struct.
data in itself may not be freed by the cleanup function, it is handled
by the API.

This would allow passing large (allocated) data blocks through the data
struct while keeping data struct size under the EVENT_HDL_ASYNC_EVENT_DATA
size limit.

To do so, when publishing an event, where we would currently do:

        struct event_hdl_cb_data_new_family event_data;

        /* safe data, available from both sync and async contexts
* may not use pointers to short-living resources
*/
        event_data.safe.my_custom_data = x;

        /* unsafe data, only available from sync contexts */
        event_data.unsafe.my_unsafe_data = y;

        /* once data is prepared, we can publish the event */
        event_hdl_publish(NULL,
                          EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
                          EVENT_HDL_CB_DATA(&event_data));

We could do:

        struct event_hdl_cb_data_new_family event_data;

        /* safe data, available from both sync and async contexts
* may not use pointers to short-living resources,
* unless EVENT_HDL_CB_DATA_DM is used to ensure pointer
* consistency (ie: refcount)
*/
        event_data.safe.my_custom_static_data = x;
event_data.safe.my_custom_dynamic_data = malloc(1);

        /* unsafe data, only available from sync contexts */
        event_data.unsafe.my_unsafe_data = y;

        /* once data is prepared, we can publish the event */
        event_hdl_publish(NULL,
                          EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
                          EVENT_HDL_CB_DATA_DM(&event_data, data_new_family_cleanup));

With data_new_family_cleanup func which would look like this:

      void data_new_family_cleanup(const void *data)
      {
       const struct event_hdl_cb_data_new_family *event_data = ptr;

/* some data members require specific cleanup once the event
* is consumed
*/
       free(event_data.safe.my_custom_dynamic_data);
/* don't ever free data! it is not ours */
      }

Not sure if this feature will become relevant in the future, so I prefer not
to mention it in the doc for now.

But given that the implementation is trivial and does not put a burden
on the existing API, it's a good thing to have it there, just in case.

CLEANUP: event_hdl: fix comment typo about _sync assertion

Fixing a comment relative to EVENT_HDL_ASSERT_SYNC macro where a
typo was made and the comment was lacking some context.

CLEANUP: event_hdl: updating obsolete comment for EVENT_HDL_CB_DATA

EVENT_HDL_CB_DATA macro comments were not updated during the API
refactor, fixing that.

BUG/MINOR: event_hdl: don't waste 1 event subtype slot

ESUB_INDEX(n) index macro is used exclusively with n > 0
Fixing it so that it starts numbering at 1 instead of 2.

This way, we don't waste a subtype slot in event_hdl_sub_type
struct, and we comply with the structure comments about max
supported event subtypes (currently set at 16).

If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.

MINOR: server/event_hdl: prepare for upcoming refactors

This commit does nothing that ought to be mentioned, except that
it adds missing comments and slighty moves some function calls
out of "sensitive" code in preparation of some server code refactors.

MINOR: hlua/event_hdl: fix return type for hlua_event_hdl_cb_data_push_args

Changing hlua_event_hdl_cb_data_push_args() return type to void since it
does not return anything useful.
Also changing its name to hlua_event_hdl_cb_push_args() since it does more
than just pushing cb data argument (it also handles event type and mgmt).

Errors catched by the function are reported as lua errors.

MINOR: hlua/event_hdl: expose proxy_uuid variable in server events

Adding proxy_uuid to ServerEvent class.
proxy_uuid contains the uuid of the proxy to which the server belongs

MINOR: hlua/event_hdl: rely on proxy_uuid instead of proxy_name for lookups

Since "MINOR: server/event_hdl: add proxy_uuid to event_hdl_cb_data_server"
we may now use proxy_uuid variable to perform proxy lookups when
handling a server event.

It is more reliable since proxy_uuid isn't subject to any size limitation

MINOR: server/event_hdl: add proxy_uuid to event_hdl_cb_data_server

Expose proxy_uuid variable in event_hdl_cb_data_server struct to
overcome proxy_name fixed length limitation.

proxy_uuid may be used by the handler to perform proxy lookups.
This should be preferred over lookups relying proxy_name.
(proxy_name is suitable for printing / logging purposes but not for
ID lookups since it has a maximum fixed length)

CLEANUP: server: fix update_status() function comment

srv_update_status() function comment says that the function "is designed
to be called asynchronously".

While this used to be true back then with 64cc49cf
("MAJOR: servers: propagate server status changes asynchronously.")

This is not true anymore since 3ff577e ("MAJOR: server: make server state changes
synchronous again")

Fixing the comment in order to better reflect current behavior.

CLEANUP: errors: fix obsolete function comments

Since 9f903af5 ("MEDIUM: log: slightly refine the output format of
alerts/warnings/etc"), messages generated by ha_{alert,warning,notice}
don't embed date/time information anymore.

Updating some old function comments that kept saying otherwise.

BUG/MINOR: quic: consume Rx datagram even on error

A BUG_ON crash can occur on qc_rcv_buf() if a Rx packet allocation
failed.

To fix this, datagram are marked as consumed even if a fatal error
occured during parsing. For the moment, only a Rx packet allocation
failure could provoke this. At this stage, it's unknown if the datagram
were partially parsed or not at all so it's better to discard it
completely.

This bug was detected using -dMfail argument.

This should be backported up to 2.7.

BUG/MINOR: quic: prevent crash on qc_new_conn() failure

Properly initialize el_th_ctx member first on qc_new_conn(). This
prevents a segfault if release should be called later due to memory
allocation failure in the function on qc_detach_th_ctx_list().

This should be backported up to 2.7.

BUG/MINOR: h3: fix crash on h3s alloc failure

Do not emit a CONNECTION_CLOSE on h3s allocation failure. Indeed, this
causes a crash as the calling function qcs_new() will also try to emit a
CONNECTION_CLOSE which triggers a BUG_ON() on qcc_emit_cc().

This was reproduced using -dMfail.

This should be backported up to 2.7.

BUG/MINOR: mux-quic: properly handle STREAM frame alloc failure

Previously, if a STREAM frame cannot be allocated for emission, a crash
would occurs due to an ABORT_NOW() statement in _qc_send_qcs().

Replace this by proper error code handling. Each stream were sending
fails are removed temporarily from qcc::send_list to a list local to
_qc_send_qcs(). Once emission has been conducted for all streams,
reinsert failed stream to qcc::send_list. This avoids to reloop on
failed streams on the second while loop at the end of _qc_send_qcs().

This crash was reproduced using -dMfail.

This should be backported up to 2.6.

BUG/MINOR: mux-quic: fix crash with app ops install failure

On MUX initialization, the application layer is setup via
qcc_install_app_ops(). If this function fails MUX is deallocated and an
error is returned.

This code path causes a crash before connection has been registered
prior into the mux_stopping_data::list for stopping idle frontend conns.
To fix this, insert the connection later in qc_init() once no error can
occured.

The crash was seen on the process closing with SUGUSR1 with a segfault
on mux_stopping_process(). This was reproduced using -dMfail.

This regression was introduced by the following patch :
commit b4d119f0c75ce7c5a977ece18dc975e14f9b460c
BUG/MEDIUM: mux-quic: fix crash on H3 SETTINGS emission

This should be backported up to 2.7.

BUG/MINOR: quic: Wrong Retry token generation timestamp computing

Again a now_ms variable value used without the ticks API. It is used
to store the generation time of the Retry token to be received back
from the client.

Must be backported to 2.6 and 2.7.

BUG/MINOR: quic: Unchecked buffer length when building the token

As server, an Initial does not contain a token but only the token length field
with zero as value. The remaining room was not checked before writting this field.

Must be backported to 2.6 and 2.7.

MINOR: quic: Do not allocate too much ack ranges

Limit the maximum number of ack ranges to QUIC_MAX_ACK_RANGES(32).

Must be backported to 2.6 and 2.7.

BUG/MINOR: quic: Stop removing ACK ranges when building packets

Since this commit:

BUG/MINOR: quic: Possible wrapped values used as ACK tree purging limit.

There are more chances that ack ranges may be removed from their trees when
building a packet. It is preferable to impose a limit to these trees. This
will be the subject of the a next commit to come.

For now on, it is sufficient to stop deleting ack range from their trees.
Remove quic_ack_frm_reduce_sz() and quic_rm_last_ack_ranges() which were
there to do that.
Make qc_frm_len() support ACK frames and calls it to ensure an ACK frame
may be added to a packet before building it.

Must be backported to 2.6 and 2.7.

MINOR: hlua: safe coroutine.create()

Overriding global coroutine.create() function in order to link the
newly created subroutine with the parent hlua ctx.
(hlua_gethlua() function from a subroutine will return hlua ctx from the
hlua ctx on which the coroutine.create() was performed, instead of NULL)

Doing so allows hlua_hook() function to support being called from
subroutines created using coroutine.create() within user lua scripts.

That is: the related subroutine will be immune to the forced-yield,
but it will still be checked against hlua timeouts. If the subroutine
fails to yield or finish before the timeout, the related lua handler will
be aborted (instead of going rogue unnoticed like it would be the case prior
to this commit)

MINOR: hlua: hook yield on known lua state

When forcing a yield attempt from hlua_hook(), we should perform it on
the known hlua state, not on a potential substate created using
coroutine.create() from an existing hlua state from lua script.

Indeed, only true hlua couroutines will properly handle the yield and
perform the required timeout checks when returning in hlua_ctx_resume().

So far, this was not a concern because hlua_gethlua() would return NULL
if hlua_hook() is not directly being called from a hlua coroutine anyway.

But with this we're trying to make hlua_hook() ready for being called
from a subcoroutine which inherits from a parent hlua ctx.
In this case, no yield attempt will be performed, we will simply check
for hlua timeouts.

Not doing so would result in the timeout checks not being performed since
hlua_ctx_resume() is completely bypassed when yielding from the subroutine,
resulting in a user-defined coroutine potentially going rogue unnoticed.

CLEANUP: hlua: avoid confusion between internal timers and tick based timers

Not all hlua "time" variables use the same time logic.

hlua->wake_time relies on ticks since its meant to be used in conjunction
with task scheduling. Thus, it should be stored as a signed int and
manipulated using the tick api.
Adding a few comments about that to prevent mixups with hlua internal
timer api which doesn't rely on the ticks api.

MEDIUM: hlua: introduce tune.lua.burst-timeout

The "burst" execution timeout applies to any Lua handler.
If the handler fails to finish or yield before timeout is reached,
handler will be aborted to prevent thread contention, to prevent
traffic from not being served for too long, and ultimately to prevent
the process from crashing because of the watchdog kicking in.

Default value is 1000ms.
Combined with forced-yield default value of 10000 lua instructions, it
should be high enough to prevent any existing script breakage, while
still being able to catch slow lua converters or sample fetches
doing thread contention and risking the process stability.

Setting value to 0 completely bypasses this check. (not recommended but
could be required to restore original behavior if this feature breaks
existing setups somehow...)

No backport needed, although it could be used to prevent watchdog crashes
due to poorly coded (slow/cpu consuming) lua sample fetches/converters.

MEDIUM: hlua: reliable timeout detection

For non yieldable lua handlers (converters, fetches or yield
incompatible lua functions), current timeout detection relies on now_ms
thread local variable.

But within non-yieldable contexts, now_ms won't be updated if not by us
(because we're momentarily stuck in lua context so we won't
re-enter the polling loop, which is responsible for clock updates).

To circumvent this, clock_update_date(0, 1) was manually performed right
before now_ms is being read for the timeout checks.

But this fails to work consistently, because if no other concurrent
threads periodically run clock_update_global_date(), which do happen if
we're the only active thread (nbthread=1 or low traffic), our
clock_update_date() call won't reliably update our local now_ms variable

Moreover, clock_update_date() is not the right tool for this anyway, as
it was initially meant to be used from the polling context.
Using it could have negative impact on other threads relying on now_ms
to be stable. (because clock_update_date() performs global clock update
from time to time)

-> Introducing hlua multipurpose timer, which is internally based on
now_cpu_time_fast() that provides per-thread consistent clock readings.

Thanks to this new hlua timer API, hlua timeout logic is less error-prone
and more robust.

This allows the timeout detection to work as expected for both yieldable
and non-yieldable lua handlers.

This patch depends on commit "MINOR: clock: add now_cpu_time_fast() function"

While this could theorically be backported to all stable versions,
it is advisable to avoid backports unless we're confident enough
since it could cause slight behavior changes (timing related) in
existing setups.

MINOR: clock: add now_cpu_time_fast() function

Same as now_cpu_time(), but for fast queries (less accurate)
Relies on now_cpu_time() and now_mono_time_fast() is used
as a cache expiration hint to prevent now_cpu_time() from being
called too often since it is known to be quite expensive.

Depends on commit "MINOR: clock: add now_mono_time_fast() function"

MINOR: clock: add now_mono_time_fast() function

Same as now_mono_time(), but for fast queries (less accurate)
Relies on coarse clock source (also known as fast clock source on
some systems).

Fallback to now_mono_time() if coarse source is not supported on the system.

BUG/MINOR: cfgparse: make sure to include openssl-compat

Commit 5003ac7fe ("MEDIUM: config: set useful ALPN defaults for HTTPS
and QUIC") revealed a build dependency bug: if QUIC is not enabled,
cfgparse doesn't have any dependency on the SSL stack, so the various
ifdefs that try to check special conditions such as rejecting an H2
config with too small a bufsize, are silently ignored. This was
detected because the default ALPN string was not set and caused the
alpn regtest to fail without QUIC support. Adding openssl-compat to
the list of includes seems to be sufficient to have what we need.

It's unclear when this dependency was broken, it seems that even 2.2
didn't have an explicit dependency on anything SSL-related, though it
could have been inherited through other files (as happens with QUIC
here). It would be safe to backport it to all stable branches. The
impact is very low anyway.

BUG/MEDIUM: quic: prevent crash on Retry sending

The following commit introduced a regression :
commit 1a5cc19cecfa84bfd0fdd7b9d5d20899cce40662
MINOR: quic: adjust Rx packet type parsing

Since this commit, qv variable was left to NULL as version is stored
directly in quic_rx_packet instance. In most cases, this only causes
traces to skip version printing. However, qv is dereferenced when
sending a Retry which causes a segfault.

To fix this, simply remove qv variable and use pkt->version instead,
both for traces and send_retry() invocation.

This bug was detected thanks to QUIC interop runner. It can easily be
reproduced by using quic-force-retry on the bind line.

This must be backported up to 2.7.

MEDIUM: config: set useful ALPN defaults for HTTPS and QUIC

This commit makes sure that if three is no "alpn", "npn" nor "no-alpn"
setting on a "bind" line which corresponds to an HTTPS or QUIC frontend,
we automatically turn on "h2,http/1.1" as an ALPN default for an HTTP
listener, and "h3" for a QUIC listener. This simplifies the configuration
for end users since they won't have to explicitly configure the ALPN
string to enable H2, considering that at the time of writing, HTTP/1.1
represents less than 7% of the traffic on large infrastructures. The
doc and regtests were updated. For more info, refer to the following
thread:

https://www.mail-archive.com/haproxy@formilux.org/msg43410.html

MINOR: ssl_crtlist: dump "no-alpn" on "show crtlist" when "no-alpn" was set

Instead of dumping "alpn " better show "no-alpn" as configured.

MINOR: ssl: do not set ALPN callback with the empty string

While it does not have any effect, it's better not to try to setup an
ALPN callback nor to try to lookup algorithms when the configured ALPN
string is empty as a result of "no-alpn" being used.

DOC: add missing documentation for "no-alpn" on bind lines

This is the doc for the no-alpn keyword that was mistakenly left out of
commit 158c18e85 ("MINOR: config: add "no-alpn" support for bind lines").

REGTESTS: add a new "ssl_alpn" test to test ALPN negotiation

This teg-test verifies that different ALPN values on the "server" line
will negotiate the expected protocol depending on the ALPN "bind" line.

MINOR: config: add "no-alpn" support for bind lines

It's possible to replace a previously set ALPN but not to disable ALPN
if it was previously set. The new "no-alpn" setting allows to disable
a previously set ALPN setting by preparing an empty one that will be
replaced and freed when the config is validated.

BUG/MEDIUM: stconn: Propagate error on the SC on sending path

On sending path, a pending error can be promoted to a terminal error at the
endpoint level (SE_FL_ERR_PENDING to SE_FL_ERROR). When this happens, we
must propagate the error on the SC to be able to handle it at the stream
level and eventually forward it to the other side.

Because of this bug, it is possible to freeze sessions, for instance on the
CLI.

It is a 2.8-specific issue. No backport needed.

CLEANUP: cli: Remove useless debug message in cli_io_handler()

When compiled in debug mode, HAProxy prints a debug message at the end of
the cli I/O handle. It is pretty annoying and useless because, we can active
applet traces. Thus, just remove it.

CLEANUP: backend: Remove useless debug message in assign_server()

When compiled in debug mode, HAProxy prints a debug message at the beginning
of assign_server(). It is pretty annoying and useless because, in debug
mode, we can active stream traces. Thus, just remove it.

BUG/MINOR: http-ana: Update analyzers on both sides when switching in TUNNEL mode

The commit 9704797fa ("BUG/MEDIUM: http-ana: Properly switch the request in
tunnel mode on upgrade") fixes the switch in TUNNEL mode, but only
partially. Because both channels are switch in TUNNEL mode in same time on
one side, the channel's analyzers on the opposite side are not updated
accordingly. This prevents the tunnel timeout to be applied.

So instead of updating both sides in same time, we only force the analysis
on the other side by setting CF_WAKE_ONCE flag when a channel is switched in
TUNNEL mode. In addition, we must take care to forward all data if there is
no DATAa TCP filters registered.

This patch is related to the issue #2125. It is 2.8-specific. No backport
needed.

MINOR: listener: remove unneeded local accept flag

Remove the receiver RX_F_LOCAL_ACCEPT flag. This was used by QUIC
protocol before thread rebinding was supported by the quic_conn layer.

This should be backported up to 2.7 after the previous patch has also
been taken.

MAJOR: quic: support thread balancing on accept

Before this patch, QUIC protocol used a custom add_listener callback.
This was because a quic_conn instance was allocated before accept. Its
thread affinity was fixed and could not be changed after. The thread was
derived itself from the CID selected by the client which prevent an even
repartition of QUIC connections on multiple threads.

A series of patches was introduced with a lot of changes. The most
important ones :
* removal of affinity between an encoded CID and a thread
* possibility to rebind a quic_conn on a new thread

Thanks to this, it's possible to suppress the custom add_listener
callback. Accept is conducted for QUIC protocol as with the others. A
less loaded thread is selected on listener_accept() and the connection
stack is bind on it. This operation implies that quic_conn instance is
moved to the new thread using the set_affinity QUIC protocol callback.

To reactivate quic_conn instance after thread rebind,
qc_finalize_affinity_rebind() is called after accept on the new thread
by qc_xprt_start() through accept_queue_process() / session_accept_fd().

This should be backported up to 2.7 after a period of observation.

MINOR: quic: properly finalize thread rebinding

When a quic_conn instance is rebinded on a new thread its tasks and
tasklet are destroyed and new ones created. Its socket is also migrated
to a new thread which stop reception on it.

To properly reactivate a quic_conn after rebind, wake up its tasks and
tasklet if they were active before thread rebind. Also reactivate
reading on the socket FD. These operations are implemented on a new
function qc_finalize_affinity_rebind().

This should be backported up to 2.7 after a period of observation.

BUG/MINOR: quic: transform qc_set_timer() as a reentrant function

qc_set_timer() function is used to rearm the timer for loss detection
and probing. Previously, timer was always rearm when congestion window
was free due to a wrong interpretation of the RFC which mandates the
client to rearm the timer before handshake completion to avoid a
deadlock related to anti-amplification.

Fix this by removing this code from quic_pto_pktns(). This allows
qc_set_timer() to be reentrant and only activate the timer if needed.

The impact of this bug seems limited. It can probably caused the timer
task to be processed too frequently which could caused too frequent
probing.

This change will allow to reuse easily qc_set_timer() after quic_conn
thread migration. As such, the new timer task will be scheduled only if
needed.

This should be backported up to 2.6.

MEDIUM: quic: implement thread affinity rebinding

Implement a new function qc_set_tid_affinity(). This function is
responsible to rebind a quic_conn instance to a new thread.

This operation consists mostly of releasing existing tasks and tasklet
and allocating new instances on the new thread. If the quic_conn uses
its owned socket, it is also migrated to the new thread. The migration
is finally completed with updated the CID TID to the new thread. After
this step, the connection is thus accessible to the new thread and
cannot be access anymore on the old one without risking race condition.

To ensure rebinding is either done completely or not at all, tasks and
tasklet are pre-allocated before all operations. If this fails, an error
is returned and rebiding is not done.

To destroy the older tasklet, its context is set to NULL before wake up.
In I/O callbacks, a new function qc_process() is used to check context
and free the tasklet if NULL.

The thread rebinding can cause a race condition if the older thread
quic_dghdlrs::dgrams list contains datagram for the connection after
rebinding is done. To prevent this, quic_rx_pkt_retrieve_conn() always
check if the packet CID is still associated to the current thread or
not. In the latter case, no connection is returned and the new thread is
returned to allow to redispatch the datagram to the new thread in a
thread-safe way.

This should be backported up to 2.7 after a period of observation.

MINOR: quic: delay post handshake frames after accept

When QUIC handshake is completed on our side, some frames are prepared
to be sent :
* HANDSHAKE_DONE
* several NEW_CONNECTION_ID with CIDs allocated

This step was previously executed in quic_conn_io_cb() directly after
CRYPTO frames parsing. This patch delays it to be completed after
accept. Special care have been taken to ensure it is still functional
with 0-RTT activated.

For the moment, this patch should have no impact. However, when
quic_conn thread migration on accept will be implemented, it will be
easier to remap only one CID to the new thread. New CIDs will be
allocated after migration on the new thread.

This should be backported up to 2.7 after a period of observation.

MINOR: protocol: define new callback set_affinity

Define a new protocol callback set_affinity. This function is used
during listener_accept() to notify about a rebind on a new thread just
before pushing the connection on the selected thread queue. If the
callback fails, accept is done locally.

This change will be useful for protocols with state allocated before
accept is done. For the moment, only QUIC protocol is concerned. This
will allow to rebind the quic_conn to a new thread depending on its
load.

This should be backported up to 2.7 after a period of observation.

MINOR: quic: do not proceed to accept for closing conn

Each quic_conn is inserted in an accept queue to allocate the upper
layers. This is done through a listener tasklet in
quic_sock_accept_conn().

This patch interrupts the accept process for a quic_conn in
closing/draining state. Indeed, this connection will soon be closed so
it's unnecessary to allocate a complete stack for it.

This patch will become necessary when thread migration is implemented.
Indeed, it won't be allowed to proceed to thread migration for a closing
quic_conn.

This should be backported up to 2.7 after a period of observation.

MEDIUM: quic: handle conn bootstrap/handshake on a random thread

TID encoding in CID was removed by a recent change. It is now possible
to access to the <tid> member stored in quic_connection_id instance.

For unknown CID, a quick solution was to redispatch to the thread
corresponding to the first CID byte. This ensures that an identical CID
will always be handled by the same thread to avoid creating multiple
same connection. However, this forces an uneven load repartition which
can be critical for QUIC handshake operation.

To improve this, remove the above constraint. An unknown CID is now
handled by its receiving thread. However, this means that if multiple
packets are received with the same unknown CID, several threads will try
to allocate the same connection.

To prevent this race condition, CID insertion in global tree is now
conducted first before creating the connection. This is a thread-safe
operation which can only be executed by a single thread. The thread
which have inserted the CID will then proceed to quic_conn allocation.
Other threads won't be able to insert the same CID : this will stop the
treatment of the current packet which is redispatch to the now owning
thread.

This should be backported up to 2.7 after a period of observation.

MINOR: quic: remove TID encoding in CID

CIDs were moved from a per-thread list to a global list instance. The
TID-encoded is thus non needed anymore.

This should be backported up to 2.7 after a period of observation.

MEDIUM: quic: use a global CID trees list

Previously, quic_connection_id were stored in a per-thread tree list.
Datagram were first dispatched to the correct thread using the encoded
TID before a tree lookup was done.

Remove these trees and replace it with a global trees list of 256
entries. A CID is using the list index corresponding to its first byte.
On datagram dispatch, CID is lookup on its tree and TID is retrieved
using new member quic_connection_id.tid. As such, a read-write lock
protects each list instances. With 256 entries, it is expected that
contention should be reduced.

A new structure quic_cid_tree served as a tree container associated with
its read-write lock. An API is implemented to ensure lock safety for
insert/lookup/delete operation.

This patch is a step forward to be able to break the affinity between a
CID and a TID encoded thread. This is required to be able to migrate a
quic_conn after accept to select thread based on their load.

This should be backported up to 2.7 after a period of observation.

MINOR: quic: remove TID ref from quic_conn

Remove <tid> member in quic_conn. This is moved to quic_connection_id
instance.

For the moment, this change has no impact. Indeed, qc.tid reference
could easily be replaced by tid as all of this work was already done on
the connection thread. However, it is planified to support quic_conn
thread migration in the future, so removal of qc.tid will simplify this.

This should be backported up to 2.7.

MINOR: quic: adjust quic CID derive API

ODCID are never stored in the CID tree. Instead, we store our generated
CID which is directly derived from the CID using a hash function. This
operation is done via quic_derive_cid().

Previously, generated CID was returned as a 64-bits integer. However,
this is cumbersome to convert as an array of bytes which is the most
common CID representation. Adjust this by modifying return type to a
quic_cid struct.

This should be backported up to 2.7.

MINOR: quic: adjust Rx packet type parsing

qc_parse_hd_form() is the function used to parse the first byte of a
packet and return its type and version. Its API has been simplified with
the following changes :
* extra out paremeters are removed (long_header and version). All infos
  are now stored directly in quic_rx_packet instance
* a new dummy version is declared in quic_versions array with a 0 number
  code. This can be used to match Version negotiation packets.
* a new default packet type is defined QUIC_PACKET_TYPE_UNKNOWN to be
  used as an initial value.

Also, the function has been exported to an include file. This will be
useful to be able to reuse on quic-sock to parse the first packet of a
datagram.

This should be backported up to 2.7.

MINOR: quic: remove uneeded tasklet_wakeup after accept

No need to explicitely wakeup quic-conn tasklet after accept is done.

This should be backported up to 2.7.

CLEANUP: quic: rename quic_connection_id vars

Two different structs exists for QUIC connection ID :
* quic_connection_id which represents a full CID with its sequence
number
* quic_cid which is just a buffer with a length. It is contained in the
above structure.

To better differentiate them, rename all quic_connection_id variable
instances to "conn_id" by contrast to "cid" which is used for quic_cid.

This should be backported up to 2.7.

CLEANUP: quic: remove unused qc param on stateless reset token

Remove quic_conn instance as first parameter of
quic_stateless_reset_token_init() and quic_stateless_reset_token_cpy()
functions. It was only used for trace purpose.

The main advantage is that it will be possible to allocate a QUIC CID
without a quic_conn instance using new_quic_cid() which is requires to
first check if a CID is existing before allocating a connection.

This should be backported up to 2.7.

CLEANUP: quic: remove unused scid_node

Remove unused scid_node member for quic_conn structure. It was prepared
for QUIC backend support.

This should be backported up to 2.7.

CLEANUP: quic: remove unused QUIC_LOCK label

QUIC_LOCK label is never used. Indeed, lock usage is minimal on QUIC as
every connection is pinned to its owned thread.

This should be backported up to 2.7.

BUG/MINOR: task: allow to use tasklet_wakeup_after with tid -1

Adjust BUG_ON() statement to allow tasklet_wakeup_after() for tasklets
with tid pinned to -1 (the current thread). This is similar to
tasklet_wakeup().

This should be backported up to 2.6.

MINOR: mux-h2: make the max number of concurrent streams configurable per side

For a long time the maximum number of concurrent streams was set once for
both sides (front and back) while the impacts are different. This commit
allows it to be configured separately for each side. The older settings
remains the fallback choice when other ones are not set.

MINOR: mux-h2: make the initial window size configurable per side

For a long time the initial window size (per-stream size) was set once
for both directions, frontend and backend, resulting in a tradeoff between
upload speed and download fairness. This commit allows it to be configured
separately for each side. The older settings remains the fallback choice
when other ones are not set.

MINOR: stconn: Propagate EOS from an applet to the attached stream-connector

In the same way than for a stream-connector attached to a mux, an EOS is now
propagated from an applet to its stream-connector. To do so, sc_applet_eos()
function is added.

MINOR: stconn: Propagate EOS from a mux to the attached stream-connector

Now there is a SC flag to state the endpoint has reported an end-of-stream,
it is possible to distinguish an EOS from an abort at the stream-connector
level.

sc_conn_read0() function is renamed to sc_conn_eos() and it propagates an
EOS by setting SC_FL_EOS instead of SC_FL_ABRT_DONE. It only concernes
stream-connectors attached to a mux.

MINOR: stconn: Add a flag to report EOS at the stream-connector level

SC_FL_EOS flag is added to report the end-of-stream at the SC level. It will
be used to distinguish end of stream reported by the endoint, via the
SE_FL_EOS flag, and the abort triggered by the stream, via the
SC_FL_ABRT_DONE flag.

In this patch, the flag is defined and is systematically tested everywhere
SC_FL_ABRT_DONE is tested. It should be safe because it is never set.

BUG/MEDIUM: log: Properly handle client aborts in syslog applet

In the syslog applet, when there is no output data, nothing is performed and
the applet leaves by requesting more data. But it is an issue because a
client abort is only handled if it reported with the last bytes of the
message. If the abort occurs after the message was handled, it is ignored.
The session remains opened and inactive until the client timeout is being
triggered. It no such timeout is configured, given that the default maxconn
is 10, all slots can be quickly busy and make the applet unresponsive.

To fix the issue, the best is to always try to read a message when the I/O
handle is called. This way, the abort can be handled. And if there is no
data, we leave as usual.

This patch should fix the issue #2112. It must be backported as far as 2.4.

BUG/MEDIUM: http-ana: Properly switch the request in tunnel mode on upgrade

Since the commit f2b02cfd9 ("MAJOR: http-ana: Review error handling during
HTTP payload forwarding"), during the payload forwarding, we are analyzing a
side, we stop to test the opposite side. It means when the HTTP request
forwarding analyzer is called, we no longer check the response side and vice
versa.

Unfortunately, since then, the HTTP tunneling is broken after a protocol
upgrade. On the response is switch in TUNNEL mode. The request remains in
DONE state. As a consequence, data received from the server are forwarded to
the client but not data received from the client.

To fix the bug, when both sides are in DONE state, both are switched in same
time in TUNNEL mode if it was requested. It is performed in the same way in
http_end_request() and http_end_response().

This patch should fix the issue #2125. It is 2.8-specific. No backport
needed.

MINOR: ssl: remove OpenSSL 1.0.2 mention into certificate loading error

Remove the mention to OpenSSL 1.0.2 in the certificate chain loading
error, which is not relevant.

Could be backported in 2.7.

CLEANUP: use "offsetof" where appropriate

let's use the C library macro "offsetof"

BUG/MINOR: quic: Do not use ack delay during the handshakes

As revealed by GH #2120 opened by @Tristan971, there are cases where ACKs
have to be sent without packet to acknowledge because the ACK timer has
been triggered and the connection needs to probe the peer at the same time.
Indeed

Thank you to @Tristan971 for having reported this issue.

Must be backported to 2.6 and 2.7.

BUG/MINOR: stconn: Don't set SE_FL_ERROR at the end of sc_conn_send()

When I reworked my series, this code was first removed and reinserted by
error. So let's remove it again.

MEDIUM: stconn: Rely on SC flags to handle errors instead of SE flags

It is the last commit on this subject. we stop to use SE_FL_ERROR flag from
the SC, except at the I/O level. Otherwise, we rely on SC_FL_ERROR
flag. Now, there should be a real separation between SE flags and SC flags.