git.ipfire.org Git - thirdparty/haproxy.git/log

MINOR: counters: Review conditions to increment counters from analysers

Now, for these counters, the following rules are followed to know if it must be
incremented or not:

  * if it exists for a frontend, the counter is incremented
  * if stats must be collected for the session's listener, if the counter exists
    for this listener, it is incremented
  * if the backend is already assigned, if the counter exists for this backend,
    it is incremented
  * if a server is attached to the stream, if the counter exists for this
    server, it is incremented

It is not hardcoded rules. Some counters are still handled in a different
way. But many counters are incremented this way now.

MINOR: counters: Remove failed_secu counter and use denied_resp instead

The failed_secu counter is only used for the servers stats. It is used to report
the number of denied responses. On proxies, the same info is stored in the
denied_resp counter. So, it is more consistent to use the same field for
servers.

MINOR: contrib/prometheus-exporter: Export internal errors per proxy/server

The new ST_F_EINT stats field is now exported for each proxy/server.

MINOR: stats: Report internal errors in the proxies/listeners/servers stats

The stats field ST_F_EINT has been added to report internal errors encountered
per proxy, per listener and per server. It appears in the CLI export and on the
HTML stats page.

MINOR: http-rules: Handle denied/aborted/invalid connections from HTTP rules

The new possible results for a custom action (deny/abort/invalid) are now handled
during HTTP rules evaluation. These codes are mapped on HTTP rules ones :

  * ACT_RET_DENY => HTTP_RULE_RES_DENY
  * ACT_RET_ABRT => HTTP_RULE_RES_ABRT
  * ACT_RET_INV  => HTTP_RULE_RES_BADREQ

For now, no custom action uses these new codes.

MINOR: tcp-rules: Handle denied/aborted/invalid connections from TCP rules

The new possible results for a custom action (deny/abort/invalid) are now
handled during TCP rules evaluation. For L4/L5 rules, the session is
rejected. For L7 rules, the right counter is incremented, then the connections
killed.

For now, no custom action uses these new codes.

MINOR: http-rules: Add more return codes to let custom actions act as normal ones

When HTTP/TCP rules are evaluated, especially HTTP ones, some results are
possible for normal actions and not for custom ones. So missing return codes
(ACT_RET_) have been added to let custom actions act as normal ones. Concretely
following codes have been added:

* ACT_RET_DENY : deny the request/response. It must be handled by the caller
* ACT_RET_ABRT : abort the request/response, handled by action itsleft.
* ACT_RET_INV : invalid request/response

MINOR: http-rules: Handle internal errors during HTTP rules evaluation

The HTTP_RULE_RES_ERROR code is now used by HTTP analyzers to handle internal
errors during HTTP rules evaluation. It is used instead of HTTP_RULE_RES_BADREQ,
used for invalid requests/responses. In addition, the SF_ERR_RESOURCE flag is
set on the stream when an allocation failure happens.

Note that the return value of http-response rules evaluation is now tested in
the same way than the result of http-request rules evaluation.

MINOR: http-rules: Add a rule result to report internal error

Now, when HTTP rules are evaluated, HTTP_RULE_RES_ERROR must be returned when an
internal error is catched. It is a way to make the difference between a bad
request or a bad response and an error during its processing.

MEDIUM: http-ana: Properly handle internal processing errors

Now, processing errors are properly handled. Instead of returning an error 400
or 502, depending where the error happens, an error 500 is now returned. And the
processing_errors counter is incremented. By default, when such error is
detected, the SF_ERR_INTERNAL stream error is used. When the error is caused by
an allocation failure, and when it is reasonnably possible, the SF_ERR_RESOURCE
stream error is used. Thanks to this patch, bad requests and bad responses
should be easier to detect.

MINOR: counters: Add a counter to report internal processing errors

This counter, named 'internal_errors', has been added in frontend and backend
counters. It should be used when a internal error is encountered, instead for
failed_req or failed_resp.

MINOR: http-rules: Return an error when custom actions return ACT_RET_ERR

Thanks to the commit "MINOR: actions: Use ACT_RET_CONT code to ignore an error
from a custom action", it is now possible to trigger an error from a custom
action in http rules. Now, when a custom action returns the ACT_RET_ERR code
from an http-request rule, an error 400 is returned. And from an http-response
rule, an error 502 is returned.

Be careful if this patch is backported. The other mentioned patch must be
backported first.

MINOR: tcp-rules: Kill connections when custom actions return ACT_RET_ERR

Thanks to the commit "MINOR: actions: Use ACT_RET_CONT code to ignore an error
from a custom action", it is now possible to trigger an error from a custom
action in tcp-content rules. Now, when a custom action returns the ACT_RET_ERR
code, it has the same behavior than a reject rules, the connection is killed.

Be careful if this patch is backported. The other mentioned patch must be
backported first.

MINOR: actions: Use ACT_RET_CONT code to ignore an error from a custom action

Some custom actions are just ignored and skipped when an error is encoutered. In
that case, we jump to the next rule. To do so, most of them use the return code
ACT_RET_ERR. Currently, for http rules and tcp content rules, it is not a
problem because this code is handled the same way than ACT_RET_CONT. But, it
means there is no way to handle the error as other actions. The custom actions
must handle the error and return ACT_RET_DONE. For instance, when http-request
rules are processed, an error when we try to replace a header value leads to a
bad request and an error 400 is returned to the client. But when we fail to
replace the URI, the error is silently ignored. This difference between the
custom actions and the others is an obstacle to write new custom actions.

So, in this first patch, ACT_RET_CONT is now returned from custom actions
instead of ACT_RET_ERR when an error is encoutered if it should be ignored. The
behavior remains the same but it is now possible to handle true errors using the
return code ACT_RET_ERR. Some actions will probably be reviewed to determine if
an error is fatal or not. Other patches will be pushed to trigger an error when
a custom action returns the ACT_RET_ERR code.

This patch is not tagged as a bug because it is just a design issue. But others
will depends on it. So be careful during backports, if so.

MINOR: tcp-rules: Always set from which ruleset a rule comes from

The ruleset from which a TCP rule comes from (the <from> field in the act_rule
structure) is only set when a rule is created from a registered keyword and not
for all TCP rules. But this information may be useful to check the configuration
validity or during the rule evaluation. So now, we systematically set it.

MEDIUM: http-rules: Register an action keyword for all http rules

There are many specific http actions that don't use the action registration
mechanism (allow, deny, set-header...). Instead, the parsing of these actions is
inlined in the functions responsible to parse the http-request/http-response
rules. There is no reason to not register an action keyword for all these
actions. It it the purpose of this patch. The new functions responsible to parse
these http actions are defined in http_act.c

BUG/MINOR: stick-table: Use MAX_SESS_STKCTR as the max track ID during parsing

During the parsing of the sc-inc-gpc0, sc-inc-gpc1 and sc-inc-gpt1 actions, the
maximum stick table track ID allowed is tested against ACT_ACTION_TRK_SCMAX. It
is the action number and not the maximum number of stick counters. Instead,
MAX_SESS_STKCTR must be used.

This patch must be backported to all stable versions.

BUG/MINOR: http-rules: Remove buggy deinit functions for HTTP rules

Functions to deinitialize the HTTP rules are buggy. These functions does not
check the action name to release the right part in the arg union. Only few info
are released. For auth rules, the realm is released and there is no problem
here. But the regex <arg.hdr_add.re> is always unconditionally released. So it
is easy to make these functions crash. For instance, with the following rule
HAProxy crashes during the deinit :

http-request set-map(/path/to/map) %[src] %[req.hdr(X-Value)]

For now, These functions are simply removed and we rely on the deinit function
used for TCP rules (renamed as deinit_act_rules()). This patch fixes the
bug. But arguments used by actions are not released at all, this part will be
addressed later.

This patch must be backported to all stable versions.

BUG/MINOR: http-ana/filters: Wait end of the http_end callback for all filters

Filters may define the "http_end" callback, called at the end of the analysis of
any HTTP messages. It is called at the end of the payload forwarding and it can
interrupt the stream processing. So we must be sure to not remove the XFER_BODY
analyzers while there is still at least filter in progress on this callback.

Unfortunatly, once the request and the response are borh in the DONE or the
TUNNEL mode, we consider the XFER_BODY analyzer has finished its processing on
both sides. So it is possible to prematurely interrupt the execution of the
filters "http_end" callback.

To fix this bug, we switch a message in the ENDING state. It is then switched in
DONE/TUNNEL mode only after the execution of the filters "http_end" callback.

This patch must be backported (and adapted) to 2.1, 2.0 and 1.9. The legacy HTTP
mode shoud probaly be fixed too.

MINOR: contrib/prometheus-exporter: Add heathcheck status/code in server metrics

ST_F_CHECK_STATUS and ST_F_CHECK_CODE are now part of exported server metrics:

* haproxy_server_check_status
* haproxy_server_check_code

The heathcheck status is an integer corresponding to HCHK_STATUS value.

MINOR: mux-h1: Inherit send flags from the upper layer

Send flags (CO_SFL_*) used when xprt->snd_buf() is called, in h1_send(), are now
inherited from the upper layer, when h1_snd_buf() is called. First, the flag
CO_SFL_MSG_MORE is no more set if the output buffer is full, but only if the
stream-interface decides to set it. It has more info to do it than the
mux. Then, the flag CO_SFL_STREAMER is now also handled this way. It was just
ignored till now.

DOC: Add a section to document the internal sample fetches

The section 7.3.7. is now dedicated to internal sample fetches. For now, only
HTX sample fetches are referenced in this section. But it should contain the
documentation of all sample fetches reserved to an internal use, for debugging
or testing purposes.

MINOR: http-htx: Make 'internal.htx_blk_data' return a binary string

This internal sample fetch now returns a binary string (SMP_T_BIN) instead of a
character string.

MINOR: http-htx: Rename 'internal.htx_blk.val' to 'internal.htx_blk.data'

Use a more explicit name for this internal sample fetch.

MINOR: http-htx: Move htx sample fetches in the scope "internal"

HTX sample fetches are now prefixed by "internal." to explicitly reserve their
uses for debugging or testing purposes.

BUG/MINOR: 51d: Fix bug when HTX is enabled

When HTX is enabled, the sample flags were set too early. When matching for
multiple HTTP headers, the sample is fetched more than once, meaning that the
flags would need to be set again. Instead, the flags are now set last (just
before the outermost function returns). This could be further improved by
passing around the message without calling prefetch again.

This patch must be backported as far as 1.9. it should fix bug #450.

BUG/MINOR: dns: Make dns_query_id_seed unsigned

Left shifting of large signed values and negative values is undefined.

In a test script clang's ubsan rightfully complains:

> runtime error: left shift of 1934242336581872173 by 13 places cannot be represented in type 'int64_t' (aka 'long')

This bug was introduced in the initial version of the DNS resolver
in 325137d603aa81bd24cbd8c99d816dd42291daa7. The fix must be backported
to HAProxy 1.6+.

BUG/MINOR: cache: Fix leak of cache name in error path

This issue was introduced in commit 99a17a2d91f9044ea20bba6617048488aed80555
which first appeared in tag v1.9-dev11. This bugfix should be backported
to HAProxy 1.9+.

DOC: Fix copy and paste mistake in http-response replace-value doc

This fixes up commit 2252beb8557d73407b8f96eef91d6927fb855685.

MINOR: ssl: Add support for returning the dn samples from ssl_(c|f)_(i|s)_dn in LDAP v3 (RFC2253) format.

Modifies the existing sample extraction methods (smp_fetch_ssl_x_i_dn,
smp_fetch_ssl_x_s_dn) to accommodate a third argument that indicates the
DN should be returned in LDAP v3 format. When the third argument is
present, the new function (ssl_sock_get_dn_formatted) is called with
three parameters including the X509_NAME, a buffer containing the format
argument, and a buffer for the output.  If the supplied format matches
the supported format string (currently only "rfc2253" is supported), the
formatted value is extracted into the supplied output buffer using
OpenSSL's X509_NAME_print_ex and BIO_s_mem. 1 is returned when a dn
value is retrieved.  0 is returned when a value is not retrieved.

Argument validation is added to each of the related sample
configurations to ensure the third argument passed is either blank or
"rfc2253" using strcmp.  An error is returned if the third argument is
present with any other value.

Documentation was updated in configuration.txt and it was noted during
preliminary reviews that a CLEANUP patch should follow that adjusts the
documentation.  Currently, this patch and the existing documentation are
copied with some minor revisions for each sample configuration.  It
might be better to have one entry for all of the samples or entries for
each that reference back to a primary entry that explains the sample in
detail.

Special thanks to Chris, Willy, Tim and Aleks for the feedback.

Author: Elliot Otchet <degroens@yahoo.com>
Reviewed-by: Tim Duesterhus <tim@bastelstu.be>

MINOR: connection: make the last arg of subscribe() a struct wait_event*

The subscriber used to be passed as a "void *param" that was systematically
cast to a struct wait_event*. By now it appears clear that the subscribe()
call at every layer is well defined and always takes a pointer to an event
subscriber of type wait_event, so let's enforce this in the functions'
prototypes, remove the intermediary variables used to cast it and clean up
the comments to clarify what all these functions do in their context.

MEDIUM: mux-fcgi: merge recv_wait and send_wait event notifications

This is the last of the "recv_wait+send_wait merge" patches and is
functionally equivalent to previous commit "MEDIUM: mux-h2: merge
recv_wait and send_wait event notifications" but for FCGI this time.
The principle is pretty much the same, since the code is very similar.
We use a single wait_event for both recv and send and rely on the
subscribe flags to know the desired notifications.

MEDIUM: mux-h2: merge recv_wait and send_wait event notifications

This is the continuation of the recv+send event notifications merge
that was started. This patch is less trivial than the previous ones
because the existence of a send event subscription is also used to
decide to put a stream back into the send list.

MEDIUM: mux-h1: merge recv_wait and send_wait

This is the same principle as previous commit, but for the H1 mux this
time. The checks in the subscribe()/unsubscribe() calls were factored
and some BUG_ON() were added to detect unexpected cases.
h1_wake_for_recv() and h1_wake_for_send() needed to be refined to
consider the current subscription before deciding to wake up.

MEDIUM: ssl: merge recv_wait and send_wait in ssl_sock

This is the same principle as previous commit, but for ssl_sock.

MEDIUM: xprt: merge recv_wait and send_wait in xprt_handshake

This is the same principle as previous commit, but for xprt_handshake.

MEDIUM: connection: merge the send_wait and recv_wait entries

In practice all callers use the same wait_event notification for any I/O
so instead of keeping specific code to handle them separately, let's merge
them and it will allow us to create new events later.

MEDIUM: backend: move the connection finalization step to back_handle_st_con()

Currently there's still lots of code in conn_complete_server() that performs
one half of the connection setup, which is then checked and finalized in
back_handle_st_con(). There isn't a valid reason for this anymore, we can
simplify this and make sure that conn_complete_server() only wakes the stream
up to inform it about the fact the whole connection stack is set up so that
back_handle_st_con() finishes its job at the stream-int level.

It looks like the there could even be further simplified, but for now it
was moved straight out of conn_complete_server() with no modification.

REORG: stream/backend: move backend-specific stuff to backend.c

For more than a decade we've kept all the sess_update_st_*() functions
in stream.c while they're only there to work in relation with what is
currently being done in backend.c (srv_redispatch_connect, connect_server,
etc). Let's move all this pollution over there and take this opportunity
to try to find slightly less confusing names for these old functions
whose role is only to handle transitions from one specific stream-int
state:

  sess_update_st_rdy_tcp() -> back_handle_st_rdy()
  sess_update_st_con_tcp() -> back_handle_st_con()
  sess_update_st_cer()     -> back_handle_st_cer()
  sess_update_stream_int() -> back_try_conn_req()
  sess_prepare_conn_req()  -> back_handle_st_req()
  sess_establish()         -> back_establish()

The last one remained in stream.c because it's more or less a completion
function which does all the initialization expected on a connection
success or failure, can set analysers and emit logs.

The other ones could possibly slightly benefit from being modified to
take a stream-int instead since it's really what they're working with,
but it's unimportant here.

MEDIUM: mux-fcgi: do not make an fstrm subscribe to itself on deferred shut

This is the port to FCGI of previous commit "MEDIUM: mux-h2: do not make
an h2s subscribe to itself on deferred shut".

The purpose is to avoid subscribing to the send_wait list when trying to
close, because we'll soon have to merge both recv and send lists. Basic
testing showed no difference (performance nor issues).

MEDIUM: mux-h2: do not make an h2s subscribe to itself on deferred shut

The logic handling the deferred shutdown is a bit complex because it
involves a wait_event struct in each h2s dedicated to subscribing to
itself when shutdowns are not immediately possible. This implies that
we will not be able to support a shutdown and a receive subscription
in the future when we merge all wait events.

Let's solely rely on the H2_SF_WANT_SHUT_{R,W} flags instead and have
an autonomous tasklet for this. This requires to add a few controls
in the code because now when waking up a stream we need to check if it
is for I/O or just a shut, but since sending and shutting are exclusive
it's not difficult.

One point worth noting is that further resources could be shaved off
by only allocating the tasklet when failing to shut, given that in the
vast majority of streams it will never be used. In fact the sole purpose
of the tasklet is to support calling this code from outside the H2 mux
context. Looking at the code, it seems that not too many adaptations
would be required to have the send_list walking code deal with sending
the shut bits itself and further simplify all this.

MEDIUM: mux-fcgi: do not try to stop sending streams on blocked mux

This is essentially the same change as applied to mux-h2 in previous commit
"MEDIUM: mux-h2: do not try to stop sending streams on blocked mux". The
goal is to make sure we don't need to keep the item in the send_wait list
until it's executed so that we can later merge it with the recv_wait list.
No performance changes were observed.

MEDIUM: mux-h2: do not try to stop sending streams on blocked mux

This partially reverts commit d846c267 ("MINOR: h2: Don't run tasks that
are waiting to send if mux in full"). This commit was introduced to
limit the start/stop overhead incurred by waking many streams to let
only a few work. But since commit 9c218e7521 ("MAJOR: mux-h2: switch
to next mux buffer on buffer full condition."), this situation occurs
way less (typically 2000 to 4000 times less often) and the benefits of
the patch above do not outweigh its shortcomings anymore. And commit
c7ce4e3e7f ("BUG/MEDIUM: mux-h2: don't stop sending when crossing a
buffer boundary") addressed a root cause of many unexpected sleeps and
wakeups.

The main problem it's causing is that it requires to keep the element
in the send_wait list until it's executed, leaving the entry in an
uncertain state, and significantly complicating the coexistence of this
list and the wait list dedicated to shutdown. Also it happens that this
call to tasklet_remove_from_task_list() will not be usable anymore once
we start to support streams on different threads. And finally, some of
the other streams that we remove might very well have managed to find
their way to the h2_snd_buf() with an unblocked condition as well so it
is possible that some of these removals were not welcome.

So this patch now makes sure that send_wait is immediately nulled when
the task is woken up, and that we don't have to play with it afterwards.
Since we don't need to stop the tasklets anymore, we don't need the
sending_list that we can remove.

However one very useful benefit of the sending_list was that it used to
provide the information about the fact that the stream already tried to
send and failed. This was an important factor to improve fairness because
late arrived streams should not be allowed to send if others are already
scheduled. So this patch introduces a new per-stream flag H2_SF_NOTIFIED
to distinguish such streams.

With this patch the fairness is preserved, and the ratio of aborted
h2_snd_buf() due to other streams already sending remains quite low
(~0.3-2.1% measured depending on object size, this is within
expectations for 100 independent streams).

If the contention issue the patch above used to address comes up again
in the future, a much better (though more complicated) solution would
be to switch to per-connection buffer pools to distribute between the
connection and the streams so that by default there are more buffers
available for the mux and the streams only have some when the mux's are
unused, i.e. it would push the memory pressure back to the data layer.

One observation made while developing this patch is that when dealing
with large objects we still spend a huge amount of time scanning the
send_list with tasks that are already woken up every time a send()
manages to purge a bit more data. Indeed, by removing the elements
from the list when H2_SF_NOTIFIED is set, the netowrk bandwidth on
1 MB objects fetched over 100 streams per connection increases by 38%.
This was not done here to preserve fairness but is worth studying (e.g.
by keeping a restart pointer on the list or just having a flag indicating
if an entry was added since last scan).

BUILD: pattern: include errno.h

Commit 3c79d4bdc introduced the use of errno in pattern.c without
including errno.h.
If we build haproxy without any option errno is not defined and the
build fails.

MEDIUM: connection: get rid of CO_FL_CURR_* flags

These ones used to serve as a set of switches between CO_FL_SOCK_* and
CO_FL_XPRT_*, and now that the SOCK layer is gone, they're always a
copy of the last know CO_FL_XPRT_* ones that is resynchronized before
I/O events by calling conn_refresh_polling_flags(), and that are pushed
back to FDs when detecting changes with conn_xprt_polling_changes().

While these functions are not particularly heavy, what they do is
totally redundant by now because the fd_want_*/fd_stop_*() actions
already perform test-and-set operations to decide to create an entry
or not, so they do the exact same thing that is done by
conn_xprt_polling_changes(). As such it is pointless to call that
one, and given that the only reason to keep CO_FL_CURR_* is to detect
changes there, we can now remove them.

Even if this does only save very few cycles, this removes a significant
complexity that has been responsible for many bugs in the past, including
the last one affecting FreeBSD.

All tests look good, and no performance regressions were observed.

MINOR: stream-int: remove dependency on CO_FL_WAIT_ROOM for rcv_buf()

The only case where this made sense was with mux_h1 but Since we
introduced CS_FL_MAY_SPLICE, we don't need to rely on this anymore,
thus we don't need to clear it either when we do not splice.

There is a last check on this flag used to determine if the rx channel
is full and that cannot go away unless it's changed to use the CS
instead but for now this wouldn't add any benefit so better not do
it yet.

MINOR: connection: move the CO_FL_WAIT_ROOM cleanup to the reader only

CO_FL_WAIT_ROOM is set by the splicing function in raw_sock, and cleared
by the stream-int when splicing is disabled, as well as in
conn_refresh_polling_flags() so that a new call to ->rcv_pipe() could
be attempted by the I/O callbacks called from conn_fd_handler(). This
clearing in conn_refresh_polling_flags() makes no sense anymore and is
in no way related to the polling at all.

Since we don't call them from there anymore it's better to clear it
before attempting to receive, and to set it again later. So let's move
this operation where it should be, in raw_sock_to_pipe() so that it's
now symmetric. It was also placed in raw_sock_to_buf() so that we're
certain that it gets cleared if an attempt to splice is replaced with
a subsequent attempt to recv(). And these were currently already achieved
by the call to conn_refresh_polling_flags(). Now it could theorically be
removed from the stream-int.

BUG/MINOR: pattern: handle errors from fgets when trying to load patterns

We need to do some error handling after we call fgets to make sure everything
went fine. If we don't users can be fooled into thinking they can load pattens
from directory because cfgparse doesn't flinch. This applies to acl patterns
map files.

This should be backported to all supported versions.

BUG/MEDIUM: connection: add a mux flag to indicate splice usability

Commit c640ef1a7d ("BUG/MINOR: stream-int: avoid calling rcv_buf() when
splicing is still possible") fixed splicing in TCP and legacy mode but
broke it badly in HTX mode.

What happens in HTX mode is that the channel's to_forward value remains
set to CHN_INFINITE_FORWARD during the whole transfer, and as such it is
not a reliable signal anymore to indicate whether more data are expected
or not. Thus, when data are spliced out of the mux using rcv_pipe(), even
when the end is reached (that only the mux knows about), the call to
rcv_buf() to get the final HTX blocks completing the message were skipped
and there was often no new event to wake this up, resulting in transfer
timeouts at the end of large objects.

All this goes down to the fact that the channel has no more information
about whether it can splice or not despite being the one having to take
the decision to call rcv_pipe() or not. And we cannot afford to call
rcv_buf() inconditionally because, as the commit above showed, this
reduces the forwarding performance by 2 to 3 in TCP and legacy modes
due to data lying in the buffer preventing splicing from being used
later.

The approach taken by this patch consists in offering the muxes the ability
to report a bit more information to the upper layers via the conn_stream.
This information could simply be to indicate that more data are awaited
but the real need being to distinguish splicing and receiving, here
instead we clearly report the mux's willingness to be called for splicing
or not. Hence the flag's name, CS_FL_MAY_SPLICE.

The mux sets this flag when it knows that its buffer is empty and that
data waiting past what is currently known may be spliced, and clears it
when it knows there's no more data or that the caller must fall back to
rcv_buf() instead.

The stream-int code now uses this to determine if splicing may be used
or not instead of looking at the rcv_pipe() callbacks through the whole
chain. And after the rcv_pipe() call, it checks the flag again to decide
whether it may safely skip rcv_buf() or not.

All this bitfield dance remains a bit complex and it starts to appear
obvious that splicing vs reading should be a decision of the mux based
on permission granted by the data layer. This would however increase
the API's complexity but definitely need to be thought about, and should
even significantly simplify the data processing layer.

The way it was integrated in mux-h1 will also result in no more calls
to rcv_pipe() on chunked encoded data, since these ones are currently
disabled at the mux level. However once the issue with chunks+splice
is fixed, it will be important to explicitly check for curr_len|CHNK
to set MAY_SPLICE, so that we don't call rcv_buf() after each chunk.

This fix must be backported to 2.1 and 2.0.

BUG/MINOR: stream: don't mistake match rules for store-request rules

In process_sticking_rules() we only want to apply the first store-request
rule for a given table, but when doing so we need to make sure we only
count actual store-request rules when we list the sticking rules.

Failure to do so leads to not being able to write store-request and match
sticking rules in any order as a match rule after a store-request rule
will be ignored.

The following configuration reproduces the issue:

  global
    stats socket /tmp/foobar

  defaults
    mode http

  frontend in
    bind *:8080
    default_backend bar

  backend bar
    server s1 127.0.0.1:21212
    server s2 127.0.0.1:21211
    stick store-request req.hdr(foo)
    stick match req.hdr(foo)
    stick-table type string size 10

  listen foo
    bind *:21212
    bind *:21211
    http-request deny deny_status 200 if { dst_port 21212 }
    http-request deny

This patch fixes issue #448 and should be backported as far as 1.6.

CLEANUP: cli: deduplicate the code in _getsocks

Since the fix 5fd3b28 ("BUG/MEDIUM: cli: _getsocks must send the peers
sockets") for bug #443. The code which sends the socket for the peers
and the proxies is duplicated. This patch move this code in a separated
function.

BUG/MEDIUM: cli: _getsocks must send the peers sockets

This bug prevents to reload HAProxy when you have both the seamless
reload (-x / expose-fd listeners) and the peers.

Indeed the _getsocks command does not send the FDs of the peers
listeners, so if no reuseport is possible during the bind, the new
process will fail to bind and exits.

With this feature, it is not possible to fallback on the SIGTTOU method
if we didn't receive all the sockets, because you can't close() the
sockets of the new process without closing those of the previous
process, they are the same.

Should fix bug #443.

Must be backported as far as 1.8.

REGTEST: add sample_fetches/hashes.vtc to validate hashes

This regtest validates all hashes that we support, on all input bytes from
0x00 to 0xFF. Those supporting avalanche are tested as well. It also tests
len(), hex() and base64(). It purposely does not enable sha2() because this
one relies on OpenSSL and there's no point in validating that OpenSSL knows
how to hash, what matters is that we can test our hashing functions in all
cases. However since the tests were written, they're still present and
commented out in case that helps.

It may be backported to supported versions, possibly dropping a few algos
that were not supported (e.g. crc32c requires 1.9 minimum).

Note that this test will fail on crc32/djb2/sdbm/wt6 unless patches
"BUG/MINOR: stream: init variables when the list is empty" and
"BUG/MAJOR: hashes: fix the signedness of the hash inputs" are included.

BUG/MAJOR: hashes: fix the signedness of the hash inputs

Wietse Venema reported in the thread below that we have a signedness
issue with our hashes implementations: due to the use of const char*
for the input key that's often text, the crc32, sdbm, djb2, and wt6
algorithms return a platform-dependent value for binary input keys
containing bytes with bit 7 set. This means that an ARM or PPC
platform will hash binary inputs differently from an x86 typically.
Worse, some algorithms are well defined in the industry (like CRC32)
and do not provide the expected result on x86, possibly causing
interoperability issues (e.g. a user-agent would fail to compare the
CRC32 of a message body against the one computed by haproxy).

Fortunately, and contrary to the first impression, the CRC32c variant
used in the PROXY protocol processing is not affected. Thus the impact
remains very limited (the vast majority of input keys are text-based,
such as user-agent headers for exmaple).

This patch addresses the issue by fixing all hash functions' prototypes
(even those not affected, for API consistency). A reg test will follow
in another patch.

The vast majority of users do not use these hashes. And among those
using them, very few will pass them on binary inputs. However, for the
rare ones doing it, this fix MAY have an impact during the upgrade. For
example if the package is upgraded on one LB then on another one, and
the CRC32 of a binary input is used as a stick table key (why?) then
these CRCs will not match between both nodes. Similarly, if
"hash-type ... crc32" is used, LB inconsistency may appear during the
transition. For this reason it is preferable to apply the patch on all
nodes using such hashes at the same time. Systems upgraded via their
distros will likely observe the least impact since they're expected to
be upgraded within a short time frame.

And it is important for distros NOT to skip this fix, in order to avoid
distributing an incompatible implementation of a hash. This is the
reason why this patch is tagged as MAJOR, eventhough it's extremely
unlikely that anyone will ever notice a change at all.

This patch must be backported to all supported branches since the
hashes were introduced in 1.5-dev20 (commit 98634f0c). Some parts
may be dropped since implemented later.

Link to Wietse's report:
https://marc.info/?l=postfix-users&m=157879464518535&w=2

CLEANUP: proxy: simplify proxy_parse_rate_limit proxy checks

rate-limits are valid for both frontend and listen, but not backend; so
we can simplify this check in a similar manner as it is done in e.g
max-keep-alive-queue.

this should fix github issue #449

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

BUG/MEDIUM: raw_sock: Make sur the fd and conn are sync.

Commit 08fa16e397ffb1c6511b98ade2a3bfff9435e521 made sure
we let the fd layer we didn't want to poll anymore if
we failed to send and sendto() returne EAGAIN.
However, just disabling the polling with fd_stop_send()
while not notifying the connection layer means the
connection layer may believe the polling is activated
and nothing needs to be done when it is wrong.
A better fix is to revamp that whole code, for the
time being, just make sure the fd and connection
layer are properly synchronised.

This should fix the problem recently reported on FreeBSD.

BUG/MEDIUM: mux_h1: Don't call h1_send if we subscribed().

In h1_snd_buf(), only attempt to call h1_send() if we haven't
already subscribed.
It makes no sense to do it if we subscribed, as we know we failed
to send before, and will create a useless call to sendto(), and
in 2.2, the call to raw_sock_from_buf() will disable polling if
it is enabled.

This should be backported to 2.2, 2.1, 2.0 and 1.9.

CLEANUP: compression: remove unused deinit_comp_ctx section

Since commit 27d93c3f9428 ("BUG/MAJOR: compression/cache: Make it
really works with these both filters"), we no longer use section
deinit_comp_ctx.

This should fix github issue #441

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

DOC: clarify crt-base usage

crt-base is also used after "crt" directive.

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

BUG/MEDIUM: mworker: remain in mworker mode during reload

If you reload an haproxy started in master-worker mode with
"master-worker" in the configuration, and no "-W" argument,
the new process lost the fact that is was in master-worker mode
resulting in weird behaviors.

The bigest problem is that if it is reloaded with an bad configuration,
the master will exits instead of remaining in waitpid mode.

This problem was discovered in bug #443.

Should be backported in every version using the master-worker mode.
(as far as 1.8)

REGTEST: mcli/mcli_start_progs: start 2 programs

This regtest tests the issue #446 by starting 2 programs and checking if
they exist in the "show proc" of the master CLI.

Should be backported as far as 2.0.

BUG/MINOR: cli/mworker: can't start haproxy with 2 programs

When trying to start HAProxy with the master CLI and more than one
program in the configuration, it refuses to start with:

[ALERT] 013/132926 (1378) : parsing [cur--1:0] : proxy 'MASTER', another server named 'cur--1' was already defined at line 0, please use distinct names.
[ALERT] 013/132926 (1378) : Fatal errors found in configuration.

The problem is that haproxy tries to create a server for the MASTER
proxy but only the worker are supposed to be in the server list.

Fix issue #446.

Must be backported as far as 2.0.

BUG/MEDIUM: mux-h2: don't stop sending when crossing a buffer boundary

In version 2.0, after commit 9c218e7521 ("MAJOR: mux-h2: switch to next
mux buffer on buffer full condition."), the H2 mux started to use a ring
buffer for the output data in order to reduce competition between streams.
However, one corner case was suboptimally covered: when crossing a buffer
boundary, we have to shrink the outgoing frame size to the one left in
the output buffer, but this shorter size is later used as a signal of
incomplete send due to a buffer full condition (which used to be true when
using a single buffer). As a result, function h2s_frt_make_resp_data()
used to return less than requested, which in turn would cause h2_snd_buf()
to stop sending and leave some unsent data in the buffer, and si_cs_send()
to subscribe for sending more later.

But it goes a bit further than this, because subscribing to send again
causes the mux's send_list not to be empty anymore, hence extra streams
can be denied the access to the mux till the first stream is woken again.
This causes a nasty wakeup-sleep dance between streams that makes it
totally impractical to try to remove the sending list. A test showed
that it was possible to observe 3 million h2_snd_buf() giveups for only
100k requests when using 100 concurrent streams on 20kB objects.

It doesn't seem likely that a stream could get blocked and time out due
to this bug, though it's not possible either to demonstrate the opposite.
One risk is that incompletely sent streams do not have any blocking flags
so they may not be identified as blocked. However on first scan of the
send_list they meet all conditions for a wakeup.

This patch simply allows to continue on a new frame after a partial
frame. with only this change, the number of failed h2_snd_buf() was
divided by 800 (4% of calls). And by slightly increasing the H2C_MBUF_CNT
size, it can go down to zero.

This fix must be backported to 2.1 and 2.0.

MEDIUM: lua: don't call the GC as often when dealing with outgoing connections

In order to properly close connections established from Lua in case
a Lua context dies, the context currently automatically gets a flag
HLUA_MUST_GC set whenever an outgoing connection is used. This causes
the GC to be enforced on the context's death as well as on yield. First,
it does not appear necessary to do it when yielding, since if the
connections die they are already cleaned up. Second, the problem with
the flag is that even if a connection gets properly closed, the flag is
not removed and the GC continues to be called on the Lua context.

The impact on performance looks quite significant, as noticed and
diagnosed by Sadasiva Gujjarlapudi in the following thread:

https://www.mail-archive.com/haproxy@formilux.org/msg35810.html

This patch changes the flag for a counter so that each created
connection increments it and each cleanly closed connection decrements
it. That way we know we have to call the GC on the context's death only
if the count is non-null. As reported in the thread above, the Lua
performance gain is now over 20% by doing this.

Thanks to Sada and Thierry for the design discussion and tests that
led to this solution.

CLEANUP: ssl: remove opendir call in ssl_sock_load_cert

Since commit 3180f7b55434 ("MINOR: ssl: load certificates in
alphabetical order"), `readdir` was replaced by `scandir`. We can indeed
replace it with a check on the previous `stat` call.

This micro cleanup can be a good benefit when you have hundreds of bind
lines which open TLS certificates directories in terms of syscall,
especially in a case of frequent reloads.

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

BUG/MEDIUM: mux-h2: fix missing test on sending_list in previous patch

Previous commit 989539b048 ("BUG/MINOR: mux-h2: use a safe
list_for_each_entry in h2_send()") accidently lost its sending_list test,
resulting in some elements to be woken up again while already in the
sending_list and h2_unsubscribe() crashing on integrity tests (only
when built with DEBUG_DEV).

If the fix above is backported this one must be as well.

BUG/MINOR: mux-h2: use a safe list_for_each_entry in h2_send()

h2_send() uses list_for_each_entry() to scan paused streams and resume
them, but happily deletes any leftover from a previous failed unsubscribe,
which is obviously not safe and would corrupt the list. In practice this
is a proof that this doesn't happen, but it's not the best way to prove it.
In order to fix this and reduce the maintenance burden caused by code
duplication (this list walk exists at 3 places), let's introduce a new
function h2_resume_each_sending_h2s() doing exactly this and use it at
all 3 places.

This bug was introduced as a side effect of fix 998410a41b ("BUG/MEDIUM:
h2: Revamp the way send subscriptions works.") so it should be backported
as far as 1.9.

BUG/MEDIUM: tasks: Use the MT macros in tasklet_free().

In tasklet_free(), to attempt to remove ourself, use MT_LIST_DEL, we can't
just use LIST_DEL(), as we theorically could be in the shared tasklet list.

This should be backported to 2.1.

BUG/MINOR: stream-int: Don't trigger L7 retry if max retries is already reached

When an HTTP response is received, at the stream-interface level, if a L7 retry
must be triggered because of the status code, the response is trashed and a read
error is reported on the response channel. Then the stream handles this error
and perform the retry. Except if the maximum connection retries is reached. In
this case, an error is reported. Because the server response was already trashed
by the stream-interface, a generic 502 error is returned to the client instead
of the server's one.

Now, the stream-interface triggers a L7 retry only if the maximum connection
retries is not already reached. Thus, at the end, the last server's response is
returned.

This patch must be backported to 2.1 and 2.0. It should fix the issue #439.

REGTEST: set_ssl_cert.vtc: replace "echo" with "printf"

"echo -e" for some reason does not work on travis-ci, so let us switch
to "printf"

Fixes: #423

CLEANUP: server: remove unused err section in server_finalize_init

Since commit 980855bd953c ("BUG/MEDIUM: server: initialize the orphaned
conns lists and tasks at the end"), we no longer use err section.

This should fix github issue #438

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

BUG/MAJOR: listener: do not schedule a task-less proxy

Apparently seamingless commit 0591bf7deb ("MINOR: listener: make the
wait paths cleaner and more reliable") caused a nasty regression and
revealed a rare race that hits regtest stickiness/lb-services.vtc
about 4% of the times for 8 threads.

The problem is that when a multi-threaded listener wakes up on an
incoming connection, several threads can receive the event, especially
when idle. And all of them will race to accept the connections in
parallel, adjusting the listener's nbconn and proxy's feconn until
one reaches the proxy's limit and declines. At this step the changes
are cancelled, the listener is marked "limited", and when the threads
exit the function, one of them will unlimit the listener/proxy again
so that it can accept incoming connections again.

The problem happens when many threads connect to a small peers section
because its maxconn is very limited (typically 6 for 2 peers), and it's
sometimes possible for enough competing threads to hit the limit and
one of them will limit the listener and queue the proxy's task... except
that peers do not initialize their proxy task since they do not use rate
limiting. Thus the process crashes when doing task_schedule(p->task).
Prior to the cleanup patch above, this didn't happen because the error
path that was dedicated to only limiting the listener did not call
task_schedule(p->task).

Given that the proxy's task is optional, and that the expire value
passed there is always TICK_ETERNITY, it's sufficient and reasonable to
avoid calling this task_schedule() when expire is not set. And for long
term safety we can also avoid to do it when the task is not set. A first
fix consisted in allocating a task for the peers proxies but it's never
used and would eat resources for reason.

No backport is needed as this commit was only merged into 2.2.

BUILD: cirrus-ci: choose proper openssl package name

freebsd-11.3 and 12.1 comes with different openssl naming
let us add proper switch to cirrus-ci script

CLEANUP: mux-h2: remove unused goto "out_free_h2s"

Since commit fa8aa867b915 ("MEDIUM: connections: Change struct
wait_list to wait_event.") we no longer use this section.

this should fix github issue #437

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

MINOR: http: Add 404 to http-request deny

This patch adds http status code 404 Not Found to http-request deny. See
issue #80.

MINOR: http: Add 410 to http-request deny

This patch adds http status code 410 Gone to http-request deny. See
issue #80.

MINOR: raw_sock: make sure to disable polling once everything is sent

Analysing traces revealed a rare but surprizing pattern :

    connect()  = -1 EAGAIN
    send()     = success
    epoll_ctl(ADD, EPOLLOUT)
    epoll_wait()
    recvfrom() = success
    close()

What happens is that the failed connect() creates an FD update for pollout,
but the successful synchronous send() doesn't disable it because polling was
only disabled in the FD handler. But a successful synchronous connect()
cancellation is a good opportunity to disable polling before it's effectively
enabled in the next loop, so better disable it when reaching the end. The
cost is very low if it was already disabled anyway (one atomic op).

This only affects local connections but with this the typical number of
epoll_ctl() calls per connection dropped from ~4.2 to ~3.8 for plain TCP
and 10k transfers.

MEDIUM: dns: implement synchronous send

In dns_send_query(), there's no point in first waking up the FD, to get
called back by the poller to send the request and sleep. Instead let's
simply send the request as soon as it's known and only subscribe to the
poller when the socket buffers are full and it's required to poll (i.e.
almost never).

This significantly reduces the number of calls to the poller. A large
config sees the number of epoll_ctl() calls reduced from 577 to 7 over
10 seconds, the number of recvfrom() from 1533 to 582 and the number of
sendto() from 369 to 162.

It also has the extra benefit of building each requests only once per
resolution and sending it to multiple resolvers instead of rebuilding
it for each and every resolver.

This will reduce the risk of seeing situations similar to bug #416 in
the future.

BUG/MEDIUM: session: do not report a failure when rejecting a session

In session_accept_fd() we can perform a synchronous call to
conn_complete_session() and if it succeeds the connection is accepted
and turned into a session. If it fails we take it as an error while it
is not, in this case, it's just that a tcp-request rule has decided to
reject the incoming connection. The problem with reporting such an event
as an error is that the failed status is passed down to the listener code
which decides to disable accept() for 100ms in order to leave some time
for transient issues to vanish, and that's not what we want to do here.

This fix must be backported as far as 1.7. In 1.7 the code is a bit
different as tcp_exec_l5_rules() is called directly from within
session_new_fd() and ret=0 must be assigned there.

BUG/MINOR: channel: inject output data at the end of output

In co_inject(), data must be inserted at the end of output, not the end of
input. For the record, this function does not take care of input data which are
supposed to not exist. But the caller may reset input data after or before the
call. It is its own choice.

This bug, among other effects, is visible when a redirect is performed on
the response path, on legacy HTTP mode (so for HAProxy < 2.1). The redirect
response is appended after the server response when it should overwrite it.

Thanks to Kevin Zhu <ip0tcp@gmail.com> to report the bug. It must be backported
as far as 1.9.

BUG/MEDIUM: http-ana: Truncate the response when a redirect rule is applied

When a redirect rule is executed on the response path, we must truncate the
received response. Otherwise, the redirect is appended after the response, which
is sent to the client. So it is obviously a bug because the redirect is not
performed. With bodyless responses, it is the "only" bug. But if the response
has a body, the result may be invalid. If the payload is not fully received yet
when the redirect is performed, an internal error is reported.

It must be backported as far as 1.9.

BUG/MINOR: proxy: Fix input data copy when an error is captured

In proxy_capture_error(), input data are copied in the error snapshot. The copy
must take care of the data wrapping. But the length of the first block is
wrong. It should be the amount of contiguous input data that can be copied
starting from the input's beginning. But the mininum between the input length
and the buffer size minus the input length is used instead. So it is a problem
if input data are wrapping or if more than the half of the buffer is used by
input data.

This patch must be backported as far as 1.9.

BUG/MINOR: h1: Report the right error position when a header value is invalid

During H1 messages parsing, when the parser has finished to parse a full header
line, some tests are performed on its value, depending on its name, to be sure
it is valid. The content-length is checked and converted in integer and the host
header is also checked. If an error occurred during this step, the error
position must point on the header value. But from the parser point of view, we
are already on the start of the next header. Thus the effective reported
position in the error capture is the beginning of the unparsed header line. It
is a bit confusing when we try to figure out why a message is rejected.

Now, the parser state is updated to point on the invalid value. This way, the
error position really points on the right position.

This patch must be backported as far as 1.9.

MINOR: ssl: Remove unused variable "need_out".

The "need_out" variable was used to let the ssl code know we're done
reading early data, and we should start the handshake.
Now that the handshake function is responsible for taking care of reading
early data, all that logic has been removed from ssl_sock_to_buf(), but
need_out was forgotten, and left. Remove it know.
This patch was submitted by William Dauchy <w.dauchy@criteo.com>, and should
fix github issue #434.
This should be backported to 2.0 and 2.1.

MINOR: config: disable busy polling on old processes

in the context of seamless reload and busy polling, older processes will
create unecessary cpu conflicts; we can assume there is no need for busy
polling for old processes which are waiting to be terminated.

This patch is not a bug fix itself but might be a good stability
improvment when you are un the context of frequent seamless reloads with
a high "hard-stop-after" value; for that reasons I think this patch
should be backported in all 2.x versions.

Signed-off-by: William Dauchy <w.dauchy@criteo.com>

BUILD: CI: modernize cirrus-ci

use freebsd-12.1 instead of freebsd-12.0,
add freebsd-11.3 to build matrix,
install socat in order to run modern reg-tests

BUG/MEDIUM: connections: Hold the lock when wanting to kill a connection.

In connect_server(), when we decide we want to kill the connection of
another thread because there are too many idle connections, hold the
toremove_lock of the corresponding thread, othervise, there's a small race
condition where we could try to add the connection to the toremove_connections
list while it has already been free'd.

This should be backported to 2.0 and 2.1.

BUG/MEDIUM: checks: Only attempt to do handshakes if the connection is ready.

When creating a new check connection, only attempt to add an handshake
connection if the connection has fully been initialized. It can not be the
case if a DNS resolution is still pending, and thus we don't yet have the
address for the server, as the handshake code assumes the connection is fully
initialized and would otherwise crash.
This is not ideal, the check shouldn't probably run until we have an address,
as it leads to check failures with "Socket error".
While I'm there, also add an xprt handshake if we're using socks4, otherwise
checks wouldn't be able to use socks4 properly.
This should fix github issue #430

This should be backported to 2.0 and 2.1.

OPTIM: polling: do not create update entries for FD removal

In order to reduce the number of poller updates, we can benefit from
the fact that modern pollers use sampling to report readiness and that
under load they rarely report the same FD multiple times in a row. As
such it's not always necessary to disable such FDs especially when we're
almost certain they'll be re-enabled again and will require another set
of syscalls.

Now instead of creating an update for a (possibly temporary) removal,
we only perform this removal if the FD is reported again as ready while
inactive. In addition this is performed via another update so that
alternating workloads like transfers have a chance to re-enable the
FD without any syscall during the loop (typically after the data that
filled a buffer have been sent). However we only do that for single-
threaded FDs as the other ones require a more complex setup and are not
on the critical path.

This does cause a few spurious wakeups but almost totally eliminates the
calls to epoll_ctl() on connections seeing intermitent traffic like HTTP/1
to a server or client.

A typical example with 100k requests for 4 kB objects over 200 connections
shows that the number of epoll_ctl() calls doesn't depend on the number
of requests anymore but most exclusively on the number of established
connections:

Before:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
57.09    0.499964           0    654361    321190 recvfrom
38.33    0.335741           0    369097         1 epoll_wait
  4.56    0.039898           0     44643           epoll_ctl
  0.02    0.000211           1       200       200 connect
------ ----------- ----------- --------- --------- ----------------
100.00    0.875814               1068301    321391 total

After:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
59.25    0.504676           0    657600    323630 recvfrom
40.68    0.346560           0    374289         1 epoll_wait
  0.04    0.000370           0       620           epoll_ctl
  0.03    0.000228           1       200       200 connect
------ ----------- ----------- --------- --------- ----------------
100.00    0.851834               1032709    323831 total

As expected there is also a slight increase of epoll_wait() calls since
delaying de-activation of events can occasionally cause one spurious
wakeup.

OPTIM: epoll: always poll for recv if neither active nor ready

The cost of enabling polling in one direction with epoll is very high
because it requires one syscall per FD and per direction change. In
addition we don't know about input readiness until we either try to
receive() or enable polling and watch the result. With HTTP keep-alive,
both are equally expensive as it's very uncommon to see the server
instantly respond (unless it's a second stage of the same process on
localhost, which has become much less common with threads).

But when a connection is established it's also quite usual to have to
poll for sending (except on localhost or UNIX sockets where it almost
always instantly works). So this cost of polling could be factored out
with the second step if both were enabled together.

This is the idea behind this patch. What it does is to always enable
polling for Rx if it's not ready and at least one direction is active.
This means that if it's not explicitly disabled, or if it was but in a
state that causes the loss of the information (rx ready cannot be
guessed), then let's take any opportunity for a polling change to
enable it at the same time, and learn about rx readiness for free.

In addition the FD never gets unregistered for Rx unless it's ready
and was blocked (buffer full). This avoids a lot of the flip-flop
behaviour at beginning and end of requests.

On a test with 10k requests in keep-alive, the difference is quite
noticeable:

Before:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
83.67    0.010847           0     20078           epoll_ctl
16.33    0.002117           0      2231           epoll_wait
  0.00    0.000000           0        20        20 connect
------ ----------- ----------- --------- --------- ----------------
100.00    0.012964                 22329        20 total

After:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
96.35    0.003351           1      2644           epoll_wait
  2.36    0.000082           4        20        20 connect
  1.29    0.000045           0        66           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.003478                  2730        20 total

It may also save a recvfrom() after connect() by changing the following
sequence, effectively saving one epoll_ctl() and one recvfrom() :

           before              |            after
  -----------------------------+----------------------------
  - connect()                  |  - connect()
  - epoll_ctl(add,out)         |  - epoll_ctl(add, in|out)
  - sendto()                   |  - epoll_wait() = out
  - epoll_ctl(mod,in|out)      |  - send()
  - epoll_wait() = out         |  - epoll_wait() = in|out
  - recvfrom() = EAGAIN        |  - recvfrom() = OK
  - epoll_ctl(mod,in)          |  - recvfrom() = EAGAIN
  - epoll_wait() = in          |  - epoll_ctl(mod, in)
  - recvfrom() = OK            |  - epoll_wait()
  - recvfrom() = EAGAIN        |
  - epoll_wait()               |
    (...)

Now on a 10M req test on 16 threads with 2k concurrent conns and 415kreq/s,
we see 190k updates total and 14k epoll_ctl() only.

MINOR: poller: do not call the IO handler if the FD is not active

For now this almost never happens but with subsequent patches it will
become more important not to uselessly call the I/O handlers if the FD
is not active.

CLEANUP: connection: merge CO_FL_NOTIFY_DATA and CO_FL_NOTIFY_DONE

Both flags became equal in commit 82967bf9 ("MINOR: connection: adjust
CO_FL_NOTIFY_DATA after removal of flags"), which already predicted the
overlap between xprt_done_cb() and wake() after the removal of the DATA
specific flags in 1.8. Let's simply remove CO_FL_NOTIFY_DATA since the
"_DONE" version already covers everything and explains the intent well
enough.

MINOR: connection: remove the double test on xprt_done_cb()

The conn_fd_handler used to have one possible call to this function to
notify about end of handshakes, and another one to notify about connection
setup or error. But given that we're now only performing wakeup calls
after connection validation, we don't need to keep two places to run
this test since the conditions do not change in between.

This patch merges the two tests into a single one and moves the
CO_FL_CONNECTED test appropriately as well so that it's called even
on the error path if needed.

MINOR: connection: check for connection validation earlier

In conn_fd_handler() we used to first give a chance to the send()
callback to try to send data and validate the connection at the same
time. But since 1.9 we do not call this callback anymore inline, it's
scheduled. So let's validate the connection ealier so that all other
decisions can be taken based on this confirmation. This may notably
be useful to the xprt_done_cb() to know that the connection was
properly validated.

REORG: connection: move tcp_connect_probe() to conn_fd_check()

The function is not TCP-specific at all, it covers all FD-based sockets
so let's move this where other similar functions are, in connection.c,
and rename it conn_fd_check().

MEDIUM: tcp: make tcp_connect_probe() consider ERR/HUP

Now that we know what pollers can return ERR/HUP, we can take this
into account to save one syscall: with such a poller, if neither are
reported, then we know the connection succeeded and we don't need to
go with getsockopt() nor connect() to validate this. In addition, for
the remaining cases (select() or suspected errors), we'll always go
through the extra connect() attempt and enumerate possible "in progress",
"connected" or "failed" status codes and take action solely based on
this.

This results in one saved syscall on modern pollers, only a second
connect() still being used on select() and the server's address never
being needed anymore.

Note that we cannot safely replace connect() with getsockopt() as the
latter clears the error on the socket without saving it, and health
checks rely on it for their reporting. This would be OK if the error
was saved in the connection itself.

MINOR: pollers: add a new flag to indicate pollers reporting ERR & HUP

In practice it's all pollers except select(). It turns out that we're
keeping some legacy code only for select and enforcing it on all
pollers, let's offer the pollers the ability to declare that they
do not need that.

CLEANUP: connection: conn->xprt is never NULL

Let's remove this outdated test that's been there since 1.5. For quite
some time now xprt hasn't been NULL anymore on an initialized connection.

BUG/MINOR: connection: only wake send/recv callbacks if the FD is active

Since commit c3df4507fa ("MEDIUM: connections: Wake the upper layer even
if sending/receiving is disabled.") the send/recv callbacks are called
on I/O if the FD is ready and not just if it's active. This means that
in some situations (e.g. send ready but nothing to send) we may
needlessly enter the if() block, notice we're not subscribed, set
io_available=1 and call the wake() callback even if we're just called
for read activity. Better make sure we only do this when the FD is
active in that direction..

This may be backported as far as 2.0 though it should remain under
observation for a few weeks first as the risk of harm by a mistake
is higher than the trouble it should cause.

BUG/MINOR: checks: refine which errno values are really errors.

Two regtest regularly fail in a random fashion depending on the machine's
load (one could really wonder if it's really worth keeping such
unreproducible tests) :
- tcp-check_multiple_ports.vtc
- 4be_1srv_smtpchk_httpchk_layer47errors.vtc

It happens that one of the reason is the time it takes to connect to
the local socket (hence the load-dependent aspect): if connect() on the
loopback returns EINPROGRESS then this status is reported instead of a
real error. Normally such a test is expected to see the error cleaned
by tcp_connect_probe() but it really depends on the timing and instead
we may very well send() first and see this error. The problem is that
everything is collected based on errno, hoping it won't get molested
in the way from the last unsuccesful syscall to wake_srv_chk(), which
obviously is hard to guarantee.

This patch at least makes sure that a few non-errors are reported as
zero just like EAGAIN. It doesn't fix the root cause but makes it less
likely to report incorrect failures.

This fix could be backported as far as 1.9.