git.ipfire.org Git - thirdparty/haproxy.git/log

MEDIUM: fd: do not use the FD_POLL_* flags in the pollers anymore

As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.

It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.

MINOR: fd: add two flags ERR and SHUT to describe FD states

There's currently a big ambiguity on our use of POLLHUP because we
currently map POLLHUP and POLLRDHUP to FD_POLL_HUP. The first one
indicates a close in *both* directions while the second one indicates
a unidirectional close. Since we don't know from the resulting flag
we always have to read when reported. Furthermore kqueue only reports
unidirectional responses which are mapped to FD_POLL_HUP as well, and
their write closes are mapped to a general error.

We could add a new FD_POLL_RDHUP flag to improve the mapping, or
switch only to the POLL* flags, but that further complicates the
portability for operating systems like FreeBSD which do not have
POLLRDHUP but have its semantics.

Let's instead directly use the per-direction flag values we already
have, and it will be a first step in the direction of finer states.
Thus we introduce an ERR and a SHUT status for each direction, that
the pollers will be able to compute and pass to fd_update_events().

It's worth noting that FD_EV_STATUS already sees the two new flags,
but they are harmless since used only by fd_{recv,send}_state() which
are never called. Thus in its current state this patch must be totally
transparent.

MEDIUM: connection: enable reading only once the connection is confirmed

In order to address the absurd polling sequence described in issue #253,
let's make sure we disable receiving on a connection until it's established.
Previously with bottom-top I/Os, we were almost certain that a connection
was ready when the first I/O was confirmed. Now we can enter various
functions, including process_stream(), which will attempt to read
something, will fail, and will then subscribe. But we don't want them
to try to receive if we know the connection didn't complete. The first
prerequisite for this is to mark the connection as not ready for receiving
until it's validated. But we don't want to mark it as not ready for sending
because we know that attempting I/Os later is extremely likely to work
without polling.

Once the connection is confirmed we re-enable recv readiness. In order
for this event to be taken into account, the call to tcp_connect_probe()
was moved earlier, between the attempt to send() and the attempt to recv().
This way if tcp_connect_probe() enables reading, we have a chance to
immediately fall back to this and read the possibly pending data.

Now the trace looks like the following. It's far from being perfect
but we've already saved one recvfrom() and one epollctl():

epoll_wait(3, [], 200, 0) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=7, u64=7}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=7, u64=7}}], 200, 1000) = 1
connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(7, "OPTIONS / HTTP/1.0\r\n\r\n", 22, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 22
epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}], 200, 1000) = 1
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(7, "HTTP/1.0 200\r\nContent-length: 0\r\nX-req: size=22, time=0 ms\r\nX-rsp: id=dummy, code=200, cache=1, size=0, time=0 ms (0 real)\r\n\r\n", 16384, 0, NULL, NULL) = 126
close(7) = 0

MINOR: fd: add two new calls fd_cond_{recv,send}()

These two functions are used to enable recv/send but only if the FD is
not marked as active yet. The purpose is to conditionally mark them as
tentatively usable without interfering with the polling if polling was
already enabled, when it's supposed to be likely true.

MEDIUM: fd: mark the FD as ready when it's inserted

Given that all our I/Os are now directed from top to bottom and not the
opposite way around, and the FD cache was removed, it doesn't make sense
anymore to create FDs that are marked not ready since this would prevent
the first accesses unless the caller explicitly does an fd_may_recv()
which is not expected to be its job (which conn_ctrl_init() has to do
by the way). Let's move this into fd_insert() instead, and have a single
atomic operation for both directions via fd_may_both().

BUG/MAJOR: ssl: ssl_sock was not fully initialized.

'ssl_sock' wasn't fully initialized so a new session can inherit some
flags from an old one.

This causes some fetches, related to client's certificate presence or
its verify status and errors, returning erroneous values.

This issue could generate other unexpected behaviors because a new
session could also inherit other flags such as SSL_SOCK_ST_FL_16K_WBFSIZE,
SSL_SOCK_SEND_UNLIMITED, or SSL_SOCK_RECV_HEARTBEAT from an old session.

This must be backported to 2.0 but it's useless for previous.

BUG/MINOR: lb/leastconn: ignore the server weights for empty servers

As discussed in issue #178, the change brought around 1.9-dev11 by commit
1eb6c55808 ("MINOR: lb: make the leastconn algorithm more accurate")
causes some harm in the situation it tried to improve. By always applying
the server's weight even for no connection, we end up always picking the
same servers for the first connections, so under a low load, if servers
only have either 0 or 1 connections, in practice the same servers will
always be picked.

This patch partially restores the original behaviour but still keeping
the spirit of the aforementioned patch. Now what is done is that servers
with no connections will always be picked first, regardless of their
weight, so they will effectively follow round-robin. Only servers with
one connection or more will see an accurate weight applied.

This patch was developed and tested by @malsumis and @jaroslawr who
reported the initial issue. It should be backported to 2.0 and 1.9.

MINOR: contrib/prometheus-exporter: Report DRAIN/MAINT/NOLB status for servers

Now, following status are reported for servers:0=DOWN, 1=UP, 2=MAINT, 3=DRAIN,
4=NOLB.

It is linked to the github issue #255. Thanks to Mickaël Martin. If needed, this
patch may be backported to 2.0.

BUILD: CI: add basic CentOS 6 cirrus build

BUG/MINOR: mux-h1: Fix a UAF in cfg_h1_headers_case_adjust_postparser()

When an error occurs in the post-parser callback which checks configuration
validity of the option outgoing-headers-case-adjust-file, the error message is
freed too early, before being used.

No backport needed. It fixes the github issue #258.

BUG/MINOR: checks: do not uselessly poll for reads before the connection is up

It's pointless to start to perform a recv() call on a connection that is
not yet established. The only purpose used to be to subscribe but that
causes many extra syscalls when we know we can do it later.

This patch only attempts a read if the connection is established or if
there is no write planed, since we want to be certain to be called. And
in wake_srv_chk() we continue to attempt to read if the reader was not
subscribed, so as to perform the first read attempt. In case a first
result is provided, __event_srv_chk_r() will not do anything anyway so
this is totally harmless in this case.

This fix requires that commit "BUG/MINOR: checks: make __event_chk_srv_r()
report success before closing" is applied before, otherwise it will break
some checks (notably SSL) by doing them again after the connection is shut
down. This completes the fixes on the checks described in issue #253 by
roughly cutting the number of syscalls in half. It must be backported to
2.0.

BUG/MINOR: checks: make __event_chk_srv_r() report success before closing

On a plain TCP check, this function will do nothing except shutting the
connection down and will not even update the status. This prevents it
from being called again, which is the reason why we attempt to do it
once too early. Let's first fix this function to make it report success
on plain TCP checks before closing, as it does for all other ones.

This must be backported to 2.0. It should be safe to backport to older
versions but it doesn't seem it would fix anything there.

BUG/MINOR: checks: start sending the request right after connect()

Since the change of I/O direction, we must not wait for an empty connect
callback before sending the request, we must attempt to send it as soon
as possible so that we don't uselessly poll. This is what this patch
does. This reduces the total check duration by a complete poll loop
compared to what is described in issue #253.

This must be backported to 2.0.

BUG/MINOR: checks: stop polling for write when we have nothing left to send

Since the change of I/O direction, we perform the connect() call and
the send() call together from the top. But the send call must at least
disable polling for writes once it does not have anything left to send.

This bug is partially responsible for the waste of resources described
in issue #253.

This must be backported to 2.0.

CONTRIB: debug: add new program "poll" to test poll() events

This simple program prepares a TCP connection between two ends and
allows to perform various operations on them such as send, recv, poll,
shutdown, close, reset, etc. It takes care of remaining particularly
silent to help inspection via strace, though it can also be verbose
and report status, errno, and poll events. It delays acceptation of
the incoming server-side connection so that it's even possible to
test the poll status on a listener with a pending connection, or
to close the connection without accepting it and inspect the effect
on the client.

Actions are executed in the command line order as they are parsed,
they may be grouped using commas when they are performed on the same
socket.

Example showing a successful recv() of pending data before a pending error:
   $ ./poll -v -l pol,acc,pol -c snd,shw -s pol,rcv,pol,rcv,pol,snd,lin,clo -c pol,rcv,pol,rcv,pol

   #### BEGIN ####
   cmd #1 stp #1: do_pol(3): ret=1 ev=0x1 (IN)
   cmd #1 stp #2: do_acc(3): ret=5
   cmd #1 stp #3: do_pol(3): ret=0 ev=0
   cmd #2 stp #1: do_snd(4): ret=3
   cmd #2 stp #2: do_shw(4): ret=0
   cmd #3 stp #1: do_pol(5): ret=1 ev=0x2005 (IN OUT RDHUP)
   cmd #3 stp #2: do_rcv(5): ret=3
   cmd #3 stp #3: do_pol(5): ret=1 ev=0x2005 (IN OUT RDHUP)
   cmd #3 stp #4: do_rcv(5): ret=0
   cmd #3 stp #5: do_pol(5): ret=1 ev=0x2005 (IN OUT RDHUP)
   cmd #3 stp #6: do_snd(5): ret=3
   cmd #3 stp #7: do_lin(5): ret=0
   cmd #3 stp #8: do_clo(5): ret=0
   cmd #4 stp #1: do_pol(4): ret=1 ev=0x201d (IN OUT ERR HUP RDHUP)
   cmd #4 stp #2: do_rcv(4): ret=3
   cmd #4 stp #3: do_pol(4): ret=1 ev=0x201d (IN OUT ERR HUP RDHUP)
   cmd #4 stp #4: do_rcv(4): ret=-1 (Connection reset by peer)
   cmd #4 stp #5: do_pol(4): ret=1 ev=0x2015 (IN OUT HUP RDHUP)
   #### END ####

MINOR: fd: make updt_fd_polling() a normal function

It's called from many places, better use a real function than an inline.

MEDIUM: fd: simplify the fd_*_{recv,send} functions using BTS/BTR

Now that we don't have to update FD_EV_POLLED_* at the same time as
FD_EV_ACTIVE_*, we don't need to use a CAS anymore, a bit-test-and-set
operation is enough. Doing so reduces the code size by a bit more than
1 kB. One function was special, fd_done_recv(), whose comments and doc
were inaccurate for the part related to the lack of polling.

MEDIUM: fd: remove the FD_EV_POLLED status bit

Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.

BUG/MINOR: mux-h1: Fix a possible null pointer dereference in h1_subscribe()

This patch fixes the github issue #243. No backport needed.

BUG/MEDIUM: cache: Don't cache objects if the size of headers is too big

HTTP responses with headers than impinge upon the reserve must not be
cached. Otherwise, there is no warranty to have enough space to add the header
"Age" when such cached responses are delivered.

This patch must be backported to 2.0 and 1.9. For these versions, the same must
be done for the legacy HTTP mode.

BUG/MEDIUM: cache: Properly copy headers splitted on several shctx blocks

In the cache, huge HTTP headers will use several shctx blocks. When a response
is returned from the cache, these headers must be properly copied in the
corresponding HTX message by updating the pointer where to copied a header
part.

This patch must be backported to 2.0 and 1.9.

BUG/MINOR: mux-h1: Be sure to update the count before adding EOM after trailers

Otherwise, an EOM may be added in a full buffer.

This patch must be backported to 2.0.

BUG/MINOR: mux-h1: Don't stop anymore input processing when the max is reached

The loop is now stopped only when nothing else is consumed from the input buffer
or if a parsing error is encountered. This will let a chance to detect cases
when we fail to add the EOM.

For instance, when the max is reached after the headers parsing and all the
message is received. In this case, we may have the flag H1S_F_REOS set without
the flag H1S_F_APPEND_EOM and no pending input data, leading to an error because
we think it is an abort.

This patch must be backported to 2.0. This bug does not affect 1.9.

BUG/MINOR: mux-h1: Fix size evaluation of HTX messages after headers parsing

The block size of the start-line was not counted.

This patch must be backported to 2.0.

BUG/MINOR: h1: Properly reset h1m when parsing is restarted

Otherwise some processing may be performed twice. For instance, if the header
"Content-Length" is parsed on the first pass, when the parsing is restarted, we
skip it because we think another header with the same value was already seen. In
fact, it is currently the only existing bug that can be encountered. But it is
safer to reset all the h1m on restart to avoid any future bugs.

This patch must be backported to 2.0 and 1.9

BUG/MINOR: http-ana: Reset response flags when 1xx messages are handled

Otherwise, the following final response could inherit of some of these
flags. For instance, because informational responses have no body, the flag
HTTP_MSGF_BODYLESS is set for 1xx messages. If it is not reset, this flag will
be kept for the final response.

One of visible effect of this bug concerns the HTTP compression. When the final
response is preceded by an 1xx message, the compression is not performed. This
was reported in github issue #229.

This patch must be backported to 2.0 and 1.9. Note that the file http_ana.c does
not exist for these branches, the patch must be applied on proto_htx.c instead.

BUILD: connection: silence gcc warning with extra parentheses

Commit 8a4ffa0a ("MINOR: send-proxy-v2: sends authority TLV according
to TLV received") is missing parentheses around a variable assignment
used as condition in an if statement, and gcc isn't happy about it.

BUG/MEDIUM: peers: local peer socket not bound.

This bug came with 015e4d7 commit: "MINOR: stick-tables: Add peers process
binding computing" where the "stick" rules cases were missing when computing
the peer local listener process binding. At parsing time we store in the
stick-table struct ->proxies_list the proxies which refer to this stick-table.
The process binding is computed after having parsed the entire configuration file
with this simple loop in cfgparse.c:

     /* compute the required process bindings for the peers from <stktables_list>
      * for all the stick-tables, the ones coming with "peers" sections included.
      */
     for (t = stktables_list; t; t = t->next) {
             struct proxy *p;

             for (p = t->proxies_list; p; p = p->next_stkt_ref) {
                     if (t->peers.p && t->peers.p->peers_fe) {
                             t->peers.p->peers_fe->bind_proc |= p->bind_proc;
                     }
             }
     }

Note that if this process binding is not correctly initialized, the child forked
by the master-worker stops the peer local listener. Should be also the case
when daemonizing haproxy.

Must be backported to 2.0.

MINOR: build: add linux-glibc-legacy build TARGET

As discussed in issue #128, introduce a new build TARGET
linux-glibc-legacy to allow the build on old, legacy OS.

Should be backported to 2.0.

MINOR: send-proxy-v2: sends authority TLV according to TLV received

Since patch "7185b789", the authority TLV in a PROXYv2 header from a
client connection is stored. Authority TLV sends in PROXYv2 should be
taken into account to allow chaining PROXYv2 without droping it.

MEDIUM: log: add support for logging to a ring buffer

Now by prefixing a log server with "ring@<name>" it's possible to send
the logs to a ring buffer. One nice thing is that it allows multiple
sessions to consult the logs in real time in parallel over the CLI, and
without requiring file system access. At the moment, ring0 is created as
a default sink for tracing purposes and is available. No option is
provided to create new rings though this is trivial to add to the global
section.

MINOR: log: add a target type instead of hacking the address family

Instead of detecting an AF_UNSPEC address family for a log server and
to deduce a file descriptor, let's create a target type field and
explicitly mention that the socket is of type FD.

MEDIUM: log: use the new generic fd_write_frag_line() function

When logging to a file descriptor, we'd rather use the unified
fd_write_frag_line() which uses the FD's lock than perform the
writev() ourselves and use a per-server lock, because if several
loggers point to the same output (e.g. stdout) they are still
not locked and their logs may interleave. The function above
instead relies on the fd's lock so this is safer and will even
protect against concurrent accesses from other areas (e.g traces).
The function also deals with the FD's non-blocking mode so we do
not have to keep specific code for this anymore in the logs.

MINOR: fd/log/sink: make the non-blocking initialization depend on the initialized bit

Logs and sinks were resorting to dirty hacks to initialize an FD to
non-blocking mode. Now we have a bit for this in the fd tab so we can
do it on the fly on first use of the file descriptor. Previously it was
set per log server by writing value 1 to the port, or during a sink
initialization regardless of the usage of the fd.

MINOR: fd: add a new "initialized" bit in the fdtab struct

The purpose is to be able to remember that initialization was already
done for a file descriptor. This will allow to get rid of some dirty
hacks performed in the logs or fd sinks where the init state of the
fd has to be guessed.

CLEANUP: fd: remove leftovers of the fdcache

The "cache" entry was still present in the fdtab struct and it was
reported in "show sess". Removing it broke the cache-line alignment
on 64-bit machines which is important for threads, so it was fixed
by adding an attribute(aligned()) when threads are in use. Doing it
only in this case allows 32-bit thread-less platforms to see the
struct fit into 32 bytes.

BUG/MINOR: ring: b_peek_varint() returns a uint64_t, not a size_t

The difference matters when building on 32-bit architectures and a
warning was rightfully emitted.

No backport is needed.

BUG/MEDIUM: mux-h2/trace: fix missing braces added with traces

Ilya reported in issue #242 that h2c_handle_priority() was having
unreachable code... Obviously, I missed the braces around the "if",
leaving an unconditional return.

No backport is needed.

BUG/MEDIUM: mux-h2/trace: do not dereference h2c->conn after failed idle

In h2_detach(), if session_check_idle_conn() returns <0 we must not
dereference it since it has been freed.

No backport is needed.

MEDIUM: ring: implement a wait mode for watchers

Now it is possible for a reader to subscribe and wait for new events
sent to a ring buffer. When new events are written to a ring buffer,
the applets that are subscribed are woken up to display new events.
For now we only support this with the CLI applet called by "show events"
since the I/O handler is indeed a CLI I/O handler. But it's not
complicated to add other mechanisms to consume events and forward them
to external log servers for example. The wait mode is enabled by adding
"-w" after "show events <sink>". An extra "-n" was added to directly
seek to new events only.

MINOR: mux-h2/trace: report the connection pointer and state before FRAME_H

Initially we didn't report anything before FRAME_H but at least the
connection's pointer and its state are desirable.

MINOR: cli: extend the CLI context with a list and two offsets

Some CLI parsers are currently abusing the CLI context types such as
pointers to stuff longs into them by lack of room. But the context is
80 bytes while cli is only 48, thus there's some room left. This patch
adds a list element and two size_t usable as various offsets. The list
element is initialized.

BUG/MINOR: ring: fix the way watchers are counted

There are two problems with the way we count watchers on a ring:
  - the test for >=255 was accidently kept to 1 used during QA
  - if the producer deletes all data up to the reader's position
    and the reader is called, cannot write, and is called again,
    it will have a zero offset, causing it to initialize itself
    multiple times and each time adding a new refcount.

Let's change this and instead use ~0 as the initial offset as it's
not possible to have it as a valid one. We now store it into the
CLI context's size_t o0 instead of casting it to a void*.

No backport needed.

MINOR: trace: extend default event names to 12 chars

With "tx_settings" the 10-chars limit was already passed, thus it sounds
reasonable to push this slightly.

CLEANUP: mux-h2/trace: lower-case event names

I wanted to do it before pushing and forgot. It's easier to type lower-
case event names and more consistent with the "none" and "any" keywords.

CLEANUP: mux-h2/trace: reformat the "received" messages for better alignment

user-level traces are more readable when visually aligned. This is easily
done by writing "rcvd" instead of "received" to align with "sent" :

  $ socat - /tmp/sock1 <<< "show events buf0"
  [00|h2|0|mux_h2.c:2465] rcvd H2 request  : [1] H2 REQ: GET /?s=10k HTTP/2.0
  [00|h2|0|mux_h2.c:4563] sent H2 response : [1] H2 RES: HTTP/1.1 200

MINOR: mux-h2/trace: report h2s->id before h2c->dsi for the stream ID

h2c->dsi is only for demuxing, and needed while decoding a new request.
But if we already have a valid stream ID (e.g. response or outgoing
request), we should use it instead. This avoids seeing [0] in front of
the responses at user level.

MINOR: mux-h2/trace: always report the h2c/h2s state and flags

There's no limitation to just "state" trace level anymore, we're
expected to always show these internal states at verbosity levels
above "clean".

MINOR: mux-h2/trace: only decode the start-line at verbosity other than "minimal"

This is as documented in "trace h2 verbosity", level "minimal" only
features flags and doesn't perform any decoding at all, "simple" does,
just like "clean" which is the default for end uesrs.

MINOR: mux-h2/trace: add a new verbosity level "clean"

The "clean" output will be suitable for user and proto-level output
where the internal stuff (state, pointers, etc) is not desired but
just the basic protocol elements.

MINOR: mux-h2: add functions to convert an h2c/h2s state to a string

We need this all the time in traces, let's have it now. For the sake
of compact outputs, the strings are all 3-chars long. The "show fd"
output was improved to make use of this.

MEDIUM: mux-h2/trace: add lots of traces all over the code

All functions of the h2 data path were updated to receive one or multiple
TRACE() calls, at least one pair of TRACE_ENTER()/TRACE_LEAVE(), and those
manipulating protocol elements have been improved to report frame types,
special state transitions or detected errors. Even with careful tests, no
performance impact was measured when traces are disabled.

They are not completely exploited yet, the callback function tries to
dump a lot about them, but still doesn't provide buffer dumps, nor does
it indicate the stream or connection error codes.

The first argument is always set to the connection when known. The second
argument is set to the h2s when known, sometimes a 3rd argument is set to
a buffer, generally the rxbuf or htx, and occasionally the 4th argument
points to an integer (number of bytes read/sent, error code).

Retrieving a 10kB object produces roughly 240 lines when at developer
level, 35 lines at data level, 27 at state level, and 10 at proto level
and 2 at user level.

For now the headers are not dumped, but the start line are emitted in
each direction at user level.

The patch is marked medium because it touches lots of places, though
it takes care not to change the execution path.

MINOR: mux-h2/trace: add the default decoding callback

The new function h2_trace() is called when relevant by the trace subsystem
in order to provide extra information about the trace being produced. It
can for example display the connection pointer, the stream pointer, etc.
It is declared in the trace source as the default callback as we expect
it to be versatile enough to enrich most traces.

In addition, for requests and responses, if we have a buffer and we can
decode it as an HTX buffer, we can extract user-friendly information from
the start line.

MINOR: mux-h2/trace: register a new trace source with its events

For now the traces are not used. Supported events are categorized by
where the activity comes from (h2c, h2s, stream, etc), a direction
(send/recv/wake), and a list of possibilities for each of them (frame
types, errors, shut, ...). This results in ~50 different events that
try to cover a lot of possibilities when it's needed to filter on
something specific. Special events like protocol error are handled.
A few aggregate events like "rx_frame" or "tx_frame" are planed to
cover all frame types at once by being placed all the time with any
of the other ones.

We also state that the first argument is always the connection. This way
the trace subsystem will be able to safely retrieve some useful info, and
we'll still be able to get the h2c from there (conn->ctx) in a pretty
print function. The second argument will always be an h2s, and in order
to propose it for tracking, we add its description. We also define 4
verbosity levels, which seems more than enough.

MINOR: trace: change the detail_level to per-source verbosity

The detail level initially based on syslog levels is not used, while
something related is missing, trace verbosity, to indicate whether or
not we want to call the decoding callback and what level of decoding
we want (raw captures etc). Let's change the field to "verbosity" for
this. A verbosity of zero means that the decoding callback is not
called, and all other levels are handled by this callback and are
source-specific. The source is now prompted to list the levels that
are proposed to the user. When the source doesn't define anything,
"quiet" and "default" are available.

MINOR: trace: also report the trace level in the output

It's very convenient to know the trace level in the output, at least to
grep/grep -v on it. The main usage is to filter/filter out the developer
traces at level DEVEL. For this we now add the numeric level (0 to 4) just
after the source name in the brackets. The output now looks like this :

  [00|h2|4|mux_h2.c:3174] h2_send(): entering : h2c=0x27d75a0 st=2
   -, -, | ------------,  --------,  -------------------> message
    |  | |             |          '-> function name
    |  | |             '-> source file location
    |  | '-> trace level (4=dev)
    |  '-> trace source
    '-> thread number

MINOR: trace: prepend the function name for developer level traces

Working on adding traces to mux-h2 revealed that the function names are
manually copied a lot in developer traces. The reason is that they are
not preprocessor macros and as such cannot be concatenated. Let's
slightly adjust the trace() function call to take a function name just
after the file:line argument. This argument is only added for the
TRACE_DEVEL and 3 new TRACE_ENTER, TRACE_LEAVE, and TRACE_POINT macros
and left NULL for others. This way the function name is only reported
for traces aimed at the developers. The pretty-print callback was also
extended to benefit from this. This will also significantly shrink the
data segment as the "entering" and "leaving" strings will now be merged.

One technical point worth mentioning is that the function name is *not*
passed as an ist to the inline function because it's not considered as
a builtin constant by the compiler, and would lead to strlen() being
run on it from all call places before calling the inline function. Thus
instead we pass the const char * (that the compiler knows where to find)
and it's the __trace() function that converts it to an ist for internal
consumption and for the pretty-print callback. Doing this avoids losing
5-10% peak performance.

MINOR: trace: change the "payload" level to "data" and move it

The "payload" trace level was ambigous because its initial purpose was
to be able to dump received data. But it doesn't make sense to force to
report data transfers just to be able to report state changes. For
example, all snd_buf()/rcv_buf() operations coming from the application
layer should be tagged at this level. So here we move this payload level
above the state transitions and rename it to avoid the ambiguity making
one think it's only about request/response payload. Now it clearly is
about any data transfer and is thus just below the developer level. The
help messages on the CLI and the doc were slightly reworded to help
remove this ambiguity.

MINOR: trace: replace struct trace_lockon_args with struct name_desc

No need for a specific struct anymore, name_desc suits us.

MINOR: tools: add a generic struct "name_desc" for name-description pairs

In prompts on the CLI we now commonly need to propose a keyword name
and a description and it doesn't make sense to define a new struct for
each such pairs. Let's simply have a generic "name_desc" for this.

MINOR: connection: add the fc_pp_authority fetch -- authority TLV, from PROXYv2

Save the authority TLV in a PROXYv2 header from the client connection,
if present, and make it available as fc_pp_authority.

The fetch can be used, for example, to set the SNI for a backend TLS
connection.

MINOR: trace: change the TRACE() calling convention to put the args and cb last

Previously the callback was almost mandatory so it made sense to have it
before the message. Now that it can default to the one declared in the
trace source, most TRACE() calls contain series of empty args and callbacks,
which make them suitable for being at the end and being totally omitted.

This patch thus reverses the TRACE arguments so that the message appears
first, then the mask, then arg1..arg4, then the callback. In practice
we'll mostly see 1 arg, or 2 args and nothing else, and it will not be
needed anymore to pass long series of commas in the middle of the
arguments. However if a source is enforced, the empty commas will still
be needed for all omitted arguments.

MINOR: trace: show thread number and source name in the trace

Traces were missing the thread number and the source name, which was
really annoying. Now the thread number is emitted on two digits inside
the square brackets, followed by the source name then the line location,
each delimited with a vertical bar, such as below :

  [00|h2|mux_h2.c:2651] Notifying stream about SID change : h2c=0x7f3284581ae0 st=3 h2s=0x7f3284297f00 id=523 st=4
  [00|h2|mux_h2.c:2708] receiving H2 HEADERS frame : h2c=0x7f3284581ae0 st=3 dsi=525 (st=0)
  [02|h2|mux_h2.c:2194] Received H2 request : h2c=0x7f328d3d1ae0 st=2 : [525] H2 REQ: GET / HTTP/2.0
  [02|h2|mux_h2.c:2561] Expecting H2 frame header : h2c=0x7f328d3d1ae0 st=2

MINOR: trace: extend the source location to 13 chars

With 4-digit line numbers, this allows to emit up to 6 chars of file
name before extension, instead of 3 previously.

MINOR: trace: support a default callback for the source

It becomes apparent that most traces will use a single trace pretty
print callback, so let's allow the trace source to declare a default
one so that it can be omitted from trace calls, and will be used if
no other one is specified.

MINOR: sink: now report the number of dropped events on output

The principle is that when emitting a message, if some dropped events
were logged, we first attempt to report this counter before going
further. This is done under an exclusive lock while all logs are
produced under a shared lock. This ensures that the dropped line is
accurately reported and doesn't accidently arrive after a later
event.

MINOR: sink: implement "show events" to show supported sinks and dump the rings

The new "show events" CLI keyword lists supported event sinks. When
passed a buffer-type sink it completely dumps it.

no drops at all during attachment even at 8 millon evts/s.
still missing the attachment limit though.

MINOR: sink: add support for ring buffers

This now provides sink_new_buf() which allocates a ring buffer. One such
ring ("buf0") of 1 MB is created already, and may be used by sink_write().
The sink's creation should probably be moved somewhere else later.

MINOR: ring: add a generic CLI io_handler to dump a ring buffer

The three functions (attach, IO handler, and release) are meant to be
called by any CLI command which requires to dump the contents of a ring
buffer. We do not implement anything generic to dump any ring buffer on
the CLI since it's meant to be used by other functionalities above.
However these functions deal with locking and everything so it's trivial
to embed them in other code.

MINOR: ring: add a ring_write() function

This function tries to write to the ring buffer, possibly removing enough
old messages to make room for the new one. It takes two arrays of fragments
on input to ease the insertion of prefixes by the caller. It atomically
writes the message, possibly truncating it if desired, and returns the
operation's status.

MINOR: ring: add a new mechanism for retrieving/storing ring data in buffers

Our circular buffers are well suited for being used as ring buffers for
not-so-structured data. The machanism here consists in making room in a
buffer before inserting a new record which is prefixed by its size, and
looking up next record based on the previous one's offset and size. We
can have up to 255 consumers watching for data (dump in progress, tail)
which guarantee that entrees are not recycled while they're being dumped.
The complete representation is described in the header file. For now only
ring_new(), ring_resize() and ring_free() are created.

MINOR: sink: now call the generic fd write function

Let's not mess up with fd-specific code, locking nor message formating
here, and use the new generic function instead. This substantially
simplifies the sink_write() code and makes it more agnostic to the
output representation and storage.

MINOR: fd: add fd_write_frag_line() to send a fragmented line to an fd

Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.

MINOR: buffer: add functions to read/write varints from/to buffers

The new functions are :
  __b_put_varint() : inserts a varint when it's known that it fits
  b_put_varint()   : tries to insert a varint at the tail
  b_get_varint()   : tries to get a varint from the head
  b_peek_varint()  : tries to peek a varint at a specific offset

Wrapping is supported so that they are expected to be safe to use to
manipulate varints with buffers anywhere.

MINOR: tools: add a function varint_bytes() to report the size of a varint

It will sometimes be useful to encode varints to know the output size in
advance. Two versions are provided, one inline using a switch/case construct
which will be trivial for use with constants (and will be very fast albeit
huge) and one function iterating on the number which is 5 times smaller,
for use with variables.

MINOR: sink: set the fd-type sinks to non-blocking

Just like we used to do for the logs, we must disable blocking on FD
output except if it's a terminal.

MINOR: backend: Add srv_queue converter

The converter can be useful to look up a server queue from a dynamic value.

It takes an input value of type string, either a server name or
<backend>/<server> format and returns the number of queued sessions
on that server. Can be used in places where we want to look up
queued sessions from a dynamic name, like a cookie value (e.g.
req.cook(SRVID),srv_queue) and then make a decision to break
persistence or direct a request elsewhere.

Signed-off-by: Nenad Merdanovic <nmerdan@haproxy.com>

BUG/MEDIUM: url32 does not take the path part into account in the returned hash.

The url32 sample fetch does not take the path part of the URL into
account. This is because in smp_fetch_url32() we erroneously modify
path.len and path.ptr before testing their value and building the
path based part of the hash.

This fixes issue #235

This must be backported as far as 1.9, when HTX was introduced.

BUG/MEDIUM: listener/threads: fix an AB/BA locking issue in delete_listener()

The delete_listener() function takes the listener's lock before taking
the proto_lock, which is contrary to what other functions do, possibly
causing an AB/BA deadlock. In practice the two only places where both
are taken are during protocol_enable_all() and delete_listener(), the
former being used during startup and the latter during stop. In practice
during reload floods, it is technically possible for a thread to be
initializing the listeners while another one is stopping. While this
is too hard to trigger on 2.0 and above due to the synchronization of
all threads during startup, it's reasonably easy to do in 1.9 by having
hundreds of listeners, starting 64 threads and flooding them with reloads
like this :

$ while usleep 50000; do killall -USR2 haproxy; done

Usually in less than a minute, all threads will be deadlocked. The fix
consists in always taking the proto_lock before the listener lock. It
seems to be the only place where these two locks were reversed. This
fix needs to be backported to 2.0, 1.9, and 1.8.

BUG/MINOR: mworker: disable SIGPROF on re-exec

If haproxy is built with profiling enabled with -pg, it is possible to
see the master quit during a reload while it's re-executing itself with
error code 155 (signal 27) saying "Profile timer expired)". This happens
if the SIGPROF signal is delivered during the execve() call while the
handler was already unregistered. The issue itself is not directly inside
haproxy but it's easy to address. This patch disables this signal before
calling execvp() during a master reload. A simple test for this consists
in running this little script with haproxy started in master-worker mode :

$ while usleep 50000; do killall -USR2 haproxy; done

This fix should be backported to all versions using the master-worker
model.

DOC: fixed typo in management.txt

replaced fot -> for
added two periods

BUG/MEDIUM: mux-h1: do not report errors on transfers ending on buffer full

If a receipt ends with the HTX buffer full and everything is completed except
appending the HTX EOM block, we end up detecting an error because the H1
parser did not switch to H1_MSG_DONE yet while all conditions for an end of
stream and end of buffer are met. This can be detected by retrieving 31532
or 31533 chunk-encoded bytes over H1 and seeing haproxy log "SD--" at the
end of a successful transfer.

Ideally the EOM part should be totally independent on the H1 message state
since the block was really parsed and finished. So we should switch to a
last state requiring to send only EOM. However this needs a few risky
changes. This patch aims for simplicity and backport safety, thus it only
adds a flag to the H1 stream indicating that an EOM is still needed, and
excludes this condition from the ones used to detect end of processing. A
cleaner approach needs to be studied, either by adding a state before DONE
or by setting DONE once the various blocks are parsed and before trying to
send EOM.

This fix must be backported to 2.0. The issue does not seem to affect 1.9
though it is not yet known why, probably that it is related to the different
encoding of trailers which always leaves a bit of room to let EOM be stored.

BUG/MEDIUM: mux-h1: do not truncate trailing 0CRLF on buffer boundary

The H1 message parser calls the various message block parsers with an
offset indicating where in the buffer to start from, and only consumes
the data at the end of the parsing. The headers and trailers parsers
have a condition detecting if a headers or trailers block is too large
to fit into the buffer. This is detected by an incomplete block while
the buffer is full. Unfortunately it doesn't take into account the fact
that the block may be parsed after other blocks that are still present
in the buffer, resulting in aborting some transfers early as reported
in issue #231. This typically happens if a trailers block is incomplete
at the end of a buffer full of data, which typically happens with data
sizes multiple of the buffer size minus less than the trailers block
size. It also happens with the CRLF that follows the 0-sized chunk of
any transfer-encoded contents is itself on such a boundary since this
CRLF is technically part of the trailers block. This can be reproduced
by asking a server to retrieve exactly 31532 or 31533 bytes of static
data using chunked encoding with curl, which reports:

transfer closed with outstanding read data remaining

This issue was revealed in 2.0 and does not affect 1.9 because in 1.9
the trailers block was processed at once as part of the data block
processing, and would simply give up and wait for the rest of the data
to arrive.

It's interesting to note that the headers block parsing is also affected
by this issue but in practice it has a much more limited impact since a
headers block is normally only parsed at the beginning of a buffer. The
only case where it seems to matter is when dealing with a response buffer
full of 100-continue header blocks followed by a regular header block,
which will then be rejected for the same reason.

This fix must be backported to 2.0 and partially to 1.9 (the headers
block part).

BUILD: trace: make the lockon_ptr const to silence a warning without threads

I forgot to fix this one before pushing, despite my tests. lockon_ptr is
only used to compare pointers, it doesn't need to point to a writable
location. Without threads the atomic store is turned into an assignment
and rightfully complains.

DOC: management: document the "trace" and "show trace" commands

At the moment the subsystem is still not complete and the various modules
do not yet produce traces (some dirty experimental code for H2 exists) but
this aims at easing a broad adoption.

Among the missing elements, we can enumerate the lack of configuration
of the sinks (e.g. it's still not possible to change their output format
nor enable/disable timestamps) and since timestamps are not availalbe in
the sinks, they are not collected nor passed by the traces.

MINOR: trace: retrieve useful pointers and enforce lock-on

Now we try to find frontend, listener, backend, server, connection,
session, stream, from the presented argument of type connection,
stream or session. Various combinations and bounces allow to
retrieve most of them almost all the time. The extraction is
performed early so that we'll be able to apply filters later.

The lock-on is set if it was not there while the trace is running and
a valid pointer is available. If it was already set and doesn't match,
no trace is produced.

MINOR: trace: parse the "lock" argument to trace

When no criterion is provided, it carefully enumerates all available
ones, including those provided by the source itself. Otherwise it sets
the new criterion and resets the lockon pointer.

MINOR: trace: add the possibility to lock on some arguments

Given that we can pass typed arguments to the trace() function, let's
add provisions for tracking them. They are source-specific so we need
to let the source fill their name and description. Only those with a
non-null name will be proposed.

MINOR: trace: make sure to always stop the locking when stopping or pausing

When we stop or pause a trace (either on a matching event or by hand),
we must also stop the lock-on feature so that we don't follow any
further activity on this pointer even if it is recycled. For now this
is not exploited.

MINOR: trace: add a definition of typed arguments to trace()

With a few macros it's possible for a trace source to commit to only
using a certain type for a given argument (or set of). This will be
particularly useful to let the trace subsystem retrieve some precious
information such as a connection, session, listener, source address or
so, and enable/disable filtering and/or locking.

MINOR: trace: add per-level macros to produce traces

The new TRACE_<level>() macros take a mask, 4 args, a callback and a
static message. From this they also inherit the TRACE_SOURCE macro from
the caller, which contains the pointer to the trace source (so that it's
not required to paste it everywhere), and an ist string is also made by
the concatenation of the file name and the line number. This uses string
concatenation by the preprocessor, and turns it into an ist by the compiler
so that there is no operation at all to perform to adjust the data length
as the compiler knows where to cut during the optimization phase. Last,
the message is also automatically turned into an ist so that it's trivial
to put it into an iovec without having to run strlen() on it.

All arguments and the callback may be empty and will then automatically
be replaced with a NULL pointer. This makes the TRACE calls slightly
lighter especially since arguments are not always used. Several other
options were considered to use variadic macros but there's no outstanding
rule that justifies to place an argument before another one, and it still
looks convenient to have the message be the last one to encourage copy-
pasting of the trace statements.

A generic TRACE() macro takes TRACE_LEVEL in from the source file as the
trace level instead of taking it from its name. This may slightly simplify
the production of traces that always run at the same level (internal core
parts may probably only be called at developer level).

MINOR: trace: implement a call to a decode function

The trace() call will support an optional decoding callback and 4
arguments that this function is supposed to know how to use to provide
extra information. The output remains unchanged when the function is
NULL. Otherwise, the message is pre-filled into the thread-local
trace_buf, and the function is called with all arguments so that it
completes the buffer in a readable form depending on the expected
level of detail.

MINOR: trace: make trace() now also take a level in argument

This new "level" argument will allow the trace sources to label the
traces for different purposes, and filter out some of them if they
are not relevant to the current target. Right now we have 5 different
levels:
  - USER : the least verbose one, only a few functional information
  - PAYLOAD: like user but also displays some payload-related information
  - PROTO: focuses on the protocol's framing
  - STATE: also indicate state internal transitions or non-transitions
  - DEVELOPER: adds extra info about branches taken in the code (break
    points, return points)

MINOR: trace: add the file name and line number in the prefix

We now pass an extra argument "where" to the trace() call, which
is supposed to be an ist made of the concatenation of the filename
and the line number. We only keep the last 10 chars from this string
since the end of file names is most often easy to recognize. This
gives developers useful information at very low cost.

MINOR: trace: implement a very basic trace() function

For now it remains quite basic. It performs a few state checks, calls
the source's sink if defined, and performs the transitions between
RUNNING, STOPPED and WAITING when the configured events match.

MINOR: trace/cli: add "show trace" to report trace state and statistics

The new "show trace" CLI command lists available trace sources and
indicates their status, their sink, and number of dropped packets.
When "show trace <source>" is used, the list of known events is
also listed with their status per action (report/start/stop/pause).

MINOR: trace/cli: parse the "level" argument to configure the trace verbosity

The "level" keyword allows to indicate the expected level of verbosity
in the traces, among "user" (least verbose, just synthetic info) to
"developer" (very detailed, including function entry/leaving). It's only
displayed and set but not used yet.

MINOR: trace/cli: register the "trace" CLI keyword to list the sources

For now it lists the sources if one is not provided, and checks
for the source's existence. It lists the events if not provided,
checks for their existence if provided, and adjusts reported
events/start/stop/pause events, and performs state transitions.
It lists sinks and adjusts them as well. Filters, lock, and
level are not implemented yet.

MINOR: trace: add allocation of buffer-sized trace buffers

This will be needed so that we can implement protocol decoders which will
have to emit their contents into such a buffer.

MINOR: trace: start to create a new trace subsystem

The principle of this subsystem will be to support taking live traces
at various places in the code with conditional triggers, filters, and
ability to lock on some elements. The traces will support typed events
and will be sent into sinks made of ring buffers, file descriptors or
remote servers.