Willy Tarreau [Thu, 22 Feb 2018 13:05:55 +0000 (14:05 +0100)]
BUG/MINOR: pools/threads: don't ignore DEBUG_UAF on double-word CAS capable archs
Since commit cf975d4 ("MINOR: pools/threads: Implement lockless memory
pools."), we support lockless pools. However the parts dedicated to
detecting use-after-free are not present in this part, making DEBUG_UAF
useless in this situation.
The present patch sets a new define CONFIG_HAP_LOCKLESS_POOLS when such
a compatible architecture is detected, and when pool debugging is not
requested, then makes use of this everywhere in pools and buffers
functions. This way enabling DEBUG_UAF will automatically disable the
lockless version.
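The selection logic described above can be sketched with the preprocessor. This is only an illustration: SKETCH_HAVE_CAS_DW and sketch_pools_are_lockless() are hypothetical names made up for the example, and the architecture test is an assumption, not the project's actual detection code.

```c
/* SKETCH_HAVE_CAS_DW is a hypothetical stand-in for the build's own
 * "double-word CAS is available" indicator; it is not a real HAProxy name. */
#if defined(__x86_64__) || defined(__aarch64__)
#define SKETCH_HAVE_CAS_DW 1
#endif

/* Only select the lockless pools when the architecture can do it AND
 * use-after-free debugging was not requested at build time. */
#if defined(SKETCH_HAVE_CAS_DW) && !defined(DEBUG_UAF)
#define CONFIG_HAP_LOCKLESS_POOLS 1
#endif

/* Returns 1 when the lockless pool code would be compiled in, else 0. */
static inline int sketch_pools_are_lockless(void)
{
#ifdef CONFIG_HAP_LOCKLESS_POOLS
	return 1;
#else
	return 0;
#endif
}
```

Building the same file with -DDEBUG_UAF makes the helper return 0 on every architecture, which is the behaviour the commit message describes.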
Tim Duesterhus [Mon, 19 Feb 2018 23:49:46 +0000 (00:49 +0100)]
CLEANUP: pools: Remove unused end label in memory.h
This removes the end label from memory.h.
The labels are unused as of cf975d46bca2515056a4f55e55fedbbc7b4eda59
which is unreleased (and incidentally the first commit containing
those labels, thus they never have been used).
Tim Duesterhus [Mon, 19 Feb 2018 23:49:43 +0000 (00:49 +0100)]
CLEANUP: cfgparse: Remove unused label end
This removes the end label from parse_process_number() which
is unused since 5ab51775e736511b7e54f42e080dcef76a284da9, which
first was released in haproxy 1.8.0.
Returns true when the back connection was made over an SSL/TLS transport
layer and the newly created SSL session was resumed using a cached
session or a TLS ticket.
BUG/MEDIUM: http: Switch the HTTP response in tunnel mode as earlier as possible
When the body length is undefined (no Content-Length or Transfer-Encoding
headers), the response remains in ending mode, waiting for the request to
finish. Most of the time this is not a problem because the request is done
before the response. But when a client sends data to a server that replies
without waiting for all the data, it is really not desirable to wait for the
end of the request to finish the response.
This bug was introduced when the tunneling of the request and the response was
refactored, in commit 4be980391 ("MINOR: http: Switch requests/responses in
TUNNEL mode only by checking txn flag").
BUG/MEDIUM: ssl: Shutdown the connection for reading on SSL_ERROR_SYSCALL
When SSL_read returns SSL_ERROR_SYSCALL and errno is unset or set to EAGAIN, the
connection must be shut down for reading. Else, the connection loops infinitely,
consuming all the CPU.
The bug was introduced in the commit 7e2e50500 ("BUG/MEDIUM: ssl: Don't always
treat SSL_ERROR_SYSCALL as unrecovarable."). This patch must be backported in
1.8 too.
Willy Tarreau [Mon, 19 Feb 2018 14:34:12 +0000 (15:34 +0100)]
MINOR: sample: add a new "concat" converter
It's always a pain not to be able to combine variables. This commit
introduces the "concat" converter, which appends a delimiter, a variable's
contents and another delimiter to an existing string. The result is a string.
This makes it easier to build composite variables made of other variables.
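The append semantics described above (delimiter, variable contents, delimiter, all added to an existing string) can be sketched in plain C. sketch_concat() is a hypothetical helper written for this illustration; it is not the converter's implementation and it knows nothing about HAProxy samples or variables.

```c
#include <stdio.h>
#include <string.h>

/* Appends <start>, <var> and <end> to the NUL-terminated string in <buf>.
 * Returns 0 on success, -1 if the result would not fit in <size> bytes,
 * in which case <buf> is left unchanged. */
static int sketch_concat(char *buf, size_t size, const char *start,
                         const char *var, const char *end)
{
	size_t len = strlen(buf);
	int ret;

	if (len >= size)
		return -1;

	ret = snprintf(buf + len, size - len, "%s%s%s", start, var, end);
	if (ret < 0 || (size_t)ret >= size - len) {
		buf[len] = '\0'; /* drop the partial output */
		return -1;
	}
	return 0;
}
```

Calling it twice on the same buffer chains two variables, which is how a composite value made of other variables would be built.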
BUG/MINOR: ssl/threads: Make management of the TLS ticket keys files thread-safe
A TLS ticket keys file can be updated on the CLI and used at the same time. So we
need to protect it to be sure all accesses are thread-safe. Because updates are
infrequent, a R/W lock has been used.
Olivier Houchard [Tue, 13 Feb 2018 14:17:23 +0000 (15:17 +0100)]
BUG/MEDIUM: ssl: Don't always treat SSL_ERROR_SYSCALL as unrecovarable.
Bart Geesink reported some random errors appearing under the form of
termination flags SD in the logs for connections involving SSL traffic
to reach the servers.
Tomek Gacek and Mateusz Malek finally narrowed down the problem to commit c2aae74 ("MEDIUM: ssl: Handle early data with OpenSSL 1.1.1"). It happens
that the special case of SSL_ERROR_SYSCALL isn't handled anymore since
this commit.
SSL_read() might return <= 0, and SSL_get_error() return SSL_ERROR_SYSCALL,
without meaning the connection is gone. Before flagging the connection
as in error, check the errno value.
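The errno check can be sketched as a small decision function. The enum and function names are hypothetical, and the errno values treated as non-fatal (unset or EAGAIN) are an assumption taken from the follow-up fix listed earlier in this log, not a copy of the patch. The function is deliberately free of OpenSSL calls so that it stands alone.

```c
#include <errno.h>

/* Hypothetical verdicts for a read that ended with SSL_ERROR_SYSCALL. */
enum sketch_read_verdict {
	SKETCH_READ_NOT_FATAL, /* do not flag the connection as in error */
	SKETCH_READ_FATAL      /* a real error: flag the connection */
};

/* <saved_errno> is the errno value captured right after the failed read,
 * before any other call had a chance to overwrite it. */
static enum sketch_read_verdict sketch_classify_syscall_error(int saved_errno)
{
	if (saved_errno == 0 || saved_errno == EAGAIN)
		return SKETCH_READ_NOT_FATAL;
	return SKETCH_READ_FATAL;
}
```

The caller still has to decide what "not fatal" means for the connection, for example retrying later or shutting it down for reading.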
Willy Tarreau [Wed, 14 Feb 2018 13:16:28 +0000 (14:16 +0100)]
BUG/MEDIUM: threads: fix the double CAS implementation for ARMv7
Commit f61f0cb ("MINOR: threads: Introduce double-width CAS on x86_64
and arm.") introduced the double CAS. But the ARMv7 version is bogus,
it uses the value of the pointers instead of dereferencing them. When
lucky, it simply doesn't build due to impossible registers combinations.
Otherwise it will immediately crash at run time when facing traffic.
No backport is needed, this bug was introduced in 1.9-dev.
BUG/MINOR: fd/threads: properly lock the FD before adding it to the fd cache.
It was believed that it was useless to lock the "prev" field when adding a
FD. However, if there's only one element in the FD cache, and that element
removes itself from the fd cache, and another FD is added before the first
add completed, there's a risk of losing elements. To prevent that, lock the
"prev" field, so that such a removal will wait until the add completed.
Then haproxy complains :
[WARNING] 334/150131 (23086) : config : frontend 'GLOBAL' has no
'bind' directive. Please declare it as a backend if this was intended.
This is because of the check for a bind-less frontend (the global section
creates a frontend for the stats). There's no clean fix for this one, so
here we're simply checking that the frontend is not the global stats one
before emitting the warning.
This patch should be backported to all stable versions.
Willy Tarreau [Tue, 6 Feb 2018 11:00:27 +0000 (12:00 +0100)]
BUILD: fd/threads: fix breakage build breakage without threads
The last fix for the volatile dereference made use of pl_deref_int()
which is unknown when building without threads. Let's simply open-code
it instead. No backport needed.
Chris Lane [Mon, 5 Feb 2018 23:15:44 +0000 (23:15 +0000)]
MINOR: init: emit warning when -sf/-sd cannot parse argument
Previously, -sf and -sd command line parsing used atol which cannot
detect errors. I had a problem where I was doing -sf "$pid1 $pid2 $pid"
and it was sending the gracefully terminate signal only to the first pid.
The change uses strtol and checks endptr and errno to see if the parsing
worked. It will exit when the pid list is not parsed.
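A minimal sketch of the strtol-based parsing described above. sketch_parse_pid() is a hypothetical helper, and rejecting zero and negative values is an extra assumption of this example rather than something the commit message states.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses <arg> as a single decimal pid.
 * Returns 0 and stores the value in <pid_out> on success, -1 otherwise. */
static int sketch_parse_pid(const char *arg, long *pid_out)
{
	char *endptr;
	long val;

	errno = 0;
	val = strtol(arg, &endptr, 10);

	if (errno != 0)        /* out of range (ERANGE) */
		return -1;
	if (endptr == arg)     /* no digit was consumed */
		return -1;
	if (*endptr != '\0')   /* trailing characters, e.g. "12 34" */
		return -1;
	if (val <= 0)          /* assumption: only positive pids are valid */
		return -1;

	*pid_out = val;
	return 0;
}
```

With atol, an argument such as "12 34" silently yields 12; here the trailing characters are detected through endptr and the argument is rejected.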
Tim Duesterhus [Sun, 21 Jan 2018 21:11:17 +0000 (22:11 +0100)]
BUG/MEDIUM: standard: Fix memory leak in str2ip2()
An haproxy compiled with:
> make -j4 all TARGET=linux2628 USE_GETADDRINFO=1
And running with a configuration like this:
    defaults
        log global
        mode http
        option httplog
        option dontlognull
        timeout connect 5000
        timeout client 50000
        timeout server 50000
    frontend fe
        bind :::8080 v4v6
        default_backend be
    backend be
        server s example.com:80 check
Will leak memory inside `str2ip2()`, because the list `result` is not
properly freed in success cases:
==18875== 140 (76 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 87 of 111
==18875== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18875== by 0x537A565: gaih_inet (getaddrinfo.c:1223)
==18875== by 0x537DD5D: getaddrinfo (getaddrinfo.c:2425)
==18875== by 0x4868E5: str2ip2 (standard.c:733)
==18875== by 0x43F28B: srv_set_addr_via_libc (server.c:3767)
==18875== by 0x43F50A: srv_iterate_initaddr (server.c:3879)
==18875== by 0x43F50A: srv_init_addr (server.c:3944)
==18875== by 0x475B30: init (haproxy.c:1595)
==18875== by 0x40406D: main (haproxy.c:2479)
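The class of leak reported above comes from returning on the success path without releasing the list allocated by getaddrinfo(). The sketch below shows the pattern with the release in place. sketch_resolve_ipv4() is a hypothetical, IPv4-only helper and is not the str2ip2() code.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>

/* Resolves <host> to an IPv4 address.
 * Returns 0 and fills <out> on success, -1 on failure. */
static int sketch_resolve_ipv4(const char *host, struct sockaddr_in *out)
{
	struct addrinfo hints, *result = NULL;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;

	if (getaddrinfo(host, NULL, &hints, &result) != 0)
		return -1; /* nothing to free when the call fails */

	if (result == NULL)
		return -1;

	/* copy what we need, then release the whole list: the success
	 * path is exactly where the free is easy to forget */
	memcpy(out, result->ai_addr, sizeof(*out));
	freeaddrinfo(result);
	return 0;
}
```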
Willy Tarreau [Mon, 5 Feb 2018 19:11:38 +0000 (20:11 +0100)]
BUG/MINOR: time/threads: ensure the adjusted time is always correct
In the time offset calculation loop, we ensure we only commit the new
date once it's further in the future than the current one. However there
is a small issue here on 32-bit platforms : if global_now is written in
two cycles by another thread, starting with the tv_sec part, and the
current thread reads it in the middle of a change, it may compute a
wrong "adjusted" value on the first round, with the new (larger) tv_sec
and the old (large) tv_usec. This will be detected as the CAS will fail,
and another attempt will be made, but this time possibly with too large
an adjusted value, pushing the date further than needed (at worst almost
one second).
This patch addresses this by using a temporary adjusted time in the loop
that always restarts from the last known one, and by assigning the result
to the final value only once the CAS succeeds.
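The "temporary value recomputed on each round, committed only once the CAS succeeds" pattern can be sketched with C11 atomics. All names are hypothetical. Packing the date into a single 64-bit word is an assumption of this sketch and sidesteps the two-cycle write described above, so this shows the shape of the loop rather than the 32-bit issue itself.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global date, packed as (seconds << 32) | microseconds so
 * that ordering two dates is a single integer comparison. */
static _Atomic uint64_t sketch_global_now;

static uint64_t sketch_pack(uint32_t sec, uint32_t usec)
{
	return ((uint64_t)sec << 32) | usec;
}

/* Publishes <candidate> as the global date unless the global date is
 * already newer. Returns the date that was finally committed. */
static uint64_t sketch_commit_date(uint64_t candidate)
{
	uint64_t old_now = atomic_load(&sketch_global_now);
	uint64_t tmp;

	do {
		/* always restart from the caller's last known value, never
		 * from a value derived during a failed round */
		tmp = candidate;
		if (tmp < old_now)
			tmp = old_now;
		/* on failure, old_now is refreshed and we go around again */
	} while (!atomic_compare_exchange_weak(&sketch_global_now, &old_now, tmp));

	return tmp; /* only now may the caller adopt it as its adjusted date */
}
```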
The impact is very limited, it may cause the time to advance in small
jumps on 32 bit platforms and in the worst case some timeouts might
expire 1 second too early.
Willy Tarreau [Mon, 5 Feb 2018 16:52:24 +0000 (17:52 +0100)]
MINOR: fd: reorder fd_add_to_fd_list()
The function was cleaned up a bit from duplicated parts inherited from
the initial attempt at getting it to work. It's a bit smaller and cleaner
this way.
Willy Tarreau [Mon, 5 Feb 2018 18:43:30 +0000 (19:43 +0100)]
BUG/MINOR: fd/threads: properly dereference fdcache as volatile
In fd_rm_from_fd_list(), we have loops waiting for another change to
complete, in case we don't have support for a double CAS. But these
ones fail to place a compiler barrier or to dereference the fdcache
as a volatile, resulting in an endless loop on the first collision,
which is visible when run on MIPS32.
Willy Tarreau [Wed, 17 Jan 2018 20:25:57 +0000 (21:25 +0100)]
MAJOR: fd: compute the new fd polling state out of the fd lock
Each fd_{may|cant|stop|want}_{recv|send} function sets or resets a
single bit at once, then recomputes the need for updates, and then
the new cache state. Later, pollers will compute the new polling
state based on the resulting operations here. In fact the conditions
are so simple that they can be performed by a single "if", or sometimes
even optimized away.
This means that in practice a simple compare-and-swap operation is often
enough to set the new value including the new polling state, and that only
the cache and fdupdt have to be performed under the lock. Better, for the
most common operations (fd_may_{recv,send}, used by the pollers), a simple
atomic OR is needed.
This patch does this for the fd_* functions above and it doesn't yet
remove the now useless fd_compute_new_polling_status() because it's still
used by other pollers. A pure connection rate test shows a 1% performance
increase.
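The two cases mentioned above (a plain atomic OR for the "may" operations, a compare-and-swap when the polling state must be derived at the same time) can be sketched with C11 atomics. The bit names and function names are hypothetical and do not reflect the real fd state layout.

```c
#include <stdatomic.h>

/* Hypothetical state bits for the receive direction. */
#define SK_EV_ACTIVE_R 0x01 /* the upper layer wants to receive */
#define SK_EV_READY_R  0x02 /* the fd is believed to be readable */
#define SK_EV_POLLED_R 0x04 /* the poller must watch the fd for reads */

/* "may receive": setting one bit needs nothing more than an atomic OR. */
static void sketch_may_recv(_Atomic unsigned char *state)
{
	atomic_fetch_or(state, SK_EV_READY_R);
}

/* "can't receive": clear READY and, if the fd is still active, request
 * polling, all in one compare-and-swap so no lock is needed. */
static void sketch_cant_recv(_Atomic unsigned char *state)
{
	unsigned char old = atomic_load(state);
	unsigned char next;

	do {
		if (!(old & SK_EV_READY_R))
			return; /* nothing to change */
		next = (unsigned char)(old & ~SK_EV_READY_R);
		if (next & SK_EV_ACTIVE_R)
			next |= SK_EV_POLLED_R;
	} while (!atomic_compare_exchange_weak(state, &old, next));
}
```

In this sketch the new polling state is part of the value being swapped, which is why the condition fits in a single "if".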
Olivier Houchard [Wed, 31 Jan 2018 17:07:29 +0000 (18:07 +0100)]
MEDIUM: fd/threads: Make sure we don't miss a fd cache entry.
An fd cache entry might be removed and added at the end of the list while
another thread is parsing it. If that happens, we may miss fd cache entries.
To avoid that, add a new field in the struct fdtab, "added_mask", which
contains a mask for potentially affected threads. If it is set, the
corresponding thread will set its bit in fd_cache_mask, to avoid waiting in
poll while it may have more work to do.
Olivier Houchard [Wed, 24 Jan 2018 17:17:56 +0000 (18:17 +0100)]
MAJOR: fd/threads: Make the fdcache mostly lockless.
Create a local, per-thread, fdcache, for file descriptors that only belong
to one thread, and make the global fd cache mostly lockless, as we can get
a lot of contention on the fd cache lock.
Olivier Houchard [Wed, 29 Nov 2017 18:51:19 +0000 (19:51 +0100)]
MINOR: early data: Never remove the CO_FL_EARLY_DATA flag.
It may be useful to keep the CO_FL_EARLY_DATA flag, so that we know early
data were used. So instead of removing it, only add the Early-data header,
and have the sample fetch ssl_fc_has_early return 1 if CO_FL_EARLY_DATA is
set and if the handshake isn't done yet.
Olivier Houchard [Mon, 27 Nov 2017 17:41:32 +0000 (18:41 +0100)]
MINOR: early data: Don't rely on CO_FL_EARLY_DATA to wake up streams.
Instead of looking for CO_FL_EARLY_DATA to know if we have to try to wake
up a stream because it is waiting for an SSL handshake, add a new
conn_stream flag, CS_FL_WAIT_FOR_HS. This way we don't have to rely on
CO_FL_EARLY_DATA, and we will only wake streams that are actually waiting.
MINOR: spoe: Add max-waiting-frames directive in spoe-agent configuration
This is the maximum number of frames waiting for an acknowledgement on the same
connection. This value is only used when the pipelined or asynchronous exchanges
between HAProxy and SPOA are enabled. By default, it is set to 20.
MEDIUM: spoe: Use an ebtree to manage idle applets
Instead of using a list of applets with idle ones in front, we now use an
ebtree. Applets in the tree are idle by definition. And the key is the applet's
weight. When a new frame is queued, the first idle applet (with the lowest
weight) is woken up and its weight is increased by one. And when an applet sends
a frame to a SPOA, its weight is decremented by one.
This is empirical, but it should avoid overusing a very small number of applets
and improve the balancing between idle applets.
MINOR: spoe: Count the number of frames waiting for an ack for each applet
So it is easier to respect the max_fpa value. This is no longer the maximum
number of frames processed by an applet at each loop but the maximum number of
frames waiting for an ack for a specific applet.
The function spoe_handle_processing_appctx has been rewritten accordingly.
MINOR: spoe: Replace sending_rate by a frequency counter
sending_rate was a counter used to evaluate the SPOE capacity to process
frames. Because it was not really accurate, it has been replaced by a frequency
counter representing the number of frames handled by the SPOE per second. We
just check this counter is higher than the number of streams waiting for a
reply. If not, a new applet is created.
MINOR: spoe: Remove check on min_applets number when a SPOE context is queued
The calculation of a minimal number of active applets was really empirical and
finally useless. On heavy load, there are always many active applets (most of
time, more than the minimal required) and when the load is low, there is no
reason to keep unused applets opened.
Because of this change, the flag SPOE_APPCTX_FL_PERSIST is now unused. So it has
been removed.
BUG/MEDIUM: spoe: Allow producer to read and to forward shutdown on request side
This is mandatory to correctly set the right timeout on the stream. Otherwise
the client timeout is never set, so only the SPOE processing timeout will be
evaluated. If it is not defined (i.e. infinity), the stream can be blocked for a
while, waiting for the SPOA reply. Of course, it is not a good idea to leave the
SPOE processing timeout undefined, but it can happen.
BUG/MEDIUM: spoe: Always try to receive or send the frame to detect shutdowns
Before, we checked if the buffer was allocated or not to avoid sending or
receiving a frame. This was done to not call ci_putblk or co_getblk if there is
nothing to do. But the checks on the buffers are also done in these
functions. So this is not mandatory here. But in these functions, the channel
state is also checked, so an error is returned if it is closed. By skipping the
call, we also skip the checks on the channel state, delaying shutdowns
detection.
Now, we always try to send or receive a frame. So if the corresponding channel
is closed, we can immediately handle the error.
Emmanuel Hocdet [Thu, 1 Feb 2018 14:20:32 +0000 (15:20 +0100)]
MINOR: introduce proxy-v2-options for send-proxy-v2
Proxy protocol v2 can transport a lot of optional information. To avoid a
send-proxy-v2-* explosion, this patch introduces the proxy-v2-options parameter
and will allow writing: "send-proxy-v2 proxy-v2-options ssl,cert-cn".
Lukas Tribus [Thu, 1 Feb 2018 22:58:59 +0000 (23:58 +0100)]
DOC: don't suggest using http-server-close
Remove the old suggestion to use http-server-close mode, from the
beginnings of keep-alive mode in commit 16bfb021 ("MINOR: config: add
option http-keep-alive").
We made http-keep-alive default in commit 70dffdaa ("MAJOR: http:
switch to keep-alive mode by default").
Willy Tarreau [Wed, 31 Jan 2018 08:49:29 +0000 (09:49 +0100)]
BUG/MINOR: epoll/threads: only call epoll_ctl(DEL) on polled FDs
Commit d9e7e36 ("BUG/MEDIUM: epoll/threads: use one epoll_fd per thread")
addressed an issue with the polling and required that cloned FDs are removed
from all polling threads on close. But in fact it does it for all bound
threads, some of which may not necessarily poll the FD. This is harmless,
but it may also make it harder later to deal with FD migration between
threads. Better use polled_mask which only reports threads still aware
of the FD instead of thread_mask.
BUG/MINOR: threads: Update labels array because of changes in lock_label enum
Recent changes to the enum were not synchronized with the lock debugging
code. Now we use a switch/case instead of an array so that the compiler
throws a warning if there is any inconsistency.
To be backported to 1.8 (at least to add the START entry).
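The switch/case approach can be sketched as follows. The enum values and names are hypothetical and much shorter than the real list. The point is that, with no default case, a compiler running with -Wswitch (part of -Wall on gcc and clang) reports any enum value that was added without a matching label.

```c
/* Hypothetical, shortened lock label enum. */
enum sketch_lock_label {
	SK_LOCK_FD,
	SK_LOCK_TASK,
	SK_LOCK_START,
	SK_LOCK_LAST /* must remain last */
};

/* Maps a label to its name. There is intentionally no "default:" so that
 * the compiler can warn when a new enum value is not handled here. */
static const char *sketch_lock_name(enum sketch_lock_label label)
{
	switch (label) {
	case SK_LOCK_FD:    return "FD";
	case SK_LOCK_TASK:  return "TASK";
	case SK_LOCK_START: return "START";
	case SK_LOCK_LAST:  break; /* not a real lock */
	}
	return "UNKNOWN";
}
```

With an array of strings indexed by the enum, a missing entry is only noticed at run time, as a wrong or NULL label.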
Willy Tarreau [Thu, 25 Jan 2018 06:22:13 +0000 (07:22 +0100)]
MINOR: fd: pass the iocb and owner to fd_insert()
fd_insert() is currently called just after setting the owner and iocb,
but proceeding like this prevents the operation from being atomic and
requires a lock to protect the maxfd computation in another thread from
meeting an incompletely initialized FD and computing a wrong maxfd.
Fortunately for now all fdtab[].owner are set before calling fd_insert(),
and the first lock in fd_insert() enforces a memory barrier so the code
is safe.
This patch moves the initialization of the owner and iocb to fd_insert()
so that the function will be able to properly arrange its operations and
remain safe even when modified to become lockless. There's no other change
beyond the internal API.
Willy Tarreau [Thu, 25 Jan 2018 16:11:33 +0000 (17:11 +0100)]
MEDIUM: poll: don't use the old FD state anymore
The polling updates are now performed exactly like the epoll/kqueue
ones : only the new polled state is considered, and the previous one
is checked using polled_mask. The only specific stuff here is that
the fd state is shared between all threads, so an FD removal has to
be done only once.
Willy Tarreau [Thu, 25 Jan 2018 16:09:33 +0000 (17:09 +0100)]
MEDIUM: select: don't use the old FD state anymore
The polling updates are now performed exactly like the epoll/kqueue
ones : only the new polled state is considered, and the previous one
is checked using polled_mask. The only specific stuff here is that
the fd state is shared between all threads, so an FD removal has to
be done only once.
Willy Tarreau [Thu, 25 Jan 2018 15:48:46 +0000 (16:48 +0100)]
MEDIUM: select: make use of hap_fd_* functions
Given that FD_{CLR,SET} are not always guaranteed to be thread safe,
let's fall back to using the hap_fd_* functions as we used to till
1.5-dev18 and as poll() continues to use. This will make it easier
to remove the poll_lock.
Willy Tarreau [Thu, 25 Jan 2018 15:37:04 +0000 (16:37 +0100)]
MINOR: fd: move the hap_fd_{clr,set,isset} functions to fd.h
These functions were created for poll() in 1.5-dev18 (commit 80da05a4) to
replace the previous FD_{CLR,SET,ISSET} that were shared with select()
because some libcs enforce a limit on FD_SET. But FD_SET doesn't seem
to be universally MT-safe, requiring locks in the select() code that
are not needed in the poll code. So let's move back to the initial
situation where we used to only use bit fields, since that has been in
use since day one without a problem, and let's use these hap_fd_*
functions instead of FD_*.
This patch only moves the functions to fd.h and revives hap_fd_isset()
that was recently removed to kill an "unused" warning.
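A minimal sketch of bit-field helpers in the spirit of the hap_fd_* functions. The sketch_ names and bodies below are written for illustration and are not copied from fd.h. Unlike FD_SET(), the caller sizes the array itself, so no libc limit applies, but the caller is also responsible for keeping fd within bounds.

```c
#include <limits.h>

/* Number of bits held by one array element. */
#define SK_BITS_PER_WORD (sizeof(unsigned int) * CHAR_BIT)

/* Sets the bit for <fd> in <evts>. <fd> must be non-negative and must
 * fit in the array provided by the caller. */
static inline void sketch_fd_set(int fd, unsigned int *evts)
{
	evts[fd / SK_BITS_PER_WORD] |= 1U << (fd % SK_BITS_PER_WORD);
}

/* Clears the bit for <fd> in <evts>. */
static inline void sketch_fd_clr(int fd, unsigned int *evts)
{
	evts[fd / SK_BITS_PER_WORD] &= ~(1U << (fd % SK_BITS_PER_WORD));
}

/* Returns non-zero if the bit for <fd> is set in <evts>. */
static inline unsigned int sketch_fd_isset(int fd, const unsigned int *evts)
{
	return evts[fd / SK_BITS_PER_WORD] & (1U << (fd % SK_BITS_PER_WORD));
}
```

These helpers use plain read-modify-write operations, so on their own they are no more thread-safe than FD_SET(); concurrent users would still need atomic operations or a lock around them.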
Willy Tarreau [Mon, 29 Jan 2018 14:56:24 +0000 (15:56 +0100)]
MINOR: poll: more accurately compute the new maxfd in the loop
Last commit 173d995 ("MEDIUM: polling: start to move maxfd computation
to the pollers") moved the maxfd computation to the polling loop, but
it still adds an entry when removing an fd, forcing the next loop to
seek from further away than necessary. Let's only update the max when
actually adding an entry.
Willy Tarreau [Fri, 26 Jan 2018 20:48:23 +0000 (21:48 +0100)]
MEDIUM: polling: start to move maxfd computation to the pollers
Since only select() and poll() still make use of maxfd, let's move
its computation right there in the pollers themselves, and only
during each fd update pass. The computation doesn't need a lock
anymore, only a few atomic ops. It will be accurate, be done much
less often and will not be required anymore in the FD's fast path.
This provides a small performance increase of about 1% in connection
rate when using epoll since we get rid of this computation which was
performed under a lock.
Willy Tarreau [Mon, 29 Jan 2018 14:06:04 +0000 (15:06 +0100)]
MINOR: fd: don't report maxfd in alert messages
The listeners and connectors may complain that process-wide or
system-wide FD limits have been reached and will in this case report
maxfd as the limit. This is wrong in fact since there's no reason for
the whole FD space to be contiguous when the total # of FD is reached.
A better approach would consist in reporting the accurate number of
opened FDs, but this is pointless as what matters here is to give a
hint about what might be wrong. So let's simply report the configured
maxsock, which will generally explain why the process' limits were
reached, which is the most common reason. This removes another
dependency on maxfd.
Willy Tarreau [Mon, 29 Jan 2018 13:58:02 +0000 (14:58 +0100)]
MINOR: polling: make epoll and kqueue not depend on maxfd anymore
Maxfd is really only useful to poll() and select(), yet epoll and
kqueue reference it almost by mistake :
- cloning of the initial FDs (maxsock should be used here)
- max polled events, it's maxpollevents which should be used here.
Willy Tarreau [Mon, 29 Jan 2018 14:17:05 +0000 (15:17 +0100)]
BUG/MINOR: cli: use global.maxsock and not maxfd to list all FDs
The "show fd" command on the CLI doesn't list the last FD in use since
it doesn't include maxfd. We don't need to use maxfd here anyway as
global.maxsock will do the job pretty well and removes this dependency.
This patch may be backported to 1.8.
Tim Duesterhus [Thu, 25 Jan 2018 15:24:51 +0000 (16:24 +0100)]
MEDIUM: sample: Add IPv6 support to the ipmask converter
Add an optional second parameter to the ipmask converter that specifies
the number of bits to mask off IPv6 addresses.
If the second parameter is not given IPv6 addresses fail to mask (resulting
in an empty string), preserving backwards compatibility: Previously
a sample like `src,ipmask(24)` failed to give a result for IPv6 addresses.
This feature can be tested like this:
    defaults
        log global
        mode http
        option httplog
        option dontlognull
        timeout connect 5000
        timeout client 50000
        timeout server 50000
    frontend fe
        bind :::8080 v4v6
        # Masked IPv4 for IPv4, empty for IPv6 (with and without this commit)
        http-response set-header Test %[src,ipmask(24)]
        # Correctly masked IP addresses for both IPv4 and IPv6
        http-response set-header Test2 %[src,ipmask(24,ffff:ffff:ffff:ffff::)]
        # Correctly masked IP addresses for both IPv4 and IPv6
        http-response set-header Test3 %[src,ipmask(24,64)]
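What masking an IPv6 address to a given number of bits means can be sketched in a few lines of C. sketch_mask_ipv6() is a hypothetical helper written for this illustration and is not the converter's code.

```c
#include <netinet/in.h>

/* Keeps the first <bits> bits of <addr> and clears all the others.
 * Values of <bits> above 128 leave the address untouched. */
static void sketch_mask_ipv6(struct in6_addr *addr, unsigned int bits)
{
	unsigned int i;

	for (i = 0; i < 16; i++) {
		if (bits >= 8) {
			bits -= 8; /* this byte is kept entirely */
			continue;
		}
		/* keep the top <bits> bits of this byte (possibly none) */
		addr->s6_addr[i] &= (unsigned char)(0xFF00U >> bits);
		bits = 0; /* all following bytes are cleared */
	}
}
```

Masking 2001:db8:1234:5678:9abc:def0:1234:5678 to 64 bits with this helper keeps the first four groups and zeroes the rest, which corresponds to the ffff:ffff:ffff:ffff:: mask used in the Test2 line.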
BUILD: epoll/threads: Add test on MAX_THREADS to avoid warnings when complied without threads
When HAProxy is compiled without threads, gcc throws the following warnings:
src/ev_epoll.c:222:3: warning: array subscript is outside array bounds [-Warray-bounds]
...
src/ev_epoll.c:199:11: warning: array subscript is outside array bounds [-Warray-bounds]
...
Of course, this is not a bug. In such a case, tid is always equal to 0. But to
avoid the noise, a check on MAX_THREADS in "if (tid)" lines makes gcc happy.
This patch should be backported in 1.8 with the commit d9e7e36c ("BUG/MEDIUM:
epoll/threads: use one epoll_fd per thread").
MINOR: threads: Use __decl_hathreads instead of #ifdef/#endif
A #ifdef/#endif on USE_THREAD was added in the commit 0048dd04 ("MINOR: threads:
Fix build when we're not compiling with threads.") to conditionally define the
start_lock variable, because HA_SPINLOCK_T is only defined when HAProxy is
compiled with threads.
In fact, to do that, we should use the macro __decl_hathreads instead.
If commit 0048dd04 is backported in 1.8, this one can also be backported.
BUG/MEDIUM: checks: Don't try to release undefined conn_stream when a check is freed
When a health check is released, the attached conn_stream may be undefined. For
instance, this happens when the 'no-check' option is used on a server line. So we
must check it is defined before trying to release it.
BUG/MEDIUM: threads/server: Fix deadlock in srv_set_stopping/srv_set_admin_flag
Because of a typo (HA_SPIN_LOCK instead of HA_SPIN_UNLOCK), there is a deadlock
in srv_set_stopping and srv_set_admin_flag when there is at least one tracker.
Willy Tarreau [Thu, 25 Jan 2018 06:28:37 +0000 (07:28 +0100)]
BUG/MINOR: threads: always set an owner to the thread_sync pipe
The owner of the fd used by the synchronization pipe was set to NULL,
making it ignored by maxfd computation. The risk would be that some
synchronization events get delayed between threads when using poll()
or select(). However this is only theoretical since the pipe is created
before listeners are bound so normally its FD should be lower and
this should normally not happen. The only possible situation would
be if all listeners are bound to inherited FDs which are lower than
the pipe's.
Olivier Houchard [Wed, 24 Jan 2018 14:41:04 +0000 (15:41 +0100)]
MINOR: threads: Fix build when we're not compiling with threads.
Only declare the start_lock if threads are compiled in, otherwise
HA_SPINLOCK_T won't be defined.
This should be backported to 1.8 when/if 1605c7ae6154d8c2cfcf3b325872b1a7266c5bc2 is backported.
Willy Tarreau [Tue, 23 Jan 2018 18:01:49 +0000 (19:01 +0100)]
BUG/MEDIUM: threads/mworker: fix a race on startup
Marc Fournier reported an interesting case when using threads with the
master-worker mode : sometimes, a listener would have its FD closed
during startup. Sometimes it could even be health checks seeing this.
What happens is that after the threads are created, and the pollers
enabled on each thread, the master-worker pipe is registered, and at
the same time a close() is performed on the write side of this pipe
since the children must not use it.
But since this is replicated in every thread, what happens is that the
first thread closes the pipe, thus releases the FD, and the next thread
starting a listener in parallel gets this FD reassigned. Then another
thread closes the FD again, which this time corresponds to the listener.
It can also happen with the health check sockets if they're started
early enough.
This patch splits the mworker_pipe_register() function in two, so that
the close() of the write side of the FD is performed very early after the
fork() and long before threads are created (we don't need to delay it
anyway). Only the pipe registration is done in the threaded code since
it is important that the pollers are properly allocated for this.
The mworker_pipe_register() function now takes care of registering the
pipe only once, and this is guaranteed by a new surrounding lock.
The call to protocol_enable_all() looks fragile in theory since it
scans the list of proxies and their listeners, though in practice
all threads scan the same list and take the same locks for each
listener so it's not possible that any of them escapes the process
and finishes before all listeners are started. And the operation is
idempotent.
This fix must be backported to 1.8. Thanks to Marc for providing very
detailed traces clearly showing the problem.
Willy Tarreau [Fri, 19 Jan 2018 07:56:14 +0000 (08:56 +0100)]
BUG/MEDIUM: kqueue/threads: use one kqueue_fd per thread
This is the same principle as the previous patch (BUG/MEDIUM:
epoll/threads: use one epoll_fd per thread) except that this time it's
for kqueue. We don't want all threads to wake up because of activity on
a single other thread that the other ones are not interested in.
Just like with previous patch, this one shows that the polling state
doesn't need to be changed here and that some simplifications are now
possible. This patch only implements the minimum required for a stable
backport.
Willy Tarreau [Thu, 18 Jan 2018 18:16:02 +0000 (19:16 +0100)]
BUG/MEDIUM: epoll/threads: use one epoll_fd per thread
There currently is a problem regarding epoll(). While select() and poll()
compute their polling state on the fly upon each call, epoll() keeps a
shared state between all threads via the epoll_fd. The problem is that
once an fd is registered on *any* thread, all other threads receive
events for that FD as well. It is clearly visible when binding a listener
to a single thread like in the configuration below where all 4 threads
will work, 3 of them simply spinning to skip the event :
global
nbthread 4
frontend foo
bind :1234 process 1/1
The worst case happens when some slow operations are in progress on a
busy thread, preventing it from processing its task and causing the
other ones to wake up not being able to do anything with this event.
Typically computing a large TLS key will delay processing of next
events on the same thread while others will still wake up.
All this simply shows that the poller must remain thread-specific, with
its own events and its own ability to sleep when it doesn't have anything
to do.
This patch does exactly this. For this, it proceeds like this :
- have one epoll_fd per thread instead of one per process
- initialize these epoll_fd when threads are created.
- mark all known FDs as updated so that the next invocation of
_do_poll() recomputes their polling status (including a possible
removal of undesired polling from the original FD) ;
- use each fd's polled_mask to maintain an accurate status of
the current polling activity for this FD.
- when scanning updates, only focus on events whose new polling
status differs from the existing one
- during updates, always verify the thread_mask to resist migration
- on __fd_clo(), for cloned FDs (typically listeners inherited
from the parent during a graceful shutdown), run epoll_ctl(DEL)
on all epoll_fd. This is the reason why epoll_fd is stored in a
shared array and not in a thread_local storage. Note: maybe this
can be moved to an update instead.
Interestingly, this shows that we don't need the FD's old state anymore
and that we only use it to convert it to the new state based on stable
information. It appears clearly that the FD code can be further improved
by computing the final state directly when manipulating it.
With this change, the config above goes from 22000 cps at 380% CPU to
43000 cps at 100% CPU : not only the 3 unused threads are not activated,
but they do not disturb the activity anymore.
The output of "show activity" before and after the patch on a 4-thread
config where a first listener on thread 2 forwards over SSL to threads
3 & 4 shows a much smaller amount of undesired events (thread 1
doesn't wake up anymore, poll_skip remains zero, fd_skip stays low).
This patch presents a few risks but fixes a real problem with threads,
and as such it needs to be backported to 1.8. It depends on previous patch
("MINOR: fd: add a bitmask to indicate that an FD is known by the poller").
Special thanks go to Samuel Reed for providing a large amount of useful
debugging information and for testing fixes.
Willy Tarreau [Wed, 17 Jan 2018 17:44:46 +0000 (18:44 +0100)]
MINOR: fd: add a bitmask to indicate that an FD is known by the poller
Some pollers like epoll() need to know if the fd is already known or
not in order to compute the operation to perform (add, mod, del). For
now this is performed based on the difference between the previous FD
state and the new state but this will not be usable anymore once threads
become responsible for their own polling.
Here we come with a different approach : a bitmask is stored with the
fd to indicate which pollers already know it, and the pollers will be
able to simply perform the add/mod/del operations based on this bit
combined with the new state.
This patch only adds the bitmask declaration and initialization, it
is not yet used. It will be needed by the next two fixes and will
need to be backported to 1.8.
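How a poller could derive the add/mod/del operation from this bit combined with the new state can be sketched as follows. The enum, the function and its parameters are hypothetical. The mask is updated here with plain operations, whereas real multi-threaded code would need atomic ones.

```c
/* Hypothetical operations a poller may have to perform on an fd. */
enum sketch_poll_op { SK_OP_NONE, SK_OP_ADD, SK_OP_MOD, SK_OP_DEL };

/* <polled_mask> has one bit per poller that already knows the fd,
 * <my_bit> identifies the calling poller, and <want_polling> is non-zero
 * when the new state asks for the fd to be polled. */
static enum sketch_poll_op sketch_compute_op(unsigned long *polled_mask,
                                             unsigned long my_bit,
                                             int want_polling)
{
	if (*polled_mask & my_bit) {
		/* the poller already knows this fd */
		if (!want_polling) {
			*polled_mask &= ~my_bit;
			return SK_OP_DEL;
		}
		return SK_OP_MOD;
	}

	/* the poller does not know this fd yet */
	if (want_polling) {
		*polled_mask |= my_bit;
		return SK_OP_ADD;
	}
	return SK_OP_NONE;
}
```

Nothing in this decision depends on the previous fd state, only on the bit and on the new state.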
Willy Tarreau [Sat, 20 Jan 2018 22:53:50 +0000 (23:53 +0100)]
BUG/MEDIUM: fd: maintain a per-thread update mask
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
BUG/MEDIUM: threads/polling: Use fd_cache_mask instead of fd_cache_num
fd_cache_num is the number of FDs in the FD cache. It is a global variable. So
it is underoptimized because we may be led to consider there are waiting FDs
for the current thread in the FD cache while in fact all FDs are assigned to the
other threads. So, in such cases, the polling loop will be evaluated many more
times than necessary.
Instead, we now check if the thread id is set in the bitfield fd_cache_mask.
[wt: it's not exactly a bug, rather a design limitation of the thread
which was not addressed in time for the 1.8 release. It can appear more
often than we initially predicted, when more threads are running than
the number of assigned CPU cores, or when certain threads spend
milliseconds computing crypto keys while other threads spin on
epoll_wait(0)=0]
MINOR: threads/fd: Use a bitfield to know if there are FDs for a thread in the FD cache
A bitfield has been added to know if there are some FDs processable by a
specific thread in the FD cache. When a FD is inserted in the FD cache, the bits
corresponding to its thread_mask are set. On each thread, the bitfield is
updated when the FD cache is processed. If there is no FD processed, the thread
is removed from the bitfield by unsetting its tid_bit.
Note that this bitfield is updated but not checked in
fd_process_cached_events. So, when this function is called, the FDs cache is
always processed.
[wt: should be backported to 1.8 as it will help fix a design limitation]
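The bookkeeping described above can be sketched with C11 atomics. The names are hypothetical; only the idea is taken from the text: set the bits of the fd's thread_mask on insertion, and let a thread clear its own bit when it processed nothing.

```c
#include <stdatomic.h>

/* Hypothetical mask: bit N is set when thread N may have work waiting
 * in the fd cache. */
static _Atomic unsigned long sketch_fd_cache_mask;

/* Called when an fd is inserted in the cache: flag every thread allowed
 * to process it. */
static void sketch_cache_insert(unsigned long thread_mask)
{
	atomic_fetch_or(&sketch_fd_cache_mask, thread_mask);
}

/* Called by a thread after it scanned the cache: when nothing was
 * processed, the thread removes itself from the mask. */
static void sketch_cache_processed(unsigned long tid_bit, int nb_processed)
{
	if (nb_processed == 0)
		atomic_fetch_and(&sketch_fd_cache_mask, ~tid_bit);
}

/* Lets the polling loop decide whether this thread may sleep in the
 * poller or must first look at the cache. */
static int sketch_thread_has_cached_fds(unsigned long tid_bit)
{
	return (atomic_load(&sketch_fd_cache_mask) & tid_bit) != 0;
}
```

A global counter cannot answer the last question per thread, which is the limitation the fd_cache_num replacement addresses.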
Willy Tarreau [Sat, 20 Jan 2018 18:30:13 +0000 (19:30 +0100)]
MINOR: global: add some global activity counters to help debugging
A number of counters have been added at special places to help better
understand certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.