From: Willy Tarreau
Date: Wed, 28 Jul 2021 13:27:37 +0000 (+0200)
Subject: BUG/MEDIUM: connection: close a rare race between idle conn close and takeover
X-Git-Tag: v2.5-dev3~15
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=6ed242ece6b5bc5c99a10350ded1bdd6321dde8b;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: connection: close a rare race between idle conn close and takeover

The takeover of idle conns between threads is particularly tricky, for
two reasons:
  - there's no way to atomically synchronize kernel-side polling with
    userspace activity, so late events will always be reported for some
    FDs that were just migrated;
  - upon error, an FD may be immediately reassigned to whatever other
    thread since it's process-wide.

The current model uses the FD's thread_mask to figure out whether an FD
still ought to be reported or not, and a per-thread idle connection
queue from which eligible connections are atomically added/picked. I/Os
coming from the bottom for such a connection must remove it from the
list so that it's not elected. The same applies to timeout tasks and
iocbs, and these last ones check their context under the idle_conns
lock to judge whether they're still allowed to run.

One rare case was omitted: the wake() callback. This one is rare; it
may serve to notify about finalized connect() calls that are not being
polled, as well as about unhandled shutdowns and errors. This callback
was not protected until now because it wasn't seen as sensitive, but
there exists a particular case where it may be called without
protection in parallel to a takeover. This happens in the following
sequence:
  - thread T1 wants to establish an outgoing connection
  - the connect() call returns EINPROGRESS
  - the poller adds the FD using epoll_ctl()
  - epoll_wait() reports it and connect() is done. The connection is no
    longer marked as actively polled but is still known to the poller.
  - the request is sent over that connection using send(), which queues
    to system buffers while data are being delivered
  - the scheduler switches to other tasks
  - the request is physically sent
  - the server responds
  - the stream is notified that send() succeeded, makes progress, and
    tries to recv() from that connection
  - the recv() succeeds and the response is delivered
  - the poller doesn't need to be touched (still no active polling)
  - the scheduler switches to other tasks
  - the server closes the connection
  - the poller on T1 is notified of the SHUTR and starts to call
    mux->wake()
  - another thread T2 takes over the connection
  - T1 continues to run inside wake() and releases the connection
  - T2 is left dereferencing a freed connection
  - BAM.

The most logical solution here is to surround the call to wake() with
an atomic removal/reinsertion of the connection from/into the idle
conns lists. This way, wake() is guaranteed to run alone. Any other
poller reporting the FD will not have its tid_bit in the thread_mask
and thus will not bother it. Another thread attempting a takeover will
not find this connection. A task or tasklet woken up late will either
run on the same thread, or be called on another one with a NULL context
as the consequence of a previous successful takeover, and will be
ignored. Note that the extra cost of a lock and a tree access here is
low and totally amortized, given that they roughly happen 1-2 times per
connection at best.
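To make the reasoning easier to follow, here is a minimal,
self-contained model of the scheme. This is a sketch only, not HAProxy
code: the per-thread ebtree and spinlock are replaced by a plain linked
list and a mutex, and all names (try_takeover(), notify_wake(), etc.)
are hypothetical. The point it demonstrates is that once the connection
is unlinked under the lock, the takeover side cannot observe it while
wake() runs:

    #include <pthread.h>
    #include <stdio.h>

    struct conn {
            struct conn *next;   /* link in the shared idle list */
            int owner_tid;       /* thread currently owning the conn */
    };

    static pthread_mutex_t idle_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct conn *idle_head;  /* stands for the per-thread idle tree */

    /* Unlink <c> from the idle list; must be called with idle_lock held. */
    static void idle_unlink(struct conn *c)
    {
            struct conn **p = &idle_head;

            while (*p && *p != c)
                    p = &(*p)->next;
            if (*p)
                    *p = c->next;
    }

    /* Takeover side: only connections still present in the list are
     * eligible, so anything unlinked before wake() is invisible here.
     */
    static struct conn *try_takeover(int new_tid)
    {
            struct conn *c;

            pthread_mutex_lock(&idle_lock);
            c = idle_head;
            if (c) {
                    idle_head = c->next;  /* unlink: exclusively ours now */
                    c->owner_tid = new_tid;
            }
            pthread_mutex_unlock(&idle_lock);
            return c;
    }

    /* Wake side: unlink before calling wake() so no takeover can see the
     * connection while the callback runs, then reinsert it if the
     * callback did not release it (a negative return means "released").
     */
    static int notify_wake(struct conn *c, int (*wake)(struct conn *))
    {
            int ret;

            pthread_mutex_lock(&idle_lock);
            idle_unlink(c);
            pthread_mutex_unlock(&idle_lock);

            ret = wake(c);        /* guaranteed to run alone on this conn */
            if (ret < 0)
                    return ret;   /* the conn was freed by the callback */

            pthread_mutex_lock(&idle_lock);
            c->next = idle_head;  /* still valid: park it back in the list */
            idle_head = c;
            pthread_mutex_unlock(&idle_lock);
            return ret;
    }

    static int dummy_wake(struct conn *c)
    {
            printf("wake() ran alone on conn owned by tid %d\n", c->owner_tid);
            return 0;             /* the conn is kept alive */
    }

    int main(void)
    {
            struct conn c = { .next = NULL, .owner_tid = 1 };

            idle_head = &c;               /* conn parked in the idle list */
            notify_wake(&c, dummy_wake);  /* invisible to takeover meanwhile */
            printf("takeover found it afterwards: %s\n",
                   try_takeover(2) ? "yes" : "no");
            return 0;
    }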
While it was possible to crash the process after 10-100k requests using
H2 and a hand-refined configuration achieving perfect synchronism
between a long (20+) chain of proxies and a short timeout (1ms), with
this fix it never happens even after 10M requests.

Many thanks to Olivier for proposing this solution and explaining why
it works.

This should be backported as far as 2.2 (when inter-thread takeover was
introduced). In older versions the code will be found in
conn_fd_handler(). A workaround consists in disabling inter-thread pool
sharing using:

    tune.idle-pool.shared off
---

diff --git a/include/haproxy/connection.h b/include/haproxy/connection.h
index f766578f9c..77afb2bb00 100644
--- a/include/haproxy/connection.h
+++ b/include/haproxy/connection.h
@@ -247,6 +247,16 @@ static inline int conn_notify_mux(struct connection *conn, int old_flags, int fo
  * - end of HANDSHAKE with CONNECTED set
  * - regular data layer activity
  *
+ * One tricky case is the wake up on read0 or error on an idle
+ * backend connection, that can happen on a connection that is still
+ * polled while at the same moment another thread is about to perform a
+ * takeover. The solution against this is to remove the connection from
+ * the idle list if it was in it, and possibly reinsert it at the end
+ * if the connection remains valid. The cost is non-null (locked tree
+ * removal) but remains low given that this is extremely rarely called.
+ * In any case it's guaranteed by the FD's thread_mask that we're
+ * called from the same thread the connection is queued in.
+ *
  * Note that the wake callback is allowed to release the connection and
  * the fd (and return < 0 in this case).
  */
@@ -254,9 +264,28 @@ static inline int conn_notify_mux(struct connection *conn, int old_flags, int fo
 	    ((conn->flags ^ old_flags) & CO_FL_NOTIFY_DONE) ||
 	    ((old_flags & CO_FL_WAIT_XPRT) && !(conn->flags & CO_FL_WAIT_XPRT))) &&
 	    conn->mux && conn->mux->wake) {
+		uint conn_in_list = conn->flags & CO_FL_LIST_MASK;
+		struct server *srv = objt_server(conn->target);
+
+		if (conn_in_list) {
+			HA_SPIN_LOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+			conn_delete_from_tree(&conn->hash_node->node);
+			HA_SPIN_UNLOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+		}
+
 		ret = conn->mux->wake(conn);
 		if (ret < 0)
 			goto done;
+
+		if (conn_in_list) {
+			struct eb_root *root = (conn_in_list == CO_FL_SAFE_LIST) ?
+			                       &srv->per_thr[tid].safe_conns :
+			                       &srv->per_thr[tid].idle_conns;
+
+			HA_SPIN_LOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+			ebmb_insert(root, &conn->hash_node->node, sizeof(conn->hash_node->hash));
+			HA_SPIN_UNLOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+		}
 	}
  done:
 	return ret;
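
For reference, a minimal example of the workaround mentioned above for
versions without this fix: tune.idle-pool.shared belongs in the global
section of the configuration (the surrounding listener/backend setup is
omitted here):

    global
        tune.idle-pool.shared off

This keeps idle server connections private to the thread that opened
them, so the takeover path the race depends on is never exercised, at
the cost of somewhat lower connection reuse across threads.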