From: Willy Tarreau
Date: Wed, 28 Jul 2021 13:27:37 +0000 (+0200)
Subject: BUG/MEDIUM: connection: close a rare race between idle conn close and takeover
X-Git-Tag: v2.5-dev3~15
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=6ed242ece6b5bc5c99a10350ded1bdd6321dde8b;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: connection: close a rare race between idle conn close and takeover

The takeover of idle conns between threads is particularly tricky, for
two reasons:
  - there's no way to atomically synchronize kernel-side polling with
    userspace activity, so late events will always be reported for some
    FDs that were just migrated;
  - upon error, an FD may be immediately reassigned to whatever other
    thread since it's process-wide.

The current model uses the FD's thread_mask to figure out whether an FD
still ought to be reported or not, and a per-thread idle connection
queue from which eligible connections are atomically added/picked. I/Os
coming from the bottom for such a connection must remove it from the
list so that it's not elected. The same applies to timeout tasks and
iocbs, and these last ones check their context under the idle_conns
lock to judge whether they're still allowed to run.

One rare case was omitted: the wake() callback. This one is rare; it
may serve to notify about finalized connect() calls that are not being
polled, as well as about unhandled shutdowns and errors. This callback
was not protected until now because it wasn't seen as sensitive, but
there exists a particular case where it may be called without
protection in parallel to a takeover. This happens in the following
sequence:
  - thread T1 wants to establish an outgoing connection
  - the connect() call returns EINPROGRESS
  - the poller adds the FD using epoll_ctl()
  - epoll_wait() reports it and connect() is done. The connection is no
    longer marked as actively polled but is still known to the poller.
  - the request is sent over that connection using send(), which queues
    to system buffers while data are being delivered
  - the scheduler switches to other tasks
  - the request is physically sent
  - the server responds
  - the stream is notified that send() succeeded, makes progress, and
    tries to recv() from that connection
  - the recv() succeeds and the response is delivered
  - the poller doesn't need to be touched (still no active polling)
  - the scheduler switches to other tasks
  - the server closes the connection
  - the poller on T1 is notified of the SHUTR and starts to call
    mux->wake()
  - another thread T2 takes over the connection
  - T1 continues to run inside wake() and releases the connection
  - T2 is left dereferencing a freed connection
  - BAM.

The most logical solution here is to surround the call to wake() with
an atomic removal/reinsertion of the connection from/into the idle
conns lists. This way, wake() is guaranteed to run alone. Any other
poller reporting the FD will not have its tid_bit in the thread_mask
and thus will not bother it. Another thread attempting a takeover will
not find this connection. A task or tasklet woken up late will either
run on the same thread, or be called on another one with a NULL context
as the consequence of a previous successful takeover, and will be
ignored. Note that the extra cost of a lock and a tree access here is
low and totally amortized, given that they roughly happen 1-2 times per
connection at best.
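To make the reasoning easier to follow, here is a minimal,
self-contained model of the scheme. This is a sketch only, not HAProxy
code: the per-thread ebtree and spinlock are replaced by a plain linked
list and a mutex, and all names (try_takeover(), notify_wake(), etc.)
are hypothetical. The point it demonstrates is that once the connection
is unlinked under the lock, the takeover side cannot observe it while
wake() runs:

    #include <pthread.h>
    #include <stdio.h>

    struct conn {
            struct conn *next;   /* link in the shared idle list */
            int owner_tid;       /* thread currently owning the conn */
    };

    static pthread_mutex_t idle_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct conn *idle_head;  /* stands for the per-thread idle tree */

    /* Unlink <c> from the idle list; must be called with idle_lock held. */
    static void idle_unlink(struct conn *c)
    {
            struct conn **p = &idle_head;

            while (*p && *p != c)
                    p = &(*p)->next;
            if (*p)
                    *p = c->next;
    }

    /* Takeover side: only connections still present in the list are
     * eligible, so anything unlinked before wake() is invisible here.
     */
    static struct conn *try_takeover(int new_tid)
    {
            struct conn *c;

            pthread_mutex_lock(&idle_lock);
            c = idle_head;
            if (c) {
                    idle_head = c->next;  /* unlink: exclusively ours now */
                    c->owner_tid = new_tid;
            }
            pthread_mutex_unlock(&idle_lock);
            return c;
    }

    /* Wake side: unlink before calling wake() so no takeover can see the
     * connection while the callback runs, then reinsert it if the
     * callback did not release it (a negative return means "released").
     */
    static int notify_wake(struct conn *c, int (*wake)(struct conn *))
    {
            int ret;

            pthread_mutex_lock(&idle_lock);
            idle_unlink(c);
            pthread_mutex_unlock(&idle_lock);

            ret = wake(c);        /* guaranteed to run alone on this conn */
            if (ret < 0)
                    return ret;   /* the conn was freed by the callback */

            pthread_mutex_lock(&idle_lock);
            c->next = idle_head;  /* still valid: park it back in the list */
            idle_head = c;
            pthread_mutex_unlock(&idle_lock);
            return ret;
    }

    static int dummy_wake(struct conn *c)
    {
            printf("wake() ran alone on conn owned by tid %d\n", c->owner_tid);
            return 0;             /* the conn is kept alive */
    }

    int main(void)
    {
            struct conn c = { .next = NULL, .owner_tid = 1 };

            idle_head = &c;               /* conn parked in the idle list */
            notify_wake(&c, dummy_wake);  /* invisible to takeover meanwhile */
            printf("takeover found it afterwards: %s\n",
                   try_takeover(2) ? "yes" : "no");
            return 0;
    }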
While it was possible to crash the process after 10-100k requests using
H2 and a hand-refined configuration achieving perfect synchronism
between a long (20+) chain of proxies and a short timeout (1ms), with
this fix it never happens even after 10M requests.

Many thanks to Olivier for proposing this solution and explaining why
it works.

This should be backported as far as 2.2 (when inter-thread takeover was
introduced). In older versions the code will be found in
conn_fd_handler(). A workaround consists in disabling inter-thread pool
sharing using:

    tune.idle-pool.shared off
---

diff --git a/include/haproxy/connection.h b/include/haproxy/connection.h
index f766578f9c..77afb2bb00 100644
--- a/include/haproxy/connection.h
+++ b/include/haproxy/connection.h
@@ -247,6 +247,16 @@ static inline int conn_notify_mux(struct connection *conn, int old_flags, int fo
  * - end of HANDSHAKE with CONNECTED set
  * - regular data layer activity
  *
+ * One tricky case is the wake up on read0 or error on an idle
+ * backend connection, that can happen on a connection that is still
+ * polled while at the same moment another thread is about to perform a
+ * takeover. The solution against this is to remove the connection from
+ * the idle list if it was in it, and possibly reinsert it at the end
+ * if the connection remains valid. The cost is non-null (locked tree
+ * removal) but remains low given that this is extremely rarely called.
+ * In any case it's guaranteed by the FD's thread_mask that we're
+ * called from the same thread the connection is queued in.
+ *
  * Note that the wake callback is allowed to release the connection and
  * the fd (and return < 0 in this case).
  */
@@ -254,9 +264,28 @@ static inline int conn_notify_mux(struct connection *conn, int old_flags, int fo
 	    ((conn->flags ^ old_flags) & CO_FL_NOTIFY_DONE) ||
 	    ((old_flags & CO_FL_WAIT_XPRT) && !(conn->flags & CO_FL_WAIT_XPRT))) &&
 	    conn->mux && conn->mux->wake) {
+		uint conn_in_list = conn->flags & CO_FL_LIST_MASK;
+		struct server *srv = objt_server(conn->target);
+
+		if (conn_in_list) {
+			HA_SPIN_LOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+			conn_delete_from_tree(&conn->hash_node->node);
+			HA_SPIN_UNLOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+		}
+
 		ret = conn->mux->wake(conn);
 		if (ret < 0)
 			goto done;
+
+		if (conn_in_list) {
+			struct eb_root *root = (conn_in_list == CO_FL_SAFE_LIST) ?
+			                       &srv->per_thr[tid].safe_conns :
+			                       &srv->per_thr[tid].idle_conns;
+
+			HA_SPIN_LOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+			ebmb_insert(root, &conn->hash_node->node, sizeof(conn->hash_node->hash));
+			HA_SPIN_UNLOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
+		}
 	}
  done:
 	return ret;
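
For reference, a minimal example of the workaround mentioned above for
versions without this fix: tune.idle-pool.shared belongs in the global
section of the configuration (the surrounding listener/backend setup is
omitted here):

    global
        tune.idle-pool.shared off

This keeps idle server connections private to the thread that opened
them, so the takeover path the race depends on is never exercised, at
the cost of somewhat lower connection reuse across threads.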