From: Willy Tarreau
Date: Wed, 16 Oct 2024 16:08:39 +0000 (+0200)
Subject: BUG/MEDIUM: queue: make sure never to queue when there's no more served conns
X-Git-Tag: v3.1-dev10~62
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=ca275d99ce02e72d707fc87da133d739cdda5146;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: queue: make sure never to queue when there's no more served conns

Since commit 53f52e67a0 ("BUG/MEDIUM: queue: always dequeue the backend
when redistributing the last server"), we've again received two reports
showing the theoretically impossible condition in pendconn_add(),
including a single-threaded one.

Thanks to the traces, the issue could be tracked down to the redispatch
part. In fact, with non-deterministic LB algorithms (RR, LC, FAS), we
don't perform the LB if there are pending connections in the backend,
since that indicates that previous attempts already failed, so we
directly return SRV_STATUS_FULL. And contrary to a previous belief, it
is possible to meet this condition with be->served==0 when
redispatching (and likely with maxconn not greater than the number of
threads).

The problem is that in this case the entry is queued, then
pendconn_must_try_again() checks whether any connections are currently
being served to detect whether we missed a race, and tries again. But
that situation is not caused by a concurrent thread and will never fix
itself, resulting in the loop.

All that part around pendconn_must_try_again() is still quite brittle,
and a safer approach would involve a sequence counter to detect new
arrivals and dequeues during the pendconn_add() call. But that is more
sensitive work, probably for a later fix.

This fix must be backported wherever the fix above was backported.
Thanks to Patrick Hemmer, as well as Damien Claisse and Basha
Mougamadou from Criteo, for their help on tracking this one!
---

diff --git a/src/backend.c b/src/backend.c
index b2f924c278..c8bb33e81c 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -681,7 +681,7 @@
 	/* if there's some queue on the backend, with certain algos we
 	 * know it's because all servers are full.
 	 */
-	if (s->be->queue.length && s->be->queue.length != s->be->beconn &&
+	if (s->be->queue.length && s->be->served && s->be->queue.length != s->be->beconn &&
 	    (((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_FAS)||   // first
 	     ((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_RR) ||   // roundrobin
 	     ((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_SRR))) { // static-rr
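
Side note (not part of the fix): the sequence-counter approach mentioned above
could look roughly like the sketch below. All names here (be_queue_sketch,
queue_seq, be_queue_activity, be_queue_changed) are hypothetical and only
illustrate the general idea of detecting enqueues/dequeues that race with
pendconn_add(); this is not HAProxy code.

/* Hypothetical sketch: a per-backend sequence counter bumped on every
 * enqueue/dequeue, so a caller can snapshot it before queuing and later
 * tell whether a concurrent thread touched the queue in the meantime.
 */
#include <stdatomic.h>

struct be_queue_sketch {
	atomic_uint queue_seq;   /* incremented on every enqueue/dequeue */
};

/* would be called by the (hypothetical) enqueue and dequeue paths */
static inline void be_queue_activity(struct be_queue_sketch *q)
{
	atomic_fetch_add_explicit(&q->queue_seq, 1, memory_order_relaxed);
}

/* returns non-zero if the queue changed since the snapshot <snap>,
 * i.e. a retry of the LB decision might be worth attempting; zero
 * means nothing happened concurrently and retrying cannot help.
 */
static inline int be_queue_changed(struct be_queue_sketch *q, unsigned int snap)
{
	return atomic_load_explicit(&q->queue_seq, memory_order_relaxed) != snap;
}

Compared with checking be->served after the fact, such a counter would make
the "did we miss a race" test independent of whether any connection happens
to be served, which is exactly the situation that caused the loop described
above.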