From: Willy Tarreau <w@1wt.eu>
Date: Tue, 6 May 2025 16:55:04 +0000 (+0200)
Subject: BUG/MAJOR: queue: lock around the call to pendconn_process_next_strm()
X-Git-Tag: v3.2-dev15~21
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=99f5be5631923844fb738fdc2d001178a2756834;p=thirdparty%2Fhaproxy.git

BUG/MAJOR: queue: lock around the call to pendconn_process_next_strm()

The extra call to pendconn_process_next_strm() made in commit cda7275ef5
("MEDIUM: queue: Handle the race condition between queue and dequeue
differently") was performed after releasing the server queue's lock,
which is incompatible with the calling convention for this function.
The result is random corruption of the server's streams list likely
due to picking old or incorrect pendconns from the queue, and in the
end infinitely looping on apparently already locked mt_list objects.
Just adding the lock fixes the problem.

It's very difficult to reproduce, it requires low maxconn values on
servers, stickiness on the servers (cookie), a long enough slowstart
(e.g. 10s), and regularly flipping servers up/down to re-trigger the
slowstart.

No backport is needed as this was only in 3.2.
---

diff --git a/src/queue.c b/src/queue.c
index 62832f07a..d5f737bf5 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -513,12 +513,15 @@ int process_srv_queue(struct server *s)
 		 * just in case try to run one more stream.
 		 */
 		for (i = 0; i < global.nbtgroups; i++) {
+			HA_SPIN_LOCK(QUEUE_LOCK, &s->per_tgrp[i].queue.lock);
 			if (pendconn_process_next_strm(s, p, px_ok, i + 1)) {
+				HA_SPIN_UNLOCK(QUEUE_LOCK, &s->per_tgrp[i].queue.lock);
 				_HA_ATOMIC_SUB(&p->totpend, 1);
 				_HA_ATOMIC_ADD(&p->served, 1);
 				done++;
 				break;
 			}
+			HA_SPIN_UNLOCK(QUEUE_LOCK, &s->per_tgrp[i].queue.lock);
 		}
 	}
 	return done;