From: Willy Tarreau
Date: Thu, 16 Feb 2023 08:19:21 +0000 (+0100)
Subject: BUG/MEDIUM: sched: allow a bit more TASK_HEAVY to be processed when needed
X-Git-Tag: v2.8-dev5~190
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=ba4c7a15978deaf74b6af09d2a13b4fff7ccea74;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: sched: allow a bit more TASK_HEAVY to be processed when needed

As reported in github issue #1881, there are situations where an excess
of TLS handshakes can cause a livelock. What's happening is that
normally we process at most one TLS handshake per loop iteration to
keep the latency low. This is done by tagging them with TASK_HEAVY and
queuing these tasklets in the TL_HEAVY queue. But if something slows
down the loop, such as a connect() call when no more ports are
available, we could end up processing no more than a few hundred or
thousand handshakes per second. If that limit becomes lower than the
rate of incoming handshakes, we will accumulate them and at some point
users will get impatient and give up or retry.

Then a new problem happens: the queue fills up with even more handshake
attempts, only one of which will be handled per iteration, so we can
end up processing only outdated handshakes at a low rate, with
basically nothing else in the queue. This can for example happen in
parallel with health checks, which do not require incoming handshakes
to succeed and thus keep generating enough activity to maintain the
high-latency condition.

Here we're taking a slightly different approach. First, instead of
always allowing only one handshake per loop (and usually it's critical
for latency), we take the current situation into account (a standalone
sketch of the resulting rule follows the message below):

  - if configured with tune.sched.low-latency, the limit remains 1

  - if there are other non-heavy tasks, we set the limit to 1 + one per
    1024 tasks, so that a heavily loaded queue of 4k handshakes per
    thread will be able to drain them at ~4 per loop with a limited
    impact on latency

  - if there are no other tasks, the limit grows to 1 + one per 128
    tasks, so that a heavily loaded queue of 4k handshakes per thread
    will be able to drain them at ~32 per loop, still with a very
    limited impact on latency since only I/O will get delayed

It was verified on a 56-core Xeon-8480 that this did not degrade the
latency; all requests remained below 1ms end-to-end in full close +
handshake, and even 500us under low-lat + busy-polling.

This must be backported to 2.4.
---
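For illustration, a minimal stand-alone sketch of the budget rule above;
heavy_budget() and its parameter names are hypothetical stand-ins for
GTUNE_SCHED_LOW_LATENCY, tt->tl_class_mask and tt->rq_total, while the
constants and the three-way split mirror the patch:

#include <stdio.h>

/* Hypothetical stand-alone version of the budget rule; heavy_budget()
 * and its parameters are illustrative names, not the haproxy API.
 */
static unsigned heavy_budget(int low_latency, int other_classes_queued,
                             unsigned rq_total)
{
	if (low_latency)
		return 1;                    /* tune.sched.low-latency: stay at 1 */
	if (other_classes_queued)
		return 1 + rq_total / 1024;  /* mixed load: ~4 for 4k tasks */
	return 1 + rq_total / 128;           /* heavy-only: ~32 for 4k tasks */
}

int main(void)
{
	/* 4096 queued handshakes per thread, as in the message above */
	printf("low-latency: %u\n", heavy_budget(1, 1, 4096)); /* 1 */
	printf("mixed:       %u\n", heavy_budget(0, 1, 4096)); /* 5 */
	printf("heavy-only:  %u\n", heavy_budget(0, 0, 4096)); /* 33 */
	return 0;
}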
diff --git a/src/task.c b/src/task.c
index d4625535ab..990ab42813 100644
--- a/src/task.c
+++ b/src/task.c
@@ -770,11 +770,26 @@ void process_runnable_tasks()
 	for (queue = 0; queue < TL_CLASSES; queue++)
 		max[queue] = ((unsigned)max_processed * max[queue] + max_total - 1) / max_total;
 
-	/* The heavy queue must never process more than one task at once
-	 * anyway.
+	/* The heavy queue must never process more than very few tasks at once
+	 * anyway. We set the limit to 1 if running on low_latency scheduling,
+	 * given that we know that other values can have an impact on latency
+	 * (~500us end-to-end connection achieved at 130kcps in SSL), 1 + one
+	 * per 1024 tasks if there is at least one non-heavy task while still
+	 * respecting the ratios above, or 1 + one per 128 tasks if only heavy
+	 * tasks are present. This allows to drain excess SSL handshakes more
+	 * efficiently if the queue becomes congested.
 	 */
-	if (max[TL_HEAVY] > 1)
-		max[TL_HEAVY] = 1;
+	if (max[TL_HEAVY] > 1) {
+		if (global.tune.options & GTUNE_SCHED_LOW_LATENCY)
+			budget = 1;
+		else if (tt->tl_class_mask & ~(1 << TL_HEAVY))
+			budget = 1 + tt->rq_total / 1024;
+		else
+			budget = 1 + tt->rq_total / 128;
+
+		if (max[TL_HEAVY] > budget)
+			max[TL_HEAVY] = budget;
+	}
 
 	lrq = grq = NULL;
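And a rough, equally hypothetical sketch of the drain-rate claim: it
counts how many scheduler loops are needed to empty a 4k-task
heavy-only backlog (no new arrivals assumed) under the old fixed budget
of 1 versus the new 1 + backlog/128 rule:

#include <stdio.h>

/* Illustration only, not haproxy code: count how many scheduler loops
 * are needed to drain a backlog of heavy tasks, assuming no new
 * arrivals, under the old fixed budget of 1 versus the new heavy-only
 * budget of 1 + backlog/128 from the patch above.
 */
static unsigned loops_to_drain(unsigned backlog, int new_policy)
{
	unsigned loops = 0;

	while (backlog) {
		unsigned budget = new_policy ? 1 + backlog / 128 : 1;

		backlog -= budget < backlog ? budget : backlog;
		loops++;
	}
	return loops;
}

int main(void)
{
	printf("old policy: %u loops\n", loops_to_drain(4096, 0)); /* 4096 */
	printf("new policy: %u loops\n", loops_to_drain(4096, 1)); /* a few hundred */
	return 0;
}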