From: Willy Tarreau
Date: Fri, 26 Feb 2021 09:18:11 +0000 (+0100)
Subject: MINOR: task: only limit TL_HEAVY tasks but not others
X-Git-Tag: v2.4-dev10~14
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=76390dac0654f6c539f4c20bb7c2ad3a6e74eea9;p=thirdparty%2Fhaproxy.git

MINOR: task: only limit TL_HEAVY tasks but not others

The preliminary approach to dealing with heavy tasks forced us to quit
the poller after meeting one. Now instead we process at most one per
poll loop and ignore the next ones, so that we get more bandwidth to
process all other classes. Doing so further reduced the induced HTTP
request latency at 100k req/s, under the stress of 1000 concurrent SSL
handshakes, in the following proportions:

            |  default   | low-latency
  ----------+------------+--------------
   before   |  2.75 ms   |   2.0 ms
   after    |  1.38 ms   |   0.98 ms

In both cases, the latency is roughly halved. It's worth noting that
both values are now exactly 10 times better than in 2.4-dev9, and even
the percentiles have improved considerably.
For 16 HTTP connections (1 per thread) competing with 1000 SSL
handshakes, we're seeing these long-tail latencies (in milliseconds):

              |  99.5%  |  99.9%  |  100%
   -----------+---------+---------+--------
   2.4-dev9   |  48.4   |  58.1   |  78.5
   previous   |   6.2   |  11.4   |  67.8
   this patch |   2.8   |   2.9   |   6.1

The task latency profiling report now shows this in default mode:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    si_cs_io_cb                 3061966    2.224s   726.0ns    42.03s   13.72us
    h1_io_cb                    3061960    6.418s   2.096us    18.76m   367.6us
    process_stream              3059982    9.137s   2.985us    15.52m   304.3us
    ssl_sock_io_cb               602657    4.265m   424.7us    4.736h   28.29ms
    h1_timeout_task              202973         -         -    6.254s   30.81us
    accept_queue_process         135547    1.179s   8.699us    16.29s   120.1us
    srv_cleanup_toremove_conns       81   15.64ms   193.1us   30.87ms   381.1us
    task_run_applet                  10   758.7us   75.87us   51.77us   5.176us
    srv_cleanup_idle_conns            4   375.3us   93.83us   54.52us   13.63us

And this in low-latency mode, showing that both si_cs_io_cb() and
process_stream() have significantly benefited from the improvement,
with values 50 to 200 times smaller than in 2.4-dev9:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    6407006    11.86s   1.851us    31.14m   291.6us
    process_stream              6403890    18.40s   2.873us    2.134m   20.00us
    si_cs_io_cb                 6403866    4.139s   646.0ns    1.773m   16.61us
    ssl_sock_io_cb               894326    6.407m   429.9us    7.326h   29.49ms
    h1_timeout_task              301189         -         -    8.440s   28.02us
    accept_queue_process         211989    1.691s   7.977us    21.48s   101.3us
    srv_cleanup_toremove_conns      220   23.46ms   106.7us   65.61ms   298.2us
    task_run_applet                  16   1.219ms   76.17us   181.7us   11.36us
    srv_cleanup_idle_conns           12   713.3us   59.44us   168.4us   14.03us

The changes are slightly more invasive than previous ones and depend on
recent patches, so they are not likely well suited for backporting.
---

diff --git a/src/task.c b/src/task.c
index a64d1dfeb6..4c73b8cc55 100644
--- a/src/task.c
+++ b/src/task.c
@@ -441,7 +441,6 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 	unsigned int done = 0;
 	unsigned int queue;
 	unsigned short state;
-	char heavy_calls = 0;
 	void *ctx;
 
 	for (queue = 0; queue < TL_CLASSES;) {
@@ -490,19 +489,6 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *, list);
 		state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_HEAVY|TASK_KILLED);
 
-		if (state & TASK_HEAVY) {
-			/* This is a heavy task. We'll call no more than one
-			 * per function call. If we called one already, we'll
-			 * return and announce the max possible weight so that
-			 * the caller doesn't come back too soon.
-			 */
-			if (heavy_calls) {
-				done = INT_MAX; // 11ms instead of 3 without this
-				break; // too many heavy tasks processed already
-			}
-			heavy_calls = 1;
-		}
-
 		ti->flags &= ~TI_FL_STUCK; // this thread is still running
 		activity[tid].ctxsw++;
 		ctx = t->context;
@@ -647,6 +633,7 @@ void process_runnable_tasks()
 	unsigned int queue;
 	int max_processed;
 	int lpicked, gpicked;
+	int heavy_queued = 0;
 	int budget;
 
 	ti->flags &= ~TI_FL_STUCK; // this thread is still running
@@ -679,12 +666,16 @@ void process_runnable_tasks()
 		max[TL_BULK] = default_weights[TL_BULK];
 
 		/* heavy tasks are processed only once and never refilled in a
-		 * call round.
+		 * call round. That budget is not lost either as we don't reset
+		 * it unless consumed.
 		 */
-		if ((tt->tl_class_mask & (1 << TL_HEAVY)))
-			max[TL_HEAVY] = default_weights[TL_HEAVY];
-		else
-			max[TL_HEAVY] = 0;
+		if (!heavy_queued) {
+			if ((tt->tl_class_mask & (1 << TL_HEAVY)))
+				max[TL_HEAVY] = default_weights[TL_HEAVY];
+			else
+				max[TL_HEAVY] = 0;
+			heavy_queued = 1;
+		}
 
 		/* Now compute a fair share of the weights. Total may slightly exceed
 		 * 100% due to rounding, this is not a problem. Note that while in
@@ -701,6 +692,12 @@ void process_runnable_tasks()
 	for (queue = 0; queue < TL_CLASSES; queue++)
 		max[queue] = ((unsigned)max_processed * max[queue] + max_total - 1) / max_total;
 
+	/* The heavy queue must never process more than one task at once
+	 * anyway.
+	 */
+	if (max[TL_HEAVY] > 1)
+		max[TL_HEAVY] = 1;
+
 	lrq = grq = NULL;
 
 	/* pick up to max[TL_NORMAL] regular tasks from prio-ordered run queues */