From: Willy Tarreau
Date: Fri, 26 Feb 2021 09:18:11 +0000 (+0100)
Subject: MINOR: task: only limit TL_HEAVY tasks but not others
X-Git-Tag: v2.4-dev10~14
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=76390dac0654f6c539f4c20bb7c2ad3a6e74eea9;p=thirdparty%2Fhaproxy.git

MINOR: task: only limit TL_HEAVY tasks but not others

The preliminary approach to dealing with heavy tasks forced us to quit
the poller after meeting one. Now instead we process at most one per
poll loop and ignore the next ones, so that we get more bandwidth to
process all other classes. Doing so further reduced the induced HTTP
request latency at 100k req/s, under the stress of 1000 concurrent SSL
handshakes, in the following proportions:

            |  default   | low-latency
  ----------+------------+--------------
   before   |  2.75 ms   |   2.0 ms
   after    |  1.38 ms   |   0.98 ms

In both cases, the latency is roughly halved. It's worth noting that
both values are now exactly 10 times better than in 2.4-dev9, and even
the percentiles have improved considerably.
For 16 HTTP connections (1 per thread) competing with 1000 SSL
handshakes, we're seeing these long-tail latencies (in milliseconds):

              |  99.5%  |  99.9%  |  100%
   -----------+---------+---------+--------
   2.4-dev9   |  48.4   |  58.1   |  78.5
   previous   |   6.2   |  11.4   |  67.8
   this patch |   2.8   |   2.9   |   6.1

The task latency profiling report now shows this in default mode:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    si_cs_io_cb                 3061966    2.224s   726.0ns    42.03s   13.72us
    h1_io_cb                    3061960    6.418s   2.096us    18.76m   367.6us
    process_stream              3059982    9.137s   2.985us    15.52m   304.3us
    ssl_sock_io_cb               602657    4.265m   424.7us    4.736h   28.29ms
    h1_timeout_task              202973         -         -    6.254s   30.81us
    accept_queue_process         135547    1.179s   8.699us    16.29s   120.1us
    srv_cleanup_toremove_conns       81   15.64ms   193.1us   30.87ms   381.1us
    task_run_applet                  10   758.7us   75.87us   51.77us   5.176us
    srv_cleanup_idle_conns            4   375.3us   93.83us   54.52us   13.63us

And this in low-latency mode, showing that both si_cs_io_cb() and
process_stream() have significantly benefited from the improvement,
with values 50 to 200 times smaller than in 2.4-dev9:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    6407006    11.86s   1.851us    31.14m   291.6us
    process_stream              6403890    18.40s   2.873us    2.134m   20.00us
    si_cs_io_cb                 6403866    4.139s   646.0ns    1.773m   16.61us
    ssl_sock_io_cb               894326    6.407m   429.9us    7.326h   29.49ms
    h1_timeout_task              301189         -         -    8.440s   28.02us
    accept_queue_process         211989    1.691s   7.977us    21.48s   101.3us
    srv_cleanup_toremove_conns      220   23.46ms   106.7us   65.61ms   298.2us
    task_run_applet                  16   1.219ms   76.17us   181.7us   11.36us
    srv_cleanup_idle_conns           12   713.3us   59.44us   168.4us   14.03us

The changes are slightly more invasive than previous ones and depend on
recent patches, so they are not likely well suited for backporting.
---

diff --git a/src/task.c b/src/task.c
index a64d1dfeb6..4c73b8cc55 100644
--- a/src/task.c
+++ b/src/task.c
@@ -441,7 +441,6 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 	unsigned int done = 0;
 	unsigned int queue;
 	unsigned short state;
-	char heavy_calls = 0;
 	void *ctx;
 
 	for (queue = 0; queue < TL_CLASSES;) {
@@ -490,19 +489,6 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *, list);
 		state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_HEAVY|TASK_KILLED);
 
-		if (state & TASK_HEAVY) {
-			/* This is a heavy task. We'll call no more than one
-			 * per function call. If we called one already, we'll
-			 * return and announce the max possible weight so that
-			 * the caller doesn't come back too soon.
-			 */
-			if (heavy_calls) {
-				done = INT_MAX; // 11ms instead of 3 without this
-				break; // too many heavy tasks processed already
-			}
-			heavy_calls = 1;
-		}
-
 		ti->flags &= ~TI_FL_STUCK; // this thread is still running
 		activity[tid].ctxsw++;
 		ctx = t->context;
@@ -647,6 +633,7 @@ void process_runnable_tasks()
 	unsigned int queue;
 	int max_processed;
 	int lpicked, gpicked;
+	int heavy_queued = 0;
 	int budget;
 
 	ti->flags &= ~TI_FL_STUCK; // this thread is still running
@@ -679,12 +666,16 @@ void process_runnable_tasks()
 		max[TL_BULK] = default_weights[TL_BULK];
 
 		/* heavy tasks are processed only once and never refilled in a
-		 * call round.
+		 * call round. That budget is not lost either as we don't reset
+		 * it unless consumed.
 		 */
-		if ((tt->tl_class_mask & (1 << TL_HEAVY)))
-			max[TL_HEAVY] = default_weights[TL_HEAVY];
-		else
-			max[TL_HEAVY] = 0;
+		if (!heavy_queued) {
+			if ((tt->tl_class_mask & (1 << TL_HEAVY)))
+				max[TL_HEAVY] = default_weights[TL_HEAVY];
+			else
+				max[TL_HEAVY] = 0;
+			heavy_queued = 1;
+		}
 
 		/* Now compute a fair share of the weights. Total may slightly exceed
 		 * 100% due to rounding, this is not a problem. Note that while in
@@ -701,6 +692,12 @@ void process_runnable_tasks()
 	for (queue = 0; queue < TL_CLASSES; queue++)
 		max[queue] = ((unsigned)max_processed * max[queue] + max_total - 1) / max_total;
 
+	/* The heavy queue must never process more than one task at once
+	 * anyway.
+	 */
+	if (max[TL_HEAVY] > 1)
+		max[TL_HEAVY] = 1;
+
 	lrq = grq = NULL;
 
 	/* pick up to max[TL_NORMAL] regular tasks from prio-ordered run queues */