From: Willy Tarreau <w@1wt.eu>
Date: Wed, 1 Oct 2025 06:28:54 +0000 (+0200)
Subject: BUG/MEDIUM: wdt: improve stuck task detection accuracy
X-Git-Tag: v3.3-dev9~12
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=205f1cbf4cad309601d8b801d46e935dbb249d4a;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: wdt: improve stuck task detection accuracy

The fact that the watchdog timer measures the execution time from the
last return from the poller tends to amplify the impact of multiple
bad tasks, and may explain some of the panics reported by Felipe and
Ricardo in GH issues #3084, #3092 and #3101. The problem is that we
check the time if we see that the scheduler appears not to be moving
anymore, but one situation may still arise and catch a bad task:
  - one slow task takes so long a time that it triggers the watchdog
    twice, emitting a warning the second time (~200ms). The scheduler
    is rightfully marked as stuck.
  - then it completes and the scheduler is no longer stuck. Many other
    tasks run in turn, they all take quite some time but not enough to
    trigger a warning. But collectively their cost adds up.
  - then a task takes more than the warning time (100ms), and causes
    the total execution time to cross the second. The watchdog is
    called, sees that we've spend more than 1 second since we left the
    poller, and marks the thread as stuck.
  - the task is not finished, the watchdog is called again, sees more
    than one second with a stuck thread and panics 100ms later.

The total time away from the poller is indeed more than one second,
which is very bad, but no single task caused this individually, and
while the warnings are OK, the watchdog should not panic in this case.

This patch revisits the approach to store the moment the scheduler was
marked as stuck in the wdt context. The idea is that this date will be
used to detect warnings and panics. And by doing so and exploiting the
new is_sched_alive(thr), we can greatly simplify the mechanism so that
the signal handling thread does the strict minimum (mark the scheduler
as possibly stuck and update the stuck_start date), and only bounces to
the reporting thread if the scheduler made no progress since last call.
This means that without even doing computations in the handing thread,
we can continue to avoid all bounces unless a warning is required. Then
when the reporting thread is signaled, it will check the dates from the
last moment the scheduler was marked, and will decide to warn or panic.

The panic decision continues to pass via a TH_FL_STUCK flag to probe the
code so that exceptionally slow code (e.g. live cert generation etc) can
still find a way to avoid the panic if absolutely certain that things
are still moving.

This means that now we have the guarantee that panics will only happen
if a given task spends more than one full second not moving, and that
warnings will be issued for other calls crossing the warn delay boundary.

This was tested using artificially slow operations, and all combinations
which individually took less than a second only resulted in floods of
warnings even if the total reported time in the warning was much higher,
while those above one second provoked the panic.

One improvement could consist in reporting the time since last stuck
in the thread dumps to differentiate the individual task from the whole
set.

This needs to be backported to 3.2 along with the two previous patches:

    MINOR: sched: let's permit to share the local ctx between threads
    MINOR: sched: pass the thread number to is_sched_alive()
---

diff --git a/src/wdt.c b/src/wdt.c
index 4290d6115..a18745f0f 100644
--- a/src/wdt.c
+++ b/src/wdt.c
@@ -41,6 +41,7 @@
  */
 static struct {
 	timer_t timer;
+	uint64_t stuck_start; /* cpu time when the scheduler's stuck was last set */
 } per_thread_wd_ctx[MAX_THREADS];
 
 /* warn about stuck tasks after this delay (ns) */
@@ -89,14 +90,6 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 
 		tgrp = ha_thread_info[thr].tgid;
 		thr_bit = ha_thread_info[thr].ltid_bit;
-		p = ha_thread_ctx[thr].prev_cpu_time;
-		n = now_cpu_time_thread(thr);
-
-		/* not yet reached the deadline of 1 sec,
-		 * or p wasn't initialized yet
-		 */
-		if (!p)
-			goto update_and_leave;
 
 		if ((_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_SLEEPING) ||
 		    (_HA_ATOMIC_LOAD(&ha_tgroup_ctx[tgrp-1].threads_harmless) & thr_bit)) {
@@ -109,38 +102,21 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 			goto update_and_leave;
 		}
 
-		/* So the thread indeed appears locked up. In order to be
-		 * certain that we're not witnessing an exceptional spike of
-		 * CPU usage due to a configuration issue (like running tens
-		 * of thousands of tasks in a single loop), we'll check if the
-		 * scheduler is still alive by setting the TH_FL_STUCK flag
-		 * that the scheduler clears when switching to the next task.
-		 * If it's already set, then it's our second call with no
-		 * progress and the thread is dead. However, if we figure
-		 * that the scheduler made no progress since last time, we'll
-		 * at least emit a warning.
+		/* check whether the scheduler is still running. The first time
+		 * we check, we mark it as possibly stuck to challenge it, we
+		 * store the last date where we did this, and we quit. On next
+		 * wakeup, if it has not moved, we'll wake up the suspicious
+		 * thread which will perform its own date checks. This way we
+		 * avoid complex computations in a possibly unrelated thread
+		 * and don't wake another thread up as long as everything's OK.
 		 */
-		if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
-			/* after one second it's clear that we're stuck */
-			if (n - p >= 1000000000ULL) {
-				_HA_ATOMIC_OR(&ha_thread_ctx[thr].flags, TH_FL_STUCK);
-				goto update_and_leave;
-			}
-			else if (n - p < (ullong)wdt_warn_blocked_traffic_ns) {
-				/* if we haven't crossed the warning boundary,
-				 * let's just refresh the reporting thread's timer.
-				 */
-				goto update_and_leave;
-			}
+		if (is_sched_alive(thr)) {
+			n = now_cpu_time_thread(thr);
+			_HA_ATOMIC_STORE(&per_thread_wd_ctx[thr].stuck_start, n);
+			goto update_and_leave;
 		}
 
-		/* OK so we've crossed the warning boundary and possibly the
-		 * panic one as well. This may only be reported by the original
-		 * thread. Let's fall back to the common code below which will
-		 * possibly bounce to the reporting thread, which will then
-		 * check the ctxsw count and decide whether to do nothing, to
-		 * warn, or either panic.
-		 */
+		/* Suspiciously didn't change: fall through target thread signaling */
 		break;
 
 #if defined(USE_THREAD) && defined(SI_TKILL) /* Linux uses this */
@@ -165,8 +141,8 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 	}
 
 	/* Right here, we either got a bounce from another thread's WDT to
-	 * report a crossed period, or we noticed it for the current thread.
-	 * For other threads, we're bouncing.
+	 * report a suspciously stuck scheduler, or we noticed it for the
+	 * current thread. For other threads, we're bouncing.
 	 */
 #ifdef USE_THREAD
 	if (thr != tid) {
@@ -175,23 +151,52 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 	}
 #endif
 
-	/* Now the interesting things begin. We're on the thread of interest.
-	 * Its timer was at least as large as the warning threshold since poll
-	 * was left. If it was at least as high as the panic threshold, we also
-	 * have TH_FL_STUCK, which now proves that nothing is happening since
-	 * the scheduler clears it for each task. We can still recheck whether
-	 * the scheduler looks alive and get away with all of this if we've got
-	 * a proof that it's making forward progress. If stuck, we have to die,
-	 * otherwise we just send a warning. In short, is_sched_alive() serves
-	 * as a ping to detect the warning condition while TH_FL_STUCK works
-	 * the same but for a panic condition.
+	/* OK here we're on the target thread (thr==tid). It was reported that
+	 * the scheduler was not moving. This might have changed since, if we
+	 * got that from another thread. Otherwise we'll run time checks to
+	 * verify the situation, and possibly the need to warn or panic.
+	 */
+	n = now_cpu_time();
+
+	if (is_sched_alive(thr)) {
+		_HA_ATOMIC_STORE(&per_thread_wd_ctx[thr].stuck_start, n);
+		goto update_and_leave;
+	}
+
+	/* check when we saw last activity (in CPU time) */
+	p = ha_thread_ctx[thr].prev_cpu_time;
+
+	/* p not yet initialized (e.g. signal received during early boot) */
+	if (!p)
+		goto update_and_leave;
+
+	/* check the most recent known activity */
+	if (p < per_thread_wd_ctx[thr].stuck_start)
+		p = per_thread_wd_ctx[thr].stuck_start;
+
+	/* if we haven't crossed the warning boundary, let's just refresh the
+	 * reporting thread's timer.
+	 */
+	if (n - p < (ullong)wdt_warn_blocked_traffic_ns)
+		goto update_and_leave;
+
+	/* The thread indeed appears locked up, it hasn't made any progress
+	 * for at least the configured warning time. If it crosses the second,
+	 * we'll mark it with TH_FL_STUCK so that the next call will panic.
+	 * Doing so still permits exceptionally long operations to mark
+	 * themselves as under control and not stuck to avoid the panic.
+	 * Otherwise we just emit a warning, and this one doesn't consider
+	 * TH_FL_STUCK (i.e. a slow code path must always be reported to the
+	 * user, even if under control).
 	 */
 	if (_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_STUCK)
 		ha_panic();
 
-	if (!is_sched_alive(thr))
-		ha_stuck_warning();
+	/* after one second it's clear that we're stuck */
+	if (n - p >= 1000000000ULL)
+		_HA_ATOMIC_OR(&ha_thread_ctx[thr].flags, TH_FL_STUCK);
 
+	ha_stuck_warning();
 	/* let's go on */
 
  update_and_leave: