From 24ce001771a7609b2a3902fc1f851668ef176c59 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Thu, 21 Nov 2024 19:11:18 +0100
Subject: [PATCH] BUG/MEDIUM: wdt: fix the stuck detection for warnings

If two slow tasks trigger one warning even a few seconds apart, the
watchdog code will mistakenly take this for a definite stuck task and
kill the process. The reason is that since commit 148eb5875f ("DEBUG:
wdt: better detect apparently locked up threads and warn about them")
the updated ctxsw count is not the correct one, instead of updating
the private counter it resets the public one, preventing it from making
progress and making the wdt believe that no progress was made. In
addition the initial value was read from [tid] instead of [thr].

Please note that another fix is needed in debug_handler() otherwise the
watchdog will fire early after the first warning or thread dump.

A simple test for this is to issue several of these commands back-to-back
on the CLI, which crashes an unfixed 3.1 very quickly:

  $ socat /tmp/sock1 - <<< "expert-mode on; debug dev loop 1000"

This needs to be backported to 2.9 since the fix above was backported
there. The impact on 3.0 and 2.9 is almost inexistent since the watchdog
there doesn't apply the shorter warning delay, so the first call already
indicates that the thread is stuck.
---
 src/wdt.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/wdt.c b/src/wdt.c
index dcec435d4a..2a9e41605a 100644
--- a/src/wdt.c
+++ b/src/wdt.c
@@ -120,7 +120,7 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
 			uint prev_ctxsw;
 
-			prev_ctxsw = HA_ATOMIC_LOAD(&per_thread_wd_ctx[tid].prev_ctxsw);
+			prev_ctxsw = HA_ATOMIC_LOAD(&per_thread_wd_ctx[thr].prev_ctxsw);
 
 			/* only after one second it's clear we're stuck */
 			if (n - p >= 1000000000ULL)
@@ -131,9 +131,11 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 			 * a warning (unless already stuck).
 			 */
 			if (n - p >= (ullong)wdt_warn_blocked_traffic_ns) {
-				if (HA_ATOMIC_LOAD(&activity[thr].ctxsw) == prev_ctxsw)
+				uint curr_ctxsw = HA_ATOMIC_LOAD(&activity[thr].ctxsw);
+
+				if (curr_ctxsw == prev_ctxsw)
 					ha_stuck_warning(thr);
-				HA_ATOMIC_STORE(&activity[thr].ctxsw, prev_ctxsw);
+				HA_ATOMIC_STORE(&per_thread_wd_ctx[thr].prev_ctxsw, curr_ctxsw);
 			}
 
 			goto update_and_leave;
-- 
2.39.5
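
Illustration only, not part of the patch: the counter bookkeeping
described in the commit message can be sketched as a small single-threaded
model. Everything below is hypothetical (struct wd_model,
check_before_patch() and check_after_patch() do not exist in HAProxy);
plain integers stand in for the atomic accesses to activity[thr].ctxsw and
per_thread_wd_ctx[thr].prev_ctxsw. It only shows which variable each
version of the store touches and does not try to reproduce the process
kill, which according to the message also depends on debug_handler().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two counters involved in the check. */
struct wd_model {
	unsigned int ctxsw;      /* public counter, bumped on each context switch */
	unsigned int prev_ctxsw; /* watchdog's private snapshot of that counter */
};

/* Bookkeeping as it was before the patch: the store targets the
 * public counter, so it is rewound and the snapshot never advances.
 * Returns true when a warning would be emitted.
 */
static bool check_before_patch(struct wd_model *m)
{
	assert(m != NULL);
	bool warn = (m->ctxsw == m->prev_ctxsw);

	m->ctxsw = m->prev_ctxsw;
	return warn;
}

/* Bookkeeping as it is after the patch: the public counter is only
 * read, and the private snapshot is updated with the value just seen.
 * Returns true when a warning would be emitted.
 */
static bool check_after_patch(struct wd_model *m)
{
	assert(m != NULL);
	unsigned int curr = m->ctxsw;
	bool warn = (curr == m->prev_ctxsw);

	m->prev_ctxsw = curr;
	return warn;
}
```

Calling each function once on a model whose ctxsw has advanced from its
initial value leaves the pre-patch model with ctxsw rewound to the stale
snapshot, while the post-patch model keeps ctxsw intact and records it in
prev_ctxsw.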