From: Willy Tarreau
Date: Tue, 20 May 2025 13:52:44 +0000 (+0200)
Subject: BUG/MEDIUM: wdt: always ignore the first watchdog wakeup
X-Git-Tag: v3.2-dev17~17
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=0a8bfb5b900f017c17bf6460669a18cf693b5927;p=thirdparty%2Fhaproxy.git

BUG/MEDIUM: wdt: always ignore the first watchdog wakeup

With commit a06c215f08 ("MEDIUM: wdt: always make the faulty thread
report its own warnings"), when the TH_FL_STUCK flag was flipped on,
we'd then go to the panic code instead of giving a second chance like
before the commit. This can trigger rare cases that only happen with
moderate loads like was addressed by commit 24ce001771 ("BUG/MEDIUM:
wdt: fix the stuck detection for warnings"). This is in fact due to
the loss of the common "goto update_and_leave" that used to serve both
the warning code and the flag setting for probation, and it's apparently
what hit Christian in issue #2980.

Let's make sure we exit naturally when turning the bit on for the first
time. Let's also update the confusing comment at the end of the check
that was left over by latest change.

Since the first commit was backported to 3.1, this commit should be
backported there as well.
---

diff --git a/src/wdt.c b/src/wdt.c
index 1863cc35d..d52fded05 100644
--- a/src/wdt.c
+++ b/src/wdt.c
@@ -122,25 +122,25 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		 */
 		if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
 			/* after one second it's clear that we're stuck */
-			if (n - p >= 1000000000ULL)
+			if (n - p >= 1000000000ULL) {
 				_HA_ATOMIC_OR(&ha_thread_ctx[thr].flags, TH_FL_STUCK);
+				goto update_and_leave;
+			}
 			else if (n - p < (ullong)wdt_warn_blocked_traffic_ns) {
 				/* if we haven't crossed the warning boundary,
 				 * let's just refresh the reporting thread's timer.
 				 */
 				goto update_and_leave;
 			}
-
-			/* OK so we've crossed the warning boundary and possibly the
-			 * panic one as well. This may only be reported by the original
-			 * thread. Let's fall back to the common code below which will
-			 * possibly bounce to the reporting thread, which will then
-			 * check the ctxsw count and decide whether to do nothing, to
-			 * warn, or either panic.
-			 */
 		}
 
-		/* No doubt now, there's no hop to recover, die loudly! */
+		/* OK so we've crossed the warning boundary and possibly the
+		 * panic one as well. This may only be reported by the original
+		 * thread. Let's fall back to the common code below which will
+		 * possibly bounce to the reporting thread, which will then
+		 * check the ctxsw count and decide whether to do nothing, to
+		 * warn, or either panic.
+		 */
 		break;
 #if defined(USE_THREAD) && defined(SI_TKILL) /* Linux uses this */