If two slow tasks trigger one warning even a few seconds apart, the
watchdog code will mistakenly take this for a definitely stuck task and
kill the process. The reason is that since commit
148eb5875f ("DEBUG:
wdt: better detect apparently locked up threads and warn about them"),
the updated ctxsw count is not the correct one: instead of updating
the private counter, the code resets the public one, preventing it from
making progress and making the wdt believe that no progress was made. In
addition, the initial value was read from [tid] instead of [thr].
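The faulty pattern and the fix can be reduced to the following standalone
sketch (the names check_buggy/check_fixed and the plain globals are
hypothetical simplifications of the real per-thread atomic counters, not
the actual HAProxy code):

```c
/* ctxsw is the public per-thread context-switch counter bumped by the
 * scheduler; prev_ctxsw is the watchdog's private copy of the last
 * value it observed. Single-threaded simplification for illustration. */

static unsigned int ctxsw;       /* public counter, owned by the scheduler */
static unsigned int prev_ctxsw;  /* private copy, owned by the watchdog */

/* buggy variant: the comparison is right, but the store targets the
 * PUBLIC counter with the stale private value, erasing any progress
 * the thread made since the previous check */
static int check_buggy(void)
{
	int stuck = (ctxsw == prev_ctxsw);

	ctxsw = prev_ctxsw; /* wrong store target: clobbers the public counter */
	return stuck;
}

/* fixed variant: only the private copy is updated, so the public
 * counter keeps advancing and progress remains visible */
static int check_fixed(void)
{
	unsigned int curr = ctxsw;
	int stuck = (curr == prev_ctxsw);

	prev_ctxsw = curr; /* remember what we saw for the next check */
	return stuck;
}
```

With the buggy variant, after one warning any context switches the thread
performs are wiped out by the reset, so a later check compares two frozen
values and concludes the thread never progressed.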
Please note that another fix is needed in debug_handler(), otherwise the
watchdog will fire early after the first warning or thread dump.
A simple test for this is to issue several of these commands back-to-back
on the CLI, which crashes an unfixed 3.1 very quickly:
$ socat /tmp/sock1 - <<< "expert-mode on; debug dev loop 1000"
This needs to be backported to 2.9 since the fix above was backported
there. The impact on 3.0 and 2.9 is almost nonexistent since the watchdog
there doesn't apply the shorter warning delay, so the first call already
indicates that the thread is stuck.
if (!(_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_STUCK)) {
uint prev_ctxsw;
- prev_ctxsw = HA_ATOMIC_LOAD(&per_thread_wd_ctx[tid].prev_ctxsw);
+ prev_ctxsw = HA_ATOMIC_LOAD(&per_thread_wd_ctx[thr].prev_ctxsw);
/* only after one second it's clear we're stuck */
if (n - p >= 1000000000ULL)
* a warning (unless already stuck).
*/
if (n - p >= (ullong)wdt_warn_blocked_traffic_ns) {
- if (HA_ATOMIC_LOAD(&activity[thr].ctxsw) == prev_ctxsw)
+ uint curr_ctxsw = HA_ATOMIC_LOAD(&activity[thr].ctxsw);
+
+ if (curr_ctxsw == prev_ctxsw)
ha_stuck_warning(thr);
- HA_ATOMIC_STORE(&activity[thr].ctxsw, prev_ctxsw);
+ HA_ATOMIC_STORE(&per_thread_wd_ctx[thr].prev_ctxsw, curr_ctxsw);
}
goto update_and_leave;