From: Mayank Rungta
Date: Thu, 12 Mar 2026 23:22:02 +0000 (-0700)
Subject: watchdog: return early in watchdog_hardlockup_check()
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=3e811cae321904c111f3e963b165c1eb0bc17ae0;p=thirdparty%2Flinux.git

watchdog: return early in watchdog_hardlockup_check()

Patch series "watchdog/hardlockup: Improvements to hardlockup", v2.

This series addresses limitations in the hardlockup detector
implementations and updates the documentation to reflect actual behavior
and recent changes.  The changes are structured as follows:

Refactoring (Patch 1)
=====================

Patch 1 refactors watchdog_hardlockup_check() to return early if no
lockup is detected.  This reduces the indentation level of the main logic
block, serving as a clean base for the subsequent changes.

Hardlockup Detection Improvements (Patches 2 & 4)
=================================================

The hardlockup detector logic relies on updating saved interrupt counts
to determine if the CPU is making progress.  Patch 2 ensures that the
saved interrupt count is updated unconditionally before checking the
"touched" flag.  This prevents stale comparisons which can delay
detection.  This is a logic fix that ensures the detector remains
accurate even when the watchdog is frequently touched.

Patch 4 improves the Buddy detector's timeliness.  The current checking
interval (every 3rd sample) causes high variability in detection time
(up to 24s).  This patch changes the Buddy detector to check at every
hrtimer interval (4s) with a missed-interrupt threshold of 3, narrowing
the detection window to a consistent 8-12 second range.

Documentation Updates (Patches 3 & 5)
=====================================

The current documentation does not fully capture the variable nature of
detection latency or the details of the Buddy system.

Patch 3 removes the strict "10 seconds" definition of a hardlockup, which
was misleading given the periodic nature of the detector.
It adds a "Detection Overhead" section to the admin guide, using "Best
Case" and "Worst Case" scenarios to illustrate that detection time can
vary significantly (e.g., ~6s to ~20s).

Patch 5 adds a dedicated section for the Buddy detector, which was
previously undocumented.  It details the mechanism, the new timing
logic, and known limitations.

This patch (of 5):

Invert the `is_hardlockup(cpu)` check in `watchdog_hardlockup_check()`
to return early when a hardlockup is not detected.  This flattens the
main logic block, reducing the indentation level and making the code
easier to read and maintain.

This refactoring serves as a preparation patch for future hardlockup
changes.

Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-1-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta
Reviewed-by: Douglas Anderson
Reviewed-by: Petr Mladek
Cc: Ian Rogers
Cc: Jonathan Corbet
Cc: Li Huafei
Cc: Max Kellermann
Cc: Shuah Khan
Cc: Stephane Eranian
Cc: Wang Jinchao
Cc: Yunhui Cui
Signed-off-by: Andrew Morton
---

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7d675781bc917..4c5b474957455 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -187,6 +187,8 @@ static void watchdog_hardlockup_kick(void)
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 {
 	int hardlockup_all_cpu_backtrace;
+	unsigned int this_cpu;
+	unsigned long flags;
 
 	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
 		per_cpu(watchdog_hardlockup_touched, cpu) = false;
@@ -201,74 +203,73 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 	 * fired multiple times before we overflow'd. If it hasn't
 	 * then this is a good indication the cpu is stuck
 	 */
-	if (is_hardlockup(cpu)) {
-		unsigned int this_cpu = smp_processor_id();
-		unsigned long flags;
+	if (!is_hardlockup(cpu)) {
+		per_cpu(watchdog_hardlockup_warned, cpu) = false;
+		return;
+	}
 
 #ifdef CONFIG_SYSFS
-		++hardlockup_count;
+	++hardlockup_count;
 #endif
 
-		/*
-		 * A poorly behaving BPF scheduler can trigger hard lockup by
-		 * e.g. putting numerous affinitized tasks in a single queue and
-		 * directing all CPUs at it. The following call can return true
-		 * only once when sched_ext is enabled and will immediately
-		 * abort the BPF scheduler and print out a warning message.
-		 */
-		if (scx_hardlockup(cpu))
-			return;
+	/*
+	 * A poorly behaving BPF scheduler can trigger hard lockup by
+	 * e.g. putting numerous affinitized tasks in a single queue and
+	 * directing all CPUs at it. The following call can return true
+	 * only once when sched_ext is enabled and will immediately
+	 * abort the BPF scheduler and print out a warning message.
+	 */
+	if (scx_hardlockup(cpu))
+		return;
 
-		/* Only print hardlockups once. */
-		if (per_cpu(watchdog_hardlockup_warned, cpu))
-			return;
+	/* Only print hardlockups once. */
+	if (per_cpu(watchdog_hardlockup_warned, cpu))
+		return;
 
-		/*
-		 * Prevent multiple hard-lockup reports if one cpu is already
-		 * engaged in dumping all cpu back traces.
-		 */
-		if (hardlockup_all_cpu_backtrace) {
-			if (test_and_set_bit_lock(0, &hard_lockup_nmi_warn))
-				return;
-		}
+	/*
+	 * Prevent multiple hard-lockup reports if one cpu is already
+	 * engaged in dumping all cpu back traces.
+	 */
+	if (hardlockup_all_cpu_backtrace) {
+		if (test_and_set_bit_lock(0, &hard_lockup_nmi_warn))
+			return;
+	}
 
-		/*
-		 * NOTE: we call printk_cpu_sync_get_irqsave() after printing
-		 * the lockup message. While it would be nice to serialize
-		 * that printout, we really want to make sure that if some
-		 * other CPU somehow locked up while holding the lock associated
-		 * with printk_cpu_sync_get_irqsave() that we can still at least
-		 * get the message about the lockup out.
-		 */
-		pr_emerg("CPU%u: Watchdog detected hard LOCKUP on cpu %u\n", this_cpu, cpu);
-		printk_cpu_sync_get_irqsave(flags);
+	/*
+	 * NOTE: we call printk_cpu_sync_get_irqsave() after printing
+	 * the lockup message. While it would be nice to serialize
+	 * that printout, we really want to make sure that if some
+	 * other CPU somehow locked up while holding the lock associated
+	 * with printk_cpu_sync_get_irqsave() that we can still at least
+	 * get the message about the lockup out.
+	 */
+	this_cpu = smp_processor_id();
+	pr_emerg("CPU%u: Watchdog detected hard LOCKUP on cpu %u\n", this_cpu, cpu);
+	printk_cpu_sync_get_irqsave(flags);
 
-		print_modules();
-		print_irqtrace_events(current);
-		if (cpu == this_cpu) {
-			if (regs)
-				show_regs(regs);
-			else
-				dump_stack();
-			printk_cpu_sync_put_irqrestore(flags);
-		} else {
-			printk_cpu_sync_put_irqrestore(flags);
-			trigger_single_cpu_backtrace(cpu);
-		}
+	print_modules();
+	print_irqtrace_events(current);
+	if (cpu == this_cpu) {
+		if (regs)
+			show_regs(regs);
+		else
+			dump_stack();
+		printk_cpu_sync_put_irqrestore(flags);
+	} else {
+		printk_cpu_sync_put_irqrestore(flags);
+		trigger_single_cpu_backtrace(cpu);
+	}
 
-		if (hardlockup_all_cpu_backtrace) {
-			trigger_allbutcpu_cpu_backtrace(cpu);
-			if (!hardlockup_panic)
-				clear_bit_unlock(0, &hard_lockup_nmi_warn);
-		}
+	if (hardlockup_all_cpu_backtrace) {
+		trigger_allbutcpu_cpu_backtrace(cpu);
+		if (!hardlockup_panic)
+			clear_bit_unlock(0, &hard_lockup_nmi_warn);
+	}
 
-		sys_info(hardlockup_si_mask & ~SYS_INFO_ALL_BT);
-		if (hardlockup_panic)
-			nmi_panic(regs, "Hard LOCKUP");
+	sys_info(hardlockup_si_mask & ~SYS_INFO_ALL_BT);
+	if (hardlockup_panic)
+		nmi_panic(regs, "Hard LOCKUP");
 
-		per_cpu(watchdog_hardlockup_warned, cpu) = true;
-	} else {
-		per_cpu(watchdog_hardlockup_warned, cpu) = false;
-	}
+	per_cpu(watchdog_hardlockup_warned, cpu) = true;
 }
 
 #else /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */