doc: watchdog: clarify hardlockup detection timing

author Mayank Rungta <mrungta@google.com>

Thu, 12 Mar 2026 23:22:04 +0000 (16:22 -0700)

committer Andrew Morton <akpm@linux-foundation.org>

Sat, 28 Mar 2026 04:19:47 +0000 (21:19 -0700)
diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst
index 3e09284a8b9bef75c0ac1607a1809ac3b8a4c1ea..1b374053771f676d874716b3210cade55ae89b28 100644
--- a/Documentation/admin-guide/lockup-watchdogs.rst
+++ b/Documentation/admin-guide/lockup-watchdogs.rst
@@ -16,7 +16,7 @@ details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are
  provided for this.
  
  A 'hardlockup' is defined as a bug that causes the CPU to loop in
-kernel mode for more than 10 seconds (see "Implementation" below for
+kernel mode for several seconds (see "Implementation" below for
  details), without letting other interrupts have a chance to run.
  Similarly to the softlockup case, the current stack trace is displayed
  upon detection and the system will stay locked up unless the default
@@ -64,6 +64,45 @@ administrators to configure the period of the hrtimer and the perf
  event. The right value for a particular environment is a trade-off
  between fast response to lockups and detection overhead.
  
+Detection Overhead
+------------------
+
+The hardlockup detector checks for lockups using a periodic NMI perf
+event. This means the time to detect a lockup can vary depending on
+when the lockup occurs relative to the NMI check window.
+
+**Best Case:**
+In the best case scenario, the lockup occurs just before the first
+heartbeat after an NMI check is due. The next check sees that no
+hrtimer interrupt arrived in the meantime and declares the lockup.
+
+::
+
+  Time 100.0: cpu1 heartbeat
+  Time 100.1: hardlockup_check, cpu1 stores its state
+  Time 103.9: Hard Lockup on cpu1
+  Time 104.0: cpu1 heartbeat never comes
+  Time 110.1: hardlockup_check, cpu1 finds its state unchanged, declares lockup
+
+  Time to detection: ~6 seconds
+
+**Worst Case:**
+In the worst case scenario, the lockup occurs shortly after a valid
+interrupt (heartbeat) which itself happened just after the NMI check.
+The next NMI check sees that the interrupt count has changed (due to
+that one heartbeat), assumes the CPU is healthy, and resets the
+baseline. The lockup is only detected at the subsequent check.
+
+::
+
+  Time 100.0: hardlockup_check, cpu1 stores its state
+  Time 100.1: cpu1 heartbeat
+  Time 100.2: Hard Lockup on cpu1
+  Time 110.0: hardlockup_check, cpu1 stores its state (misses lockup as state changed)
+  Time 120.0: hardlockup_check, cpu1 finds its state unchanged, declares lockup
+
+  Time to detection: ~20 seconds
+
  By default, the watchdog runs on all online cores.  However, on a
  kernel configured with NO_HZ_FULL, by default the watchdog runs only
  on the housekeeping cores, not the cores specified in the "nohz_full"
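
Note on the added timelines: the two detection times can be reproduced with a small simulation. The sketch below is illustrative only and is not the kernel implementation. The function name, its parameters, and the tenths-of-a-second time unit are assumptions made for this example. The 4-second heartbeat period and the 10-second check period are read off the timelines in the patch (heartbeats at 100.0 and 104.0, checks 10 seconds apart).

```python
def detection_latency(lockup, first_heartbeat, first_check,
                      heartbeat_period=40, check_period=100):
    """Return the delay between a lockup and its detection.

    All times are integers in tenths of a second. Heartbeats fire at
    first_heartbeat, first_heartbeat + heartbeat_period, ... and stop
    once the lockup begins. Each check compares the heartbeat count
    with the count saved by the previous check.
    """
    if heartbeat_period >= check_period:
        # Otherwise a healthy CPU could look locked up between two checks.
        raise ValueError("heartbeat_period must be shorter than check_period")

    saved = None  # heartbeat count stored by the previous check
    check = first_check
    while True:
        # Count heartbeats that fired no later than this check and
        # strictly before the lockup started.
        upper = min(check, lockup - 1)
        if upper < first_heartbeat:
            count = 0
        else:
            count = (upper - first_heartbeat) // heartbeat_period + 1

        if saved is not None and count == saved:
            return check - lockup  # no progress since last check: lockup
        saved = count              # progress seen: reset the baseline
        check += check_period


# Best case from the patch: heartbeat 100.0, check 100.1, lockup 103.9.
best = detection_latency(lockup=1039, first_heartbeat=1000, first_check=1001)
# Worst case from the patch: check 100.0, heartbeat 100.1, lockup 100.2.
worst = detection_latency(lockup=1002, first_heartbeat=1001, first_check=1000)
print(best / 10, worst / 10)
```

Under these assumptions the best case is detected after 6.2 seconds and the worst case after 19.8 seconds, which matches the "~6 seconds" and "~20 seconds" figures in the patch.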