From: Greg Kroah-Hartman
Date: Tue, 13 Feb 2024 16:19:01 +0000 (+0100)
Subject: 6.1-stable patches
X-Git-Tag: v6.1.78~21
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=58d963dd577ac2ecdbb54e7f594d77cdde29463e;p=thirdparty%2Fkernel%2Fstable-queue.git

6.1-stable patches

added patches:
	clocksource-skip-watchdog-check-for-large-watchdog-intervals.patch
---

diff --git a/queue-6.1/clocksource-skip-watchdog-check-for-large-watchdog-intervals.patch b/queue-6.1/clocksource-skip-watchdog-check-for-large-watchdog-intervals.patch
new file mode 100644
index 00000000000..1d31cf0ce59
--- /dev/null
+++ b/queue-6.1/clocksource-skip-watchdog-check-for-large-watchdog-intervals.patch
@@ -0,0 +1,136 @@
+From 644649553508b9bacf0fc7a5bdc4f9e0165576a5 Mon Sep 17 00:00:00 2001
+From: Jiri Wiesner
+Date: Mon, 22 Jan 2024 18:23:50 +0100
+Subject: clocksource: Skip watchdog check for large watchdog intervals
+
+From: Jiri Wiesner
+
+commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5 upstream.
+
+There have been reports of the watchdog marking clocksources unstable on
+machines with 8 NUMA nodes:
+
+  clocksource: timekeeping watchdog on CPU373:
+  Marking clocksource 'tsc' as unstable because the skew is too large:
+  clocksource: 'hpet' wd_nsec: 14523447520
+  clocksource: 'tsc' cs_nsec: 14524115132
+
+The measured clocksource skew - the absolute difference between cs_nsec
+and wd_nsec - was 668 microseconds:
+
+  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
+
+The kernel used 200 microseconds for the uncertainty_margin of both the
+clocksource and watchdog, resulting in a threshold of 400 microseconds (the
+md variable). Both the cs_nsec and the wd_nsec value indicate that the
+readout interval was circa 14.5 seconds. The observed behaviour is that
+watchdog checks failed for large readout intervals on 8 NUMA node
+machines. This indicates that the size of the skew was directly proportional
+to the length of the readout interval on those machines.
The measured
+clocksource skew, 668 microseconds, was evaluated against a threshold (the
+md variable) that is suited for readout intervals of roughly
+WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
+
+The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
+threshold") was to tighten the threshold for evaluating skew and set the
+lower bound for the uncertainty_margin of clocksources to twice
+WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
+watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
+125 microseconds to fit the limit of NTP, which is able to use a
+clocksource that suffers from up to 500 microseconds of skew per second.
+Both the TSC and the HPET use default uncertainty_margin. When the
+readout interval gets stretched the default uncertainty_margin is no
+longer a suitable lower bound for evaluating skew - it imposes a limit
+that is far stricter than the skew with which NTP can deal.
+
+The root causes of the skew being directly proportional to the length of
+the readout interval are:
+
+  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
+  * the conversion to nanoseconds is imprecise for large readout intervals
+
+Prevent this by skipping the current watchdog check if the readout
+interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
+interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
+(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
+ppm (microseconds of skew per second). To keep the limit imposed by NTP
+(500 microseconds of skew per second) for all possible readout intervals,
+the margins would have to be scaled so that the threshold value is
+proportional to the length of the actual readout interval.
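Editorial illustration, not part of the queued patch: the arithmetic in the commit message can be reproduced with a few lines of user-space C. This is only a sketch under stated assumptions. The `EXAMPLE_*` macros and the helpers `skew_ns()`, `fixed_threshold_ns()` and `skew_ppm()` are hypothetical names that do not exist in the kernel; the numeric inputs are the ones quoted in the log excerpt and in the text above.

```c
#include <stdint.h>

/* Readouts copied from the log excerpt in the commit message. */
#define EXAMPLE_WD_NSEC 14523447520LL	/* 'hpet' wd_nsec */
#define EXAMPLE_CS_NSEC 14524115132LL	/* 'tsc' cs_nsec */

/* 200 us per clocksource, the margin the commit message reports. */
#define EXAMPLE_MARGIN_NS 200000LL

/* Absolute difference between the two readouts, in nanoseconds. */
static int64_t skew_ns(int64_t cs_nsec, int64_t wd_nsec)
{
	int64_t d = cs_nsec - wd_nsec;

	return d < 0 ? -d : d;
}

/* Fixed threshold: the sum of both uncertainty margins (the md variable). */
static int64_t fixed_threshold_ns(int64_t cs_margin_ns, int64_t wd_margin_ns)
{
	return cs_margin_ns + wd_margin_ns;
}

/*
 * Skew relative to the readout interval, in whole ppm
 * (microseconds of skew per second), truncated by integer division.
 */
static int64_t skew_ppm(int64_t skew, int64_t interval_ns)
{
	return skew * 1000000LL / interval_ns;
}
```

With the values above, the skew of 667612 ns exceeds the fixed 400000 ns threshold, yet relative to the roughly 14.5 second interval it amounts to about 46 microseconds per second (`skew_ppm()` truncates this to 45), well below the 500 microseconds per second that the commit message attributes to NTP.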
+
+As for why the readout interval may get stretched: Since the watchdog is
+executed in softirq context the expiration of the watchdog timer can get
+severely delayed on account of a ksoftirqd thread not getting to run in a
+timely manner. Surely, a system with such belated softirq execution is not
+working well and the scheduling issue should be looked into but the
+clocksource watchdog should be able to deal with it accordingly.
+
+Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
+Suggested-by: Feng Tang
+Signed-off-by: Jiri Wiesner
+Signed-off-by: Thomas Gleixner
+Tested-by: Paul E. McKenney
+Reviewed-by: Feng Tang
+Cc: stable@vger.kernel.org
+Link: https://lore.kernel.org/r/20240122172350.GA740@incl
+Signed-off-by: Greg Kroah-Hartman
+---
+ kernel/time/clocksource.c | 25 ++++++++++++++++++++++++-
+ 1 file changed, 24 insertions(+), 1 deletion(-)
+
+--- a/kernel/time/clocksource.c
++++ b/kernel/time/clocksource.c
+@@ -126,6 +126,7 @@ static DECLARE_WORK(watchdog_work, clock
+ static DEFINE_SPINLOCK(watchdog_lock);
+ static int watchdog_running;
+ static atomic_t watchdog_reset_pending;
++static int64_t watchdog_max_interval;
+ 
+ static inline void clocksource_watchdog_lock(unsigned long *flags)
+ {
+@@ -144,6 +145,7 @@ static void __clocksource_change_rating(
+  * Interval: 0.5sec.
+  */
+ #define WATCHDOG_INTERVAL (HZ >> 1)
++#define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))
+ 
+ static void clocksource_watchdog_work(struct work_struct *work)
+ {
+@@ -396,8 +398,8 @@ static inline void clocksource_reset_wat
+ static void clocksource_watchdog(struct timer_list *unused)
+ {
+ 	u64 csnow, wdnow, cslast, wdlast, delta;
++	int64_t wd_nsec, cs_nsec, interval;
+ 	int next_cpu, reset_pending;
+-	int64_t wd_nsec, cs_nsec;
+ 	struct clocksource *cs;
+ 	enum wd_read_status read_ret;
+ 	unsigned long extra_wait = 0;
+@@ -467,6 +469,27 @@ static void clocksource_watchdog(struct
+ 		if (atomic_read(&watchdog_reset_pending))
+ 			continue;
+ 
++		/*
++		 * The processing of timer softirqs can get delayed (usually
++		 * on account of ksoftirqd not getting to run in a timely
++		 * manner), which causes the watchdog interval to stretch.
++		 * Skew detection may fail for longer watchdog intervals
++		 * on account of fixed margins being used.
++		 * Some clocksources, e.g. acpi_pm, cannot tolerate
++		 * watchdog intervals longer than a few seconds.
++		 */
++		interval = max(cs_nsec, wd_nsec);
++		if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS)) {
++			if (system_state > SYSTEM_SCHEDULING &&
++			    interval > 2 * watchdog_max_interval) {
++				watchdog_max_interval = interval;
++				pr_warn("Long readout interval, skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
++					cs_nsec, wd_nsec);
++			}
++			watchdog_timer.expires = jiffies;
++			continue;
++		}
++
+ 		/* Check the deviation from the watchdog clocksource.
*/
+ 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ 		if (abs(cs_nsec - wd_nsec) > md) {
diff --git a/queue-6.1/series b/queue-6.1/series
index 20bd8922cc9..1d4138d4228 100644
--- a/queue-6.1/series
+++ b/queue-6.1/series
@@ -61,3 +61,4 @@ vhost-use-kzalloc-instead-of-kmalloc-followed-by-memset.patch
 rdma-irdma-fix-support-for-64k-pages.patch
 f2fs-add-helper-to-check-compression-level.patch
 block-treat-poll-queue-enter-similarly-to-timeouts.patch
+clocksource-skip-watchdog-check-for-large-watchdog-intervals.patch
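
Editorial illustration, not part of the commit: the skip logic that the queued patch adds to clocksource_watchdog() can be modelled in user space roughly as follows. This is a simplified sketch, not the kernel implementation. `struct wd_model`, `wd_should_skip()` and the `EXAMPLE_*` macros are hypothetical names, a tick rate of 250 is an assumption chosen only for the example, and the `system_state` test and the re-arming of `watchdog_timer` are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_HZ 250				/* assumed tick rate, not taken from the patch */
#define EXAMPLE_NSEC_PER_SEC 1000000000LL
#define EXAMPLE_WATCHDOG_INTERVAL (EXAMPLE_HZ >> 1)	/* 0.5 s in ticks */
/* Same shape as WATCHDOG_INTERVAL_MAX_NS in the patch: 2 * interval, in ns. */
#define EXAMPLE_INTERVAL_MAX_NS \
	((2 * EXAMPLE_WATCHDOG_INTERVAL) * (EXAMPLE_NSEC_PER_SEC / EXAMPLE_HZ))

struct wd_model {
	int64_t max_interval;	/* plays the role of watchdog_max_interval */
	int warnings;		/* counts where the kernel would call pr_warn() */
};

/* Returns true when the skew check for this readout should be skipped. */
static bool wd_should_skip(struct wd_model *m, int64_t cs_nsec, int64_t wd_nsec)
{
	int64_t interval = cs_nsec > wd_nsec ? cs_nsec : wd_nsec;

	if (interval <= EXAMPLE_INTERVAL_MAX_NS)
		return false;

	/* Warn again only once the interval is more than twice the last reported one. */
	if (interval > 2 * m->max_interval) {
		m->max_interval = interval;
		m->warnings++;
	}
	return true;
}
```

Fed the readouts from the commit message (about 14.5 seconds), the model skips the check and records one warning; a later 20 second interval is skipped silently, because it is not more than twice the interval already reported.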