From: Zbigniew Jędrzejewski-Szmek Date: Fri, 20 Dec 2024 18:14:51 +0000 (+0100) Subject: shared/watchdog: ratelimit the number of attempts to open watchdog X-Git-Tag: v258-rc1~1012^2~3 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=b27c9279f6cbed793615bcc6fcdeeac6a8c5ec54;p=thirdparty%2Fsystemd.git shared/watchdog: ratelimit the number of attempts to open watchdog We need to retry the open attempts for the watchdog, because the device becomes available asynchronously. The watchdog is opened in two places: - in pid1 in the main loop. The loop has a ratelimit, but during a boot we iterate in it fairly quickly. On my test VM with 'iTCO_wdt', version 2: $ journalctl -b --grep 'Failed to open any watchdog' | wc -l 3398 After the device has been processed by udev, it is initialized successfully. - in shutdown. In that case, we most likely don't need to try more than once, because we mostly care about the case where the watchdog device was present and configured previously. But in principle it is possible that we might attempt shutdown while the machine was initializing, so we don't want to disable retries. Nevertheless, watchdog_ping() is called from a loop that might be fairly tight, so we could end up trying to reopen the device fairly often. This probably doesn't matter *too* much, but it's still ugly to try to open the device without any ratelimit. Usually the watchdog timeout would be set to something like 30 s or a few minutes. OTOH, on my VM, the device becomes avaiable at 4.35 s after boot. So let's use 5 s or half the watchdog timeout, whatever is smaller, as the interval. --- diff --git a/src/shared/watchdog.c b/src/shared/watchdog.c index 22c28f0c374..449c070f8f4 100644 --- a/src/shared/watchdog.c +++ b/src/shared/watchdog.c @@ -310,12 +310,20 @@ static int watchdog_update_timeout(void) { return watchdog_ping_now(); } -static int watchdog_open(void) { +static int watchdog_open(bool ignore_ratelimit) { + static RateLimit watchdog_open_ratelimit = { 0, 1 }; /* The interval is initialized dynamically below. */ struct watchdog_info ident; char **try_order; int r; assert(watchdog_fd < 0); + assert(watchdog_timeout > 0); + + watchdog_open_ratelimit.interval = MIN(watchdog_timeout / 2, 5 * USEC_PER_SEC); + + if (!ratelimit_below(&watchdog_open_ratelimit)) + if (!ignore_ratelimit) /* If ignore_ratelimit, we update the timestamp, but continue. */ + return -EWOULDBLOCK; /* Let's prefer new-style /dev/watchdog0 (i.e. kernel 3.5+) over classic /dev/watchdog. The former * has the benefit that we can easily find the matching directory in sysfs from it, as the relevant @@ -405,7 +413,7 @@ int watchdog_setup(usec_t timeout) { watchdog_timeout = timeout; if (watchdog_fd < 0) - return watchdog_open(); + return watchdog_open(/* ignore_ratelimit= */ false); r = watchdog_update_timeout(); if (r < 0) @@ -465,7 +473,7 @@ int watchdog_ping(void) { if (watchdog_fd < 0) /* open_watchdog() will automatically ping the device for us if necessary */ - return watchdog_open(); + return watchdog_open(/* ignore_ratelimit= */ false); /* Never ping earlier than watchdog_timeout/4 and try to ping * by watchdog_timeout/2 plus scheduling latencies at the latest */ @@ -490,13 +498,19 @@ void watchdog_report_if_missing(void) { * * If a device was specified explicitly, raise level. */ - if (watchdog_fd == -ENOENT) - log_full_errno(watchdog_device ? LOG_WARNING : LOG_NOTICE, - watchdog_fd, - "Failed to open %swatchdog device%s%s before the initial transaction completed: %m", - watchdog_device ? "" : "any ", - watchdog_device ? " " : "", - strempty(watchdog_device)); + if (watchdog_fd != -ENOENT) + return; + + /* We attempt to open the watchdog one last time here, so that if we log, we log accurately. */ + if (watchdog_open(/* ignore_ratelimit= */ true) >= 0) + return; + + log_full_errno(watchdog_device ? LOG_WARNING : LOG_NOTICE, + watchdog_fd, + "Failed to open %swatchdog device%s%s before the initial transaction completed: %m", + watchdog_device ? "" : "any ", + watchdog_device ? " " : "", + strempty(watchdog_device)); } void watchdog_close(bool disarm) {