From: Willy Tarreau Date: Sun, 8 Sep 2024 17:15:38 +0000 (+0200) Subject: BUG/MEDIUM: clock: detect and cover jumps during execution X-Git-Tag: v3.1-dev8~46 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=ef8d8215de2ed03a08172bbccb1146e5b867ce74;p=thirdparty%2Fhaproxy.git BUG/MEDIUM: clock: detect and cover jumps during execution After commit e8b1ad4c2 ("BUG/MEDIUM: clock: also update the date offset on time jumps"), @firexinghe mentioned that the issue was still present in their case. In fact it depends on the load, which affects the probability that the time changes between two poll() calls vs that it changes during poll(). The time correction code used to only deal with the latter. But under load if it changes between two poll() calls, what happens then is that before_poll is off, and after returning from poll(), the date is within bounds defined by before_poll, so no correction is applied. After many tests, it turns out that the most reliable solution without using CLOCK_MONOTONIC is to prevent before_poll from being earlier than the previous after_poll (trivial), and to cover forward jumps, we need to enforce a margin. Given that the watchdog kills a looping task within 2 seconds and that no sane setup triggers it, it seems that 2 seconds remains a safe enough margin. This means that in the worst case, some forward jumps of up to 2 seconds will not be corrected, leading to an apparent fast time and low rates. But this is supposed to be an exceptional event anyway (typically an admin or crontab running ntpdate). For future versions, given that we now opportunistically call now_mono_time() before and after poll(), that returns zero if not supported, we could imagine relying on this one for the thread's local time when it's non-null. --- diff --git a/src/clock.c b/src/clock.c index 296bf63320..3b64b9b149 100644 --- a/src/clock.c +++ b/src/clock.c @@ -403,6 +403,22 @@ void clock_entering_poll(void) gettimeofday(&before_poll, NULL); + /* The time might have jumped either backwards or forwards during tasks + * processing. It's easy to detect a backwards jump, but a forward jump + * needs a marging. Here the upper limit of 2 seconds corresponds to a + * large margin at which the watchdog would already trigger so it looks + * sufficient to avoid false positives most of the time. The goal here + * is to make sure that before_poll can be trusted when entering + * clock_update_local_date() so that we can detect and fix time jumps. + * All this will also make sure we don't report idle/run times that are + * too much wrong during such jumps. + */ + + if (unlikely(__tv_islt(&before_poll, &after_poll))) + before_poll = after_poll; + else if (unlikely(__tv_ms_elapsed(&after_poll, &before_poll) >= 2000)) + tv_ms_add(&before_poll, &after_poll, 2000); + run_time = (before_poll.tv_sec - after_poll.tv_sec) * 1000000U + (before_poll.tv_usec - after_poll.tv_usec); new_cpu_time = now_cpu_time();