From a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sat, 6 Apr 2013 10:10:27 +0200
Subject: sched_clock: Prevent 64bit inatomicity on 32bit systems

From: Thomas Gleixner <tglx@linutronix.de>

commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream.

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

    CPU0                          CPU1

    sched_clock_local()           sched_clock_remote(CPU0)
    ...
                                  remote_clock = scd[CPU0]->clock
                                      read_low32bit(scd[CPU0]->clock)
    cmpxchg64(scd->clock,...)
                                      read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.
36 | ||
37 | The issue is hard to observe, because it might only result in a less | |
38 | accurate SCHED_OTHER timeslicing behaviour. To create observable | |
39 | damage on realtime scheduling classes, it is necessary that the bogus | |
40 | update of CPU1 sched_clock happens in the context of an realtime | |
41 | thread, which then gets charged 4 seconds of RT runtime, which results | |
42 | in the RT throttler mechanism to trigger and prevent scheduling of RT | |
43 | tasks for a little less than 4 seconds. So this is quite unlikely as | |
44 | well. | |
45 | ||
46 | The issue was quite hard to decode as the reproduction time is between | |
47 | 2 days and 3 weeks and intrusive tracing makes it less likely, but the | |
48 | following trace recorded with trace_clock=global, which uses | |
49 | sched_clock_local(), gave the final hint: | |
50 | ||
51 | <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 | |
52 | <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... | |
53 | irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 | |
54 | <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 | |
55 | ||

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
    wrong value.

 2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.
So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.
91 | ||
92 | Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> | |
93 | Acked-by: Peter Zijlstra <peterz@infradead.org> | |
94 | Cc: Steven Rostedt <rostedt@goodmis.org> | |
95 | Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos | |
96 | Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | |
97 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
98 | ||
99 | --- | |
100 | kernel/sched/clock.c | 26 ++++++++++++++++++++++++++ | |
101 | 1 file changed, 26 insertions(+) | |
102 | ||
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -176,10 +176,36 @@ static u64 sched_clock_remote(struct sch
 	u64 this_clock, remote_clock;
 	u64 *ptr, old_val, val;
 
+#if BITS_PER_LONG != 64
+again:
+	/*
+	 * Careful here: The local and the remote clock values need to
+	 * be read out atomic as we need to compare the values and
+	 * then update either the local or the remote side. So the
+	 * cmpxchg64 below only protects one readout.
+	 *
+	 * We must reread via sched_clock_local() in the retry case on
+	 * 32bit as an NMI could use sched_clock_local() via the
+	 * tracer and hit between the readout of
+	 * the low32bit and the high 32bit portion.
+	 */
+	this_clock = sched_clock_local(my_scd);
+	/*
+	 * We must enforce atomic readout on 32bit, otherwise the
+	 * update on the remote cpu can hit inbetween the readout of
+	 * the low32bit and the high 32bit portion.
+	 */
+	remote_clock = cmpxchg64(&scd->clock, 0, 0);
+#else
+	/*
+	 * On 64bit the read of [my]scd->clock is atomic versus the
+	 * update, so we can avoid the above 32bit dance.
+	 */
 	sched_clock_local(my_scd);
 again:
 	this_clock = my_scd->clock;
 	remote_clock = scd->clock;
+#endif
 
 	/*
 	 * Use the opportunity that we have both locks