Commit | Line | Data |
---|---|---|
eb414681 JW |
1 | ================================ |
2 | PSI - Pressure Stall Information | |
3 | ================================ | |
4 | ||
5 | :Date: April, 2018 | |
6 | :Author: Johannes Weiner <hannes@cmpxchg.org> | |
7 | ||
8 | When CPU, memory or IO devices are contended, workloads experience | |
9 | latency spikes, throughput losses, and run the risk of OOM kills. | |
10 | ||
11 | Without an accurate measure of such contention, users are forced to | |
12 | either play it safe and under-utilize their hardware resources, or | |
13 | roll the dice and frequently suffer the disruptions resulting from | |
14 | excessive overcommit. | |
15 | ||
16 | The psi feature identifies and quantifies the disruptions caused by | |
17 | such resource crunches and the time impact they have on complex workloads | |
18 | or even entire systems. | |
19 | ||
20 | Having an accurate measure of productivity losses caused by resource | |
21 | scarcity aids users in sizing workloads to hardware--or provisioning | |
22 | hardware according to workload demand. | |
23 | ||
24 | As psi aggregates this information in realtime, systems can be managed | |
25 | dynamically using techniques such as load shedding, migrating jobs to | |
26 | other systems or data centers, or strategically pausing or killing low | |
27 | priority or restartable batch jobs. | |
28 | ||
29 | This allows maximizing hardware utilization without sacrificing | |
30 | workload health or risking major disruptions such as OOM kills. | |
31 | ||
32 | Pressure interface | |
33 | ================== | |
34 | ||
35 | Pressure information for each resource is exported through the | |
36 | respective file in /proc/pressure/ -- cpu, memory, and io. | |
37 | ||
38 | The format for CPU is as such: | |
39 | ||
40 | some avg10=0.00 avg60=0.00 avg300=0.00 total=0 | |
41 | ||
42 | and for memory and IO: | |
43 | ||
44 | some avg10=0.00 avg60=0.00 avg300=0.00 total=0 | |
45 | full avg10=0.00 avg60=0.00 avg300=0.00 total=0 | |
46 | ||
47 | The "some" line indicates the share of time in which at least some | |
48 | tasks are stalled on a given resource. | |
49 | ||
50 | The "full" line indicates the share of time in which all non-idle | |
51 | tasks are stalled on a given resource simultaneously. In this state | |
52 | actual CPU cycles are going to waste, and a workload that spends | |
53 | extended time in this state is considered to be thrashing. This has | |
54 | severe impact on performance, and it's useful to distinguish this | |
55 | situation from a state where some tasks are stalled but the CPU is | |
56 | still doing productive work. As such, time spent in this subset of the | |
57 | stall state is tracked separately and exported in the "full" averages. | |
58 | ||
be87ab0a WL |
59 | The ratios (in %) are tracked as recent trends over ten, sixty, and |
60 | three hundred second windows, which gives insight into short term events | |
61 | as well as medium and long term trends. The total absolute stall time | |
62 | (in us) is tracked and exported as well, to allow detection of latency | |
63 | spikes which wouldn't necessarily make a dent in the time averages, | |
64 | or to average trends over custom time frames. | |
2ce7135a | 65 | |
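As an illustration of the line format described above, the following is a minimal sketch of a parser for one "some" or "full" line. The struct psi_line type and the parse_psi_line() name are hypothetical helpers introduced for this example; they are not part of any kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed line; not a kernel structure. */
struct psi_line {
	char kind[8];                /* "some" or "full" */
	double avg10, avg60, avg300; /* stall time ratios in percent */
	unsigned long long total_us; /* cumulative stall time in microseconds */
};

/*
 * Parse one line of a pressure file.
 * Returns 0 on success, -1 if the line does not match the format.
 */
int parse_psi_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

To average over a custom time frame, a caller could sample total_us twice and divide the difference by the elapsed wall-clock time in microseconds.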
0e94682b SB |
66 | Monitoring for pressure thresholds |
67 | ================================== | |
68 | ||
69 | Users can register triggers and use poll() to be woken up when resource | |
70 | pressure exceeds certain thresholds. | |
71 | ||
72 | A trigger describes the maximum cumulative stall time over a specific | |
73 | time window, e.g. 100ms of total stall time within any 500ms window to | |
74 | generate a wakeup event. | |
75 | ||
76 | To register a trigger, the user has to open the psi interface file under | |
77 | /proc/pressure/ representing the resource to be monitored and write the | |
78 | desired threshold and time window. The open file descriptor should be | |
79 | used to wait for trigger events using select(), poll() or epoll(). | |
80 | The following format is used: | |
81 | ||
82 | <some|full> <stall amount in us> <time window in us> | |
83 | ||
84 | For example, writing "some 150000 1000000" into /proc/pressure/memory | |
85 | would add a 150ms threshold for partial memory stall measured within a | |
86 | 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io | |
87 | would add a 50ms threshold for full io stall measured within a 1sec time window. | |
88 | ||
89 | Triggers can be set on more than one psi metric, and more than one trigger | |
90 | for the same psi metric can be specified. However, each trigger requires a | |
91 | separate file descriptor so that it can be polled separately from the others; | |
92 | therefore, a separate open() syscall should be made for each trigger, even | |
93 | when opening the same psi interface file. | |
94 | ||
95 | Monitors activate only when the system enters a stall state for the monitored | |
96 | psi metric and deactivate upon exit from the stall state. While the system is | |
97 | in the stall state, psi signal growth is monitored at a rate of 10 times per | |
98 | tracking window. | |
99 | ||
100 | The kernel accepts window sizes ranging from 500ms to 10s; therefore the minimum | |
101 | monitoring update interval is 50ms and the maximum is 1s. The minimum limit is set to | |
102 | prevent overly frequent polling. The maximum limit is chosen as a high enough number | |
103 | after which monitors are most likely not needed and psi averages can be used | |
104 | instead. | |
105 | ||
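The window limits and the rate of 10 updates per window can be checked before a trigger string is written. The sketch below is illustrative only: the EX_* constants and both function names are hypothetical. The check that the threshold is non-zero and no larger than the window is an assumption of this example; the text above does not state it.

```c
#include <stdio.h>
#include <string.h>

/* Limits taken from the text above; the names are made up for this sketch. */
#define EX_WINDOW_MIN_US        500000ULL /* 500ms */
#define EX_WINDOW_MAX_US      10000000ULL /* 10s */
#define EX_UPDATES_PER_WINDOW       10ULL

/* Returns the monitoring update interval in us, or 0 for an invalid window. */
unsigned long long psi_update_interval_us(unsigned long long window_us)
{
	if (window_us < EX_WINDOW_MIN_US || window_us > EX_WINDOW_MAX_US)
		return 0;
	return window_us / EX_UPDATES_PER_WINDOW;
}

/*
 * Format "<some|full> <stall amount in us> <time window in us>" into buf.
 * Returns the string length on success, -1 on invalid input.
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned long long threshold_us,
		       unsigned long long window_us)
{
	int n;

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;
	if (psi_update_interval_us(window_us) == 0)
		return -1;
	/* Assumption of this sketch: a sane threshold fits inside the window. */
	if (threshold_us == 0 || threshold_us > window_us)
		return -1;

	n = snprintf(buf, len, "%s %llu %llu", kind, threshold_us, window_us);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

With a 1s window, for instance, psi_update_interval_us() yields 100ms, which matches the "10 times per tracking window" rule.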
106 | When activated, a psi monitor stays active for at least the duration of one | |
107 | tracking window to avoid repeated activations/deactivations when the system is | |
108 | bouncing in and out of the stall state. | |
109 | ||
110 | Notifications to userspace are rate-limited to one per tracking window. | |
111 | ||
112 | The trigger will de-register when the file descriptor used to define the | |
113 | trigger is closed. | |
114 | ||
115 | Userspace monitor usage example | |
116 | =============================== | |
117 | ||
118 | #include <errno.h> | |
119 | #include <fcntl.h> | |
120 | #include <stdio.h> | |
121 | #include <poll.h> | |
122 | #include <string.h> | |
123 | #include <unistd.h> | |
124 | ||
125 | /* | |
126 | * Monitor memory partial stall with 1s tracking window size | |
127 | * and 150ms threshold. | |
128 | */ | |
129 | int main() { | |
130 | const char trig[] = "some 150000 1000000"; | |
131 | struct pollfd fds; | |
132 | int n; | |
133 | ||
134 | fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK); | |
135 | if (fds.fd < 0) { | |
136 | printf("/proc/pressure/memory open error: %s\n", | |
137 | strerror(errno)); | |
138 | return 1; | |
139 | } | |
140 | fds.events = POLLPRI; | |
141 | ||
142 | if (write(fds.fd, trig, strlen(trig) + 1) < 0) { | |
143 | printf("/proc/pressure/memory write error: %s\n", | |
144 | strerror(errno)); | |
145 | return 1; | |
146 | } | |
147 | ||
148 | printf("waiting for events...\n"); | |
149 | while (1) { | |
150 | n = poll(&fds, 1, -1); | |
151 | if (n < 0) { | |
152 | printf("poll error: %s\n", strerror(errno)); | |
153 | return 1; | |
154 | } | |
155 | if (fds.revents & POLLERR) { | |
156 | printf("got POLLERR, event source is gone\n"); | |
157 | return 0; | |
158 | } | |
159 | if (fds.revents & POLLPRI) { | |
160 | printf("event triggered!\n"); | |
161 | } else { | |
162 | printf("unknown event received: 0x%x\n", fds.revents); | |
163 | return 1; | |
164 | } | |
165 | } | |
166 | ||
167 | return 0; | |
168 | } | |
169 | ||
2ce7135a JW |
170 | Cgroup2 interface |
171 | ================= | |
172 | ||
173 | In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem | |
174 | mounted, pressure stall information is also tracked for tasks grouped | |
175 | into cgroups. Each subdirectory in the cgroupfs mountpoint contains | |
176 | cpu.pressure, memory.pressure, and io.pressure files; the format is | |
177 | the same as the /proc/pressure/ files. | |
0e94682b SB |
178 | |
179 | Per-cgroup psi monitors can be specified and used the same way as | |
180 | system-wide ones. |
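Since the per-cgroup files share the format and trigger protocol of the /proc/pressure/ files, a monitor only needs a different path. The helper below is a hypothetical sketch that builds such a path; the function name is made up, and the cgroup directory in the usage note is an assumed mountpoint and group name, not something the text specifies.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure" into buf.
 * Returns 0 on success, -1 on an unknown resource or a too-small buffer.
 */
int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
		    const char *resource)
{
	int n;

	/* Only the three resources named in the text are accepted. */
	if (strcmp(resource, "cpu") != 0 &&
	    strcmp(resource, "memory") != 0 &&
	    strcmp(resource, "io") != 0)
		return -1;

	n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

For example, with an assumed cgroup directory of /sys/fs/cgroup/example, the resulting path for "memory" could be passed to open() in place of "/proc/pressure/memory" in the userspace monitor example.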