================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.
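
As an illustration of the format above, the following is a minimal
sketch of how userspace might parse one line of a pressure file and
derive a stall percentage over a custom time frame from two samples of
the "total" field. The struct and function names (psi_line,
psi_parse_line, psi_custom_avg) are hypothetical helpers made up for
this example; they are not part of any kernel or library interface.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file (hypothetical helper type). */
struct psi_line {
	char kind[8];                 /* "some" or "full" */
	double avg10, avg60, avg300;  /* stall ratios in % */
	unsigned long long total;     /* cumulative stall time in us */
};

/*
 * Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int psi_parse_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}

/*
 * Stall percentage over a custom interval, computed from two samples
 * of the "total" field taken interval_us microseconds apart.
 */
static double psi_custom_avg(unsigned long long total_before,
			     unsigned long long total_after,
			     unsigned long long interval_us)
{
	if (interval_us == 0 || total_after < total_before)
		return 0.0;
	return 100.0 * (double)(total_after - total_before) /
	       (double)interval_us;
}
```

For instance, if "total" grows from 1000000 to 1250000 between two
reads taken 5 seconds apart, psi_custom_avg() reports that 5% of that
interval was spent stalled.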

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used:

<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.
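
The trigger string can be assembled programmatically. The sketch below
shows one possible way to do so; psi_format_trigger is a hypothetical
helper name chosen for this example, and it only builds the string
described above without writing it to any file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "<some|full> <stall amount in us> <time window in us>" into buf.
 * Returns the length of the resulting string, or -1 if kind is neither
 * "some" nor "full" or if buf is too small.
 */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned long stall_us,
			      unsigned long window_us)
{
	int n;

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;

	n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Calling psi_format_trigger(buf, sizeof(buf), "some", 150000, 1000000)
fills buf with the first example string from the paragraph above.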

Triggers can be set on more than one psi metric, and more than one trigger
can be specified for the same psi metric. However, each trigger requires a
separate file descriptor so that it can be polled separately from the
others; a separate open() syscall should therefore be made for each
trigger, even when opening the same psi interface file.

Monitors activate only when the system enters the stall state for the
monitored psi metric and deactivate upon exit from the stall state. While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s; therefore the
minimum monitoring update interval is 50ms and the maximum is 1s. The
minimum limit is set to prevent overly frequent polling. The maximum limit
is chosen as a high enough number after which monitors are most likely not
needed and psi averages can be used instead.
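
The relationship between window size and update interval can be checked
in userspace before a trigger is written. The following is a small
sketch under the limits stated above (500ms to 10s, 10 updates per
window); the macro and function names are hypothetical and do not
correspond to kernel symbols.

```c
#include <stdbool.h>

/* Limits as stated in the text; names are made up for this example. */
#define PSI_WINDOW_MIN_US        500000UL	/* 500ms */
#define PSI_WINDOW_MAX_US      10000000UL	/* 10s */
#define PSI_UPDATES_PER_WINDOW       10UL

/* True if the window size falls within the accepted range. */
static bool psi_window_valid(unsigned long window_us)
{
	return window_us >= PSI_WINDOW_MIN_US &&
	       window_us <= PSI_WINDOW_MAX_US;
}

/*
 * Monitoring update interval in us for a given window,
 * or 0 if the window is outside the accepted range.
 */
static unsigned long psi_update_interval_us(unsigned long window_us)
{
	if (!psi_window_valid(window_us))
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}
```

With these definitions, a 500ms window yields a 50ms update interval and
a 10s window yields a 1s update interval, matching the bounds above.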

When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the system
is bouncing in and out of the stall state.

Notifications to the userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Monitor memory partial stall with 1s tracking window size
 * and 150ms threshold.
 */
int main() {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
}

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
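
Since the per-cgroup files follow a fixed naming pattern, the path to
open for a given cgroup can be derived from the cgroup directory and the
resource name. The sketch below illustrates this; psi_cgroup_path is a
hypothetical helper, and the directory passed to it (for example a
cgroup under /sys/fs/cgroup, a common but not guaranteed cgroup2 mount
point) is an assumption of the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure" in buf, where resource is
 * one of "cpu", "memory" or "io".
 * Returns the path length, or -1 on an unknown resource or if buf is
 * too small.
 */
static int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
			   const char *resource)
{
	int n;

	if (strcmp(resource, "cpu") != 0 &&
	    strcmp(resource, "memory") != 0 &&
	    strcmp(resource, "io") != 0)
		return -1;

	n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The resulting path can then be opened and used for reading averages or
for registering a trigger in the same way as the /proc/pressure/ files.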