Let's say a user has two services with ManagedOOMMemoryPressure=kill,
one for a web server under system.slice, and one for a batch job under
user.slice. The batch job is causing severe memory pressure, whereas the
web server's cgroup has no processes with significant pgscan activity.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When
oomd_select_by_pgscan_rate() returns 0 (that is, no candidates) for a
target, we return from the entire SET_FOREACH loop instead of moving to
the next target. Since SET_FOREACH iteration order is hash-dependent, if
the web server target happens to be visited first, oomd will find no
kill candidates for it and exit the loop. The batch job target that is
actually slamming the machine will never even be evaluated, and can
continue to wreak havoc without any intervention.
The effect is that oomd non-deterministically and silently fails to kill
cgroups that it should actually kill, allowing memory pressure to
persist and dangerously build up on the machine.
The fix is simple, keep evaluating remaining targets when one does not
match.
return log_error_errno(r, "Failed to select any cgroups based on swap, ignoring: %m");
if (r == 0) {
log_debug("No cgroup candidates found for memory pressure-based OOM action for %s", t->path);
- return 0;
+ continue;
}
r = oomd_cgroup_kill_mark(m, selected, "memory-pressure");