oomd: Fix unnecessary delays during OOM kills with pending kills present
Let's say a user has two services with ManagedOOMMemoryPressure=kill,
perhaps a web server under system.slice and a batch job under
user.slice. Both exceed their pressure limits. On the previous timer
tick, oomd has already queued the web server's candidate for killing,
but the prekill hook has not yet responded, so the kill is still
pending.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When it reaches
the web server target, it calls oomd_cgroup_kill_mark(), which returns 0
because that cgroup is already queued. The code treats this the same as
a successful new kill: it resets the 15-second delay timer and returns
from the function, exiting the loop.
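The control flow described above can be modelled roughly as follows.
This is a simplified sketch, not the real oomd code: struct model_target,
model_kill_mark() and old_handler() are hypothetical stand-ins, and the
assumption that a newly created mark yields a positive return value is
made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pressure target; not the real oomd type. */
struct model_target {
        const char *name;
        bool queued;     /* a kill for this cgroup is still pending */
        bool evaluated;  /* set once the loop has looked at this target */
};

/* Models the return convention from the text: 0 when the cgroup is
 * already queued. Returning 1 for a new mark is an assumption. */
static int model_kill_mark(struct model_target *t) {
        if (t->queued)
                return 0;
        t->queued = true;
        return 1;
}

/* Old behaviour: every non-negative return value rearms the delay and
 * leaves the loop, whether or not a new mark was created. */
static int old_handler(struct model_target *targets, size_t n, bool *delay_reset) {
        assert(delay_reset != NULL);
        *delay_reset = false;

        for (size_t i = 0; i < n; i++) {
                targets[i].evaluated = true;

                int r = model_kill_mark(&targets[i]);
                if (r < 0)
                        continue;

                *delay_reset = true;  /* also taken when r == 0 */
                return r;
        }

        return 0;
}
```

With the already-queued web server first in the array, old_handler()
rearms the delay and returns without ever evaluating the batch job.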
The loop uses SET_FOREACH, so the iteration order is hash-dependent. If
the web server target happens to be visited first, oomd never evaluates
the batch job target at all.
The effect is twofold:
1. oomd stalls for 15 seconds despite not having initiated any new kill.
That can unnecessarily delay further action to stem increases in
memory pressure. The delay exists to let stale pressure counters
settle after a kill, but no kill has happened here.
2. It non-deterministically skips pressure targets that may have
unqueued candidates, allowing memory pressure to persist for longer
than it should.
Fix this by skipping cgroups that are already queued so the loop
proceeds to try other pressure targets. We should only delay when a new
kill mark is actually created.
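The intended loop behaviour can be sketched as below. This is an
illustrative model under stated assumptions, not the actual patch:
struct sketch_target, sketch_kill_mark() and fixed_handler() are
hypothetical names, and a positive return value for a newly created
mark is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pressure target; not the real oomd type. */
struct sketch_target {
        const char *name;
        bool queued;  /* a kill for this cgroup is still pending */
};

/* Assumed convention: 0 if already queued, 1 if a new mark was created. */
static int sketch_kill_mark(struct sketch_target *t) {
        if (t->queued)
                return 0;
        t->queued = true;
        return 1;
}

/* Fixed behaviour: skip targets whose candidate is already queued and
 * rearm the delay only when a new mark is created. Returns the index of
 * the newly marked target, or -1 if no new mark was created. */
static int fixed_handler(struct sketch_target *targets, size_t n, bool *delay_reset) {
        assert(delay_reset != NULL);
        *delay_reset = false;

        for (size_t i = 0; i < n; i++) {
                int r = sketch_kill_mark(&targets[i]);
                if (r <= 0)
                        continue;  /* already queued or failed: try the next target */

                *delay_reset = true;
                return (int) i;
        }

        return -1;
}
```

With the web server already queued and the batch job not, the sketch
marks the batch job and rearms the delay. A second pass with both
queued leaves the delay untouched.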