mdcheck: simplify start / continue logic and add "--restart" option
The current logic of "mdcheck" is susceptible to races when multiple
mdcheck instances run simultaneously, as checks can be initiated from both
"mdcheck_start.service" and "mdcheck_continue.service".
The previous commit
8aa4ea95db35 ("systemd: start mdcheck_continue.timer
before mdcheck_start.timer") fixed this for the default configuration by
reversing the order of the two timers. But users can customize the timer
settings, which can cause the race to reappear.
This patch avoids this kind of race entirely by changing the logic as
follows (see the shell sketch after the list):
* When `mdcheck` has finished checking a RAID array, it creates a marker
file `/var/lib/mdcheck/Checked_$UUID`.
* A new option `--restart` is introduced. `mdcheck --restart` removes all
`/var/lib/mdcheck/Checked_*` markers.
This is called from `mdcheck_start.service`, which is typically started
by a timer at long intervals (by default once per month).
* `mdcheck --continue` works as it used to. It continues previously started
checks (where the `/var/lib/mdcheck/MD_UUID_$UUID` file exists and
contains a start position for the check).
This usage is *no longer recommended*.
* `mdcheck` with no arguments is like `--continue`, but it also starts new
checks for all arrays for which no check has previously been
started, *except* for arrays for which a marker
`/var/lib/mdcheck/Checked_$UUID` exists.
`mdcheck_continue.service` calls `mdcheck` this way. It is called at
short intervals, by default once per day.
* Combining `--restart` and `--continue` is an error.
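
For illustration, the option handling described above could be sketched
in shell roughly as follows. The helper functions (`get_uuid`,
`resume_check`, `start_check`) and the exact argument parsing are
placeholders, not the actual mdcheck code:

    MDCHECK_DIR=/var/lib/mdcheck

    restart=false
    cont=false
    for arg in "$@"; do
        case $arg in
        --restart)  restart=true ;;
        --continue) cont=true ;;
        esac
    done

    # Combining --restart and --continue is an error.
    if $restart && $cont; then
        echo "mdcheck: --restart and --continue are mutually exclusive" >&2
        exit 1
    fi

    if $restart; then
        # Re-enable checking for all arrays; the next plain "mdcheck" run
        # will start their checks from position 0.
        rm -f "$MDCHECK_DIR"/Checked_*
        exit 0
    fi

    for md in /dev/md?*; do
        [ -b "$md" ] || continue
        uuid=$(get_uuid "$md")
        pos_file=$MDCHECK_DIR/MD_UUID_$uuid
        done_file=$MDCHECK_DIR/Checked_$uuid

        if [ -f "$pos_file" ]; then
            # A previously started check was interrupted: resume it.
            resume_check "$md" "$(cat "$pos_file")"
        elif ! $cont && [ ! -f "$done_file" ]; then
            # Plain "mdcheck" also starts new checks, unless the array
            # was already fully checked in this cycle.
            start_check "$md"
        fi
    done

    # When a check finishes, the position file is replaced by the marker:
    #     rm -f "$pos_file"; touch "$done_file"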
This way, the only systemd service that actually triggers a kernel-level
RAID check is `mdcheck_continue.service`, which avoids races.
When all checks have finished, `mdcheck_continue.service` is a no-op.
When `mdcheck_start.service` runs, the checks are re-enabled and will be
started from position 0 by the next `mdcheck_continue.service` invocation.
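
Assuming the default timer intervals, the resulting invocations over one
check cycle look roughly like this:

    # once per month, from mdcheck_start.service:
    mdcheck --restart   # remove the /var/lib/mdcheck/Checked_* markers

    # once per day, from mdcheck_continue.service:
    mdcheck             # start new checks or resume interrupted ones;
                        # a no-op once every array has its Checked_$UUID marker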
Signed-off-by: Martin Wilck <mwilck@suse.com>