]> git.ipfire.org Git - thirdparty/samba.git/commit
ctdb-daemon: Avoid aborting during early shutdown
authorMartin Schwenke <mschwenke@ddn.com>
Wed, 21 May 2025 12:17:42 +0000 (22:17 +1000)
committerJule Anger <janger@samba.org>
Thu, 5 Jun 2025 10:57:15 +0000 (10:57 +0000)
commitb0a66c42704b51329bee68cf49d9f9ce7de3b1d3
treed705c2e8bbcc0e75b7056f07b34746baa97e89d9
parent6f21f9527d6f1c6046bfc8bb5495eef8edbcfb08
ctdb-daemon: Avoid aborting during early shutdown

An early shutdown can put ctdbd into SHUTDOWN runstate before ctdbd
has completed all early initialisation.  Some of the start-time
transitions then attempt to set the runstate to FIRST_RECOVERY or
RUNNING, which would make the runstate go backwards, so ctdbd aborts.

Upcoming changes cause ctdbd shutdown to take longer, so the problem
will become more likely.  With those changes, this can be
unreliably (50% of the time?)  triggered by:

  ctdb/tests/INTEGRATION/simple/cluster.091.version_check.sh

since it does an early shutdown due to a version mismatch.

Avoid this by noticing when the runstate is SHUTDOWN and refusing to
continue with subsequent early initialisation steps, which aren't
needed when shutting down.

Earlier runstate transitions do not seems likely to cause an abort
during early shutdown.  The following:

  ./tests/local_daemons.sh foo start 0; ./tests/local_daemons.sh foo stop 0

sees ctdbd already into FIRST_RECOVERY before the shutdown is
processed.

The change to ctdb_run_startup() probably isn't strictly necessary.
There will be no abort in this case.  ctdb_shutdown_sequence() will
always run the "shutdown" event and then stop the event daemon, so it
doesn't seem possible that services could be left running.  However,
we might as well avoid running the "startup" event when shutting down,
even if only to avoid confusing logs.

Ultimately, it seems like some redesign would be needed to avoid this
in a more predictable manner, rather than responding when an early
initialisation step inconveniently completes during shutdown.  For
example, hanging a lot of the start-time event handling off a common
talloc context, could allow it to be cancelled with a single
TALLOC_FREE().  However, a change like that would involve a lot of
analysis to ensure that the talloc hierarchy is correct and there is
no change of free'd pointers being dereferenced.  So, we're probably
better off just keeping this issue in mind during a broader redesign.

This workaround appears to be sufficient.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=15858

Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
(cherry picked from commit c03e6b9d50cac67fe33dc6b120996d1915331be6)
ctdb/server/ctdb_monitor.c