From: Martin Schwenke Date: Wed, 21 May 2025 12:17:42 +0000 (+1000) Subject: ctdb-daemon: Avoid aborting during early shutdown X-Git-Tag: samba-4.21.6~9 X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=1e773a73529ab14defa1c9862758e1300e38850e;p=thirdparty%2Fsamba.git ctdb-daemon: Avoid aborting during early shutdown An early shutdown can put ctdbd into SHUTDOWN runstate before ctdbd has completed all early initialisation. Some of the start-time transitions then attempt to set the runstate to FIRST_RECOVERY or RUNNING, which would make the runstate go backwards, so ctdbd aborts. Upcoming changes cause ctdbd shutdown to take longer, so the problem will become more likely. With those changes, this can be unreliably (50% of the time?) triggered by: ctdb/tests/INTEGRATION/simple/cluster.091.version_check.sh since it does an early shutdown due to a version mismatch. Avoid this by noticing when the runstate is SHUTDOWN and refusing to continue with subsequent early initialisation steps, which aren't needed when shutting down. Earlier runstate transitions do not seems likely to cause an abort during early shutdown. The following: ./tests/local_daemons.sh foo start 0; ./tests/local_daemons.sh foo stop 0 sees ctdbd already into FIRST_RECOVERY before the shutdown is processed. The change to ctdb_run_startup() probably isn't strictly necessary. There will be no abort in this case. ctdb_shutdown_sequence() will always run the "shutdown" event and then stop the event daemon, so it doesn't seem possible that services could be left running. However, we might as well avoid running the "startup" event when shutting down, even if only to avoid confusing logs. Ultimately, it seems like some redesign would be needed to avoid this in a more predictable manner, rather than responding when an early initialisation step inconveniently completes during shutdown. For example, hanging a lot of the start-time event handling off a common talloc context, could allow it to be cancelled with a single TALLOC_FREE(). However, a change like that would involve a lot of analysis to ensure that the talloc hierarchy is correct and there is no change of free'd pointers being dereferenced. So, we're probably better off just keeping this issue in mind during a broader redesign. This workaround appears to be sufficient. BUG: https://bugzilla.samba.org/show_bug.cgi?id=15858 Signed-off-by: Martin Schwenke Reviewed-by: Amitay Isaacs (cherry picked from commit c03e6b9d50cac67fe33dc6b120996d1915331be6) --- diff --git a/ctdb/server/ctdb_monitor.c b/ctdb/server/ctdb_monitor.c index ab58ec485fe..869a589e6e5 100644 --- a/ctdb/server/ctdb_monitor.c +++ b/ctdb/server/ctdb_monitor.c @@ -217,6 +217,11 @@ static void ctdb_run_startup(struct tevent_context *ev, */ static void ctdb_startup_callback(struct ctdb_context *ctdb, int status, void *p) { + if (ctdb->runstate == CTDB_RUNSTATE_SHUTDOWN) { + DBG_WARNING("Detected early shutdown, not starting monitoring\n"); + return; + } + if (status != 0) { DEBUG(DEBUG_ERR,("startup event failed\n")); tevent_add_timer(ctdb->ev, ctdb->monitor->monitor_context, @@ -249,6 +254,12 @@ static void ctdb_run_startup(struct tevent_context *ev, struct ctdb_context); int ret; + if (ctdb->runstate == CTDB_RUNSTATE_SHUTDOWN) { + DBG_WARNING( + "Detected early shutdown, not running startup event\n"); + return; + } + /* This is necessary to avoid the "startup" event colliding * with the "ipreallocated" event from the takeover run * following the first recovery. We might as well serialise @@ -432,6 +443,13 @@ void ctdb_stop_monitoring(struct ctdb_context *ctdb) */ void ctdb_wait_for_first_recovery(struct ctdb_context *ctdb) { + if (ctdb->runstate == CTDB_RUNSTATE_SHUTDOWN) { + DBG_WARNING( + "Detected early shutdown, " + "not waiting for first recovery\n"); + return; + } + ctdb_set_runstate(ctdb, CTDB_RUNSTATE_FIRST_RECOVERY); ctdb->monitor = talloc(ctdb, struct ctdb_monitor_state);