From: Willy Tarreau Date: Mon, 27 Feb 2023 17:43:38 +0000 (+0100) Subject: BUG/MEDIUM: fd: make fd_delete() support being called from a different group X-Git-Tag: v2.8-dev5~83 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=061754b249b9903913d6766c1ab31bb393ee5c0d;p=thirdparty%2Fhaproxy.git BUG/MEDIUM: fd: make fd_delete() support being called from a different group There's currently a problem affecting thread groups. Stopping a listener from a different group than the one that runs this listener will trigger the BUG_ON() in fd_delete(). This typically happens by issuing "disable frontend f" on the CLI for the following config since the CLI runs on group 1: global nbthread 2 thread-groups 2 stats socket /tmp/sock1 level admin frontend f mode http bind abns@frt-sock thread 2 This happens because abns sockets cannot be suspended so here this requires a full stop. A first approach would consist in isolating the caller during such rare operations but it turns out that fd_delete() is not robust against even such calling conditions, because it uses its own thread mask with an FD that may be in a different group, and even though the threads would be isolated and running_mask should be zero, we must not mix thread masks from different groups like this. A better solution consists in replacing the bug condition detection with a self-protection. After all it's not trivial to figure all likely call places, and forcing upper layers to protect the code is not clean if we can do it at the bottom. Thus this is what is being done now. We detect a thread group mismatch, and if so, we forcefully isolate ourselves and entirely clean the socket. This has the merit of being much more robust and easier to use (and harder to misuse). Given that such operations are very rare (actually when they happen a crash follows), it's not a problem to waste some time isolating the caller there. This must be backported to 2.7, along with this previous patch: BUG/MINOR: fd: used the update list from the fd's group instead of tgid --- diff --git a/src/fd.c b/src/fd.c index 89fdfb3c65..db866ee4c7 100644 --- a/src/fd.c +++ b/src/fd.c @@ -348,7 +348,11 @@ void _fd_delete_orphan(int fd) } /* Deletes an FD from the fdsets. The file descriptor is also closed, possibly - * asynchronously. Only the owning thread may do this. + * asynchronously. It is safe to call it from another thread from the same + * group as the FD's or from a thread from a different group. However if called + * from a thread from another group, there is an extra cost involved because + * the operation is performed under thread isolation, so doing so must be + * reserved for ultra-rare cases (e.g. stopping a listener). */ void fd_delete(int fd) { @@ -367,8 +371,27 @@ void fd_delete(int fd) /* the tgid cannot change before a complete close so we should never * face the situation where we try to close an fd that was reassigned. + * However there is one corner case where this happens, it's when an + * attempt to pause a listener fails (e.g. abns), leaving the listener + * in fault state and it is forcefully stopped. This needs to be done + * under isolation, and it's quite rare (i.e. once per such FD per + * process). Since we'll be isolated we can clear the thread mask and + * close the FD ourselves. */ - BUG_ON(fd_tgid(fd) != ti->tgid && !thread_isolated() && !(global.mode & MODE_STOPPING)); + if (unlikely(fd_tgid(fd) != ti->tgid)) { + int must_isolate = !thread_isolated() && !(global.mode & MODE_STOPPING); + + if (must_isolate) + thread_isolate(); + + HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0); + HA_ATOMIC_STORE(&fdtab[fd].running_mask, 0); + _fd_delete_orphan(fd); + + if (must_isolate) + thread_release(); + return; + } /* we must postpone removal of an FD that may currently be in use * by another thread. This can happen in the following two situations: