From: Tony Finch <fanf@isc.org>
Date: Thu, 29 Dec 2022 19:18:00 +0000 (+0000)
Subject: QSBR: safe memory reclamation for lock-free data structures
X-Git-Tag: v9.19.11~32^2
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=9b7aa536ba7219d4d113d9cc4030371a88ce35c6;p=thirdparty%2Fbind9.git

QSBR: safe memory reclamation for lock-free data structures

This "quiescent state based reclamation" module provides support for
the qp-trie module in dns/qp. It is a replacement for liburcu, written
without reference to the urcu source code, and in fact it works in a
significantly different way.

A few specifics of BIND make this variant of QSBR somewhat simpler:

  * We can require that wait-free access to a qp-trie only happens in
    an isc_loop callback. The loop provides a natural quiescent state,
    after the callbacks are done, when no qp-trie access occurs.

  * We can dispense with any API like rcu_synchronize(). In practice,
    it takes far too long to wait for a grace period to elapse for each
    write to a data structure.

  * We use the idea of "phases" (aka epochs or eras) from EBR to
    reduce the amount of bookkeeping needed to track memory that is no
    longer needed, knowing that the qp-trie does most of that work
    already.

I considered hazard pointers for safe memory reclamation. They have
more read-side overhead (updating the hazard pointers) and it wasn't
clear to me how to nicely schedule the cleanup work. Another
alternative, epoch-based reclamation, is designed for fine-grained
lock-free updates, so it needs some rethinking to work well with the
heavily read-biased design of the qp-trie. QSBR has the fastest read
side of the basic SMR algorithms (with no barriers), and fits well
into a libuv loop. More recent hybrid SMR algorithms do not appear to
have enough benefits to justify the extra complexity.
---
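As an illustration of the fuzzy barrier described in <isc/qsbr.h>, here
is a minimal standalone sketch. It is not the code added by this commit.
Every `sketch_` / `SKETCH_` name is hypothetical, the memory orderings
are assumptions, and the real module also handles activation flags,
timeouts and loop wakeups, which are left out here. The sketch keeps
only the core idea: the global phase sits in the low bits of one atomic
word, a thread counter sits in the upper bits, and the last thread
through the barrier starts the next grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names; the layout mirrors the ISC_QSBR_GRACE*() macros. */
#define SKETCH_PHASE_BITS 2u
#define SKETCH_ONE_THREAD (1u << SKETCH_PHASE_BITS)
#define SKETCH_PHASE_MAX  (SKETCH_ONE_THREAD - 1u)

#define SKETCH_GRACE_PHASE(grace)   ((grace) & SKETCH_PHASE_MAX)
#define SKETCH_GRACE_THREADS(grace) ((grace) >> SKETCH_PHASE_BITS)
#define SKETCH_GRACE(threads, phase) \
	(((threads) << SKETCH_PHASE_BITS) | (phase))

typedef struct {
	_Atomic uint32_t grace; /* thread counter above, global phase below */
	uint32_t nthreads;      /* threads that must pass each barrier */
} sketch_qsbr_t;

typedef struct {
	uint32_t phase; /* thread-local copy of the phase; 0 = none yet */
} sketch_thread_t;

/* Phases cycle 1,2,3,1,...; zero is reserved for "no phase". */
static uint32_t
sketch_next_phase(uint32_t phase) {
	return (phase == SKETCH_PHASE_MAX) ? 1u : phase + 1u;
}

/*
 * Mark a quiescent state without blocking. Returns the phase whose
 * detached objects may now be reclaimed, or 0 if this thread was not
 * the last one through the barrier.
 */
static uint32_t
sketch_quiescent_state(sketch_qsbr_t *qsbr, sketch_thread_t *thread) {
	uint32_t grace = atomic_load_explicit(&qsbr->grace,
					      memory_order_acquire);
	uint32_t global = SKETCH_GRACE_PHASE(grace);

	if (thread->phase == global) {
		return 0; /* already counted in this grace period */
	}
	thread->phase = global;

	/* Count this thread through; fetch_sub returns the old value. */
	grace = atomic_fetch_sub_explicit(&qsbr->grace, SKETCH_ONE_THREAD,
					  memory_order_acq_rel);
	assert(SKETCH_GRACE_THREADS(grace) >= 1u);
	if (SKETCH_GRACE_THREADS(grace) != 1u) {
		return 0; /* other threads are still to pass */
	}

	/* Last thread: reset the counter and advance the global phase. */
	uint32_t next = sketch_next_phase(global);
	atomic_store_explicit(&qsbr->grace,
			      SKETCH_GRACE(qsbr->nthreads, next),
			      memory_order_release);

	/* The one phase that no thread can occupy is now reclaimable. */
	return sketch_next_phase(next);
}
```

With three phases, threads can be in the new global phase or the one
just ended, so the remaining phase is the only one that is safe to
reclaim. That is why the sketch returns the phase after `next`.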

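The per-phase work lists mentioned in doc/dev/qsbr.md can be pictured
with another standalone sketch. It uses plain C11 atomics rather than
the ISC_ASTACK() macros, and all `sketch_` names are hypothetical. The
assumptions are that each detached object embeds its own link (so no
thunk is allocated) and that the reclaimer takes a whole batch in one
atomic exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PHASES 4 /* slots for phases 1..3; slot 0 is "no phase" */

/* A detached object carries its own link, so no thunk is needed. */
typedef struct sketch_obj {
	struct sketch_obj *next;
	int payload;
} sketch_obj_t;

/* One lock-free batch per phase (static storage starts out NULL). */
static _Atomic(sketch_obj_t *) sketch_batches[SKETCH_PHASES];

/* Writer side: push a detached object onto the batch for `phase`. */
static void
sketch_defer(unsigned int phase, sketch_obj_t *obj) {
	assert(phase > 0 && phase < SKETCH_PHASES);
	obj->next = atomic_load_explicit(&sketch_batches[phase],
					 memory_order_relaxed);
	while (!atomic_compare_exchange_weak_explicit(
		&sketch_batches[phase], &obj->next, obj,
		memory_order_release, memory_order_relaxed))
	{
		/* obj->next now holds the current head; retry */
	}
}

/* Reclaimer side: take the whole batch for `phase` and free it. */
static size_t
sketch_reclaim(unsigned int phase) {
	assert(phase > 0 && phase < SKETCH_PHASES);
	sketch_obj_t *obj = atomic_exchange_explicit(
		&sketch_batches[phase], NULL, memory_order_acquire);
	size_t count = 0;

	while (obj != NULL) {
		sketch_obj_t *next = obj->next;
		free(obj);
		obj = next;
		count++;
	}
	return count;
}
```

Several writes during one grace period land in the same batch, so the
reclaimer callback does one pass per phase instead of one per write.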
diff --git a/CHANGES b/CHANGES
index 2f23b7a7d09..620b8b886d6 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,6 @@
+6109.	[func]		Infrastructure for QSBR, asynchronous safe memory
+			reclamation for lock-free data structures. [GL !7471]
+
 6108.	[func]		Support for simple lock-free singly-linked stacks.
 			[GL !7470]
 
diff --git a/doc/dev/qsbr.md b/doc/dev/qsbr.md
new file mode 100644
index 00000000000..7880a56bd41
--- /dev/null
+++ b/doc/dev/qsbr.md
@@ -0,0 +1,397 @@
+<!--
+Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+
+SPDX-License-Identifier: MPL-2.0
+
+This Source Code Form is subject to the terms of the Mozilla Public
+License, v. 2.0.  If a copy of the MPL was not distributed with this
+file, you can obtain one at https://mozilla.org/MPL/2.0/.
+
+See the COPYRIGHT file distributed with this work for additional
+information regarding copyright ownership.
+-->
+
+QSBR: quiescent state based reclamation
+=======================================
+
+QSBR is a safe memory reclamation (SMR) algorithm for lock-free data
+structures such as a qp-trie. (See `doc/dev/qp.md`.)
+
+When an object is unlinked from a lock-free data structure, it
+cannot be `free()`ed immediately, because there can still be readers
+accessing the object via an old version of the data structure. SMR
+algorithms determine when it is safe to reclaim memory after it has
+been unlinked.
+
+
+Introductions and overviews
+---------------------------
+
+There is a terse overview in `include/isc/qsbr.h`.
+
+Jeff Preshing has a nice introduction to QSBR,
+_<https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/>_
+
+At the end of this note is a copy of a blog post about writing BIND's
+`isc_qsbr`, _<https://dotat.at/@/2023-01-10-qsbr.html>_
+
+[Paul McKenney's web page][paulmck] has links to his book on
+concurrent programming, the [Userspace RCU library][urcu], and more.
+McKenney invented RCU and QSBR. RCU is the Linux kernel's machinery
+for lock-free data structures and safe memory reclamation, based on
+QSBR.
+
+[paulmck]: http://www.rdrop.com/~paulmck/
+[urcu]: https://liburcu.org/
+
+
+Example code
+------------
+
+If you are implementing a lock-free data structure that needs safe
+memory reclamation, here's a guide to using `isc_qsbr`, based on how
+QSBR is used by `dns_qp`.
+
+### registration
+
+When the program starts up you need to register a global callback
+function that will reclaim unused memory. You can do so using an
+ISC_CONSTRUCTOR function that runs automatically at startup.
+
+        static void
+        qp_qsbr_register(void) ISC_CONSTRUCTOR;
+        static void
+        qp_qsbr_register(void) {
+            isc_qsbr_register(qp_qsbr_reclaimer);
+        }
+
+### work list
+
+Your module will need somewhere that your callback can find the work
+it needs to do. The qp-trie has an atomic list of `dns_qpmulti_t`
+objects for this purpose.
+
+        /* a global variable */
+        static ISC_ASTACK(dns_qpmulti_t) qsbr_work;
+
+The reason for using global variables is so that we don't need to
+allocate a thunk every time we have memory reclamation work to do.
+
+### read-only access
+
+You should design your data structure so that it has a single atomic
+root pointer referring to its current version. A lock-free reader
+_must_ run in an `isc_loop` callback. It gains access to the data
+structure by taking a copy of this pointer:
+
+        qp_node_t *reader = atomic_load_acquire(&multi->reader);
+
+During an `isc_loop` callback, a reader should keep using the same
+pointer to get a consistent view of the data structure. If it reloads
+the pointer it can get a different version changed by concurrent
+writers.
+
+A reader _must_ stop using the root pointer and any interior pointers
+obtained via the root pointer before it returns to the `isc_loop`.
+
+### modifications and writes
+
+All changes to the data structure must be copy-on-write (aka
+read-copy-update) so that concurrent readers are not disturbed.
+
+When a new version of the data structure has been prepared, it is
+committed by overwriting the atomic root pointer,
+
+        atomic_store_release(&multi->reader, reader); /* COMMIT */
+
+### scheduling cleanup
+
+After committing a change, your data structure may hold memory that
+can be freed once concurrent readers have stopped accessing it.
+To reclaim the memory when it is safe, use code like:
+
+        isc_qsbr_phase_t phase = isc_qsbr_phase(multi->loopmgr);
+        if (defer_chunk_reclamation(qp, phase)) {
+            ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
+            isc_qsbr_activate(multi->loopmgr, phase);
+        }
+
+  * First, get the current QSBR phase (after the commit, not before).
+
+  * Second, mark free memory with the phase number. The qp-trie scans
+    its chunks and marks those that will become free, and returns
+    `true` if there is cleanup work to do.
+
+  * If so, the qp-trie is added to the work list. (`ISC_ASTACK_ADD()`
+    is idempotent).
+
+  * Finally, QSBR is informed that there is work to do.
+
+In other cases it might not make sense to scan the data structure
+after committing, and instead you might make note of which memory to
+clean up while making changes before you know what the phase will be.
+You can then have per-phase work lists, like:
+
+        static ISC_ASTACK(my_work_t) qsbr_work[ISC_QSBR_PHASES];
+
+        isc_qsbr_phase_t phase = isc_qsbr_phase(loopmgr);
+        ISC_ASTACK_ADD(qsbr_work[phase], cleanup_work, link);
+        isc_qsbr_activate(loopmgr, phase);
+
+In general, there will be several (maybe many) write operations during
+a grace period. Your lock-free data structure should collect its
+reclamation work from all these writes into a batch per phase, i.e.
+per grace period.
+
+### reclaiming
+
+Inside the reclaimer callback, we iterate over the work list and clean
+up each item on it. If there is more cleanup work to do in another
+phase, we put the qp-trie back on the work list for another go.
+
+        static void
+        qp_qsbr_reclaimer(isc_qsbr_phase_t phase) {
+            /* drain the global work list, then process each item */
+
+            ISC_STACK(dns_qpmulti_t) drain = ISC_ASTACK_TO_STACK(qsbr_work);
+            while (!ISC_STACK_EMPTY(drain)) {
+                dns_qpmulti_t *multi = ISC_STACK_POP(drain, cleanup);
+                INSIST(QPMULTI_VALID(multi));
+                LOCK(&multi->mutex);
+                if (reclaim_chunks(&multi->writer, phase)) {
+                    /* more to do next time */
+                    ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
+                }
+                UNLOCK(&multi->mutex);
+            }
+        }
+
+### reclaim marks
+
+In the qp-trie data structure, each chunk has some metadata which
+includes a bitfield for the reclaim phase:
+
+        isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS;
+
+We use a bitfield so that all the metadata fits in a single word.
+
+
+------------------------------------------------------------------------
+
+Safe memory reclamation for BIND
+================================
+
+At the end of October 2022, I _finally_ got [my multithreaded
+qp-trie][qp-gc] working! It could be built with two different
+concurrency control mechanisms:
+
+  * A reader/writer lock
+
+    This has poor read-side scalability, because every thread is
+    hammering on the same shared location. But its write performance
+    is reasonably good: concurrent readers don't slow it down too much.
+
+  * [`liburcu`, userland read-copy-update][urcu]
+
+    RCU has a fast and scalable read side, nice! But on the write side
+    I used `synchronize_rcu()`, which is blocking and rather slow, so
+    my write performance was terrible.
+
+OK, but I want the best of both worlds! To fix it, I needed to change
+the qp-trie code to use safe memory reclamation more effectively:
+instead of blocking inside `synchronize_rcu()` before cleaning up, use
+`call_rcu()` to clean up asynchronously. I expect I'll write about the
+qp-trie changes another time.
+
+Another issue is that I want the best of both worlds _by default_,
+but `liburcu` is [LGPL][] and we don't want BIND to depend on
+code whose licence demands more from our users than the [MPL][].
+
+[qp-gc]: https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
+[LGPL]: https://opensource.org/licenses/LGPL-2.1
+[MPL]: https://opensource.org/licenses/MPL-2.0
+
+So I set out to write my own safe memory reclamation support code.
+
+
+lock freedom
+------------
+
+In a [multithreaded qp-trie][qp-gc], there can be many concurrent
+readers, but there can be only one writer at a time and modifications
+are strictly serialized. When I have got it working properly, readers
+are completely wait-free, unaffected by other readers, and almost
+unaffected by writers. Writers need to get a mutex to ensure there is
+only one at a time, but once the mutex is acquired, a writer is not
+obstructed by readers.
+
+The way this works is that readers use an atomic load to get a pointer
+to the root of the current version of the trie. Readers can make
+multiple queries using this root pointer and the results will be
+consistent wrt that particular version, regardless of what changes
+writers might be making concurrently. Writers do not affect readers
+because all changes are made by copy-on-write. When a writer is ready
+to commit a new version of the trie, it uses an atomic store to flip
+the root pointer.
+
+
+safe memory reclamation
+-----------------------
+
+We can't copy-on-write indefinitely: we need to reclaim the memory
+used by old versions of the trie. And we must do so "safely", i.e.
+without `free()`ing memory that readers are still using.
+
+So, before `free()`ing memory, a writer must wait for a _"grace
+period"_, which is a jargon term meaning "until readers are not using
+the old version". There are a bunch of algorithms for determining when
+a grace period is over, with varying amounts of over-approximation,
+CPU overhead, and memory backlog.
+
+The [RCU][urcu] function `synchronize_rcu()` is slow because it blocks
+waiting for a grace period; the `call_rcu()` function runs a callback
+asynchronously after a grace period has passed. I wanted to avoid
+blocking my writers, so I needed to implement something like
+`call_rcu()`.
+
+
+aversions
+---------
+
+When I started trying to work out how to do safe memory reclamation,
+it all seemed quite intimidating. But as I learned more, I found that
+my circumstances make it easier than it appeared at first.
+
+The [`liburcu`][urcu] homepage has a long list of supported CPU
+architectures and operating systems. Do I have to care about those
+details too? No! The RCU code dates back to before the age of
+standardized concurrent memory models, so the RCU developers had to
+invent their own atomic primitives and correctness rules. Twenty-ish
+years later the state of the art has advanced, so I can use
+`<stdatomic.h>` without having to re-do it like `liburcu`.
+
+You can also choose between several algorithms implemented by
+[`liburcu`][urcu], involving questions about kernel support, specially
+reserved signals, and intrusiveness in application code. But while I
+was working out how to schedule asynchronous memory reclamation work,
+I realised that BIND is already well-suited to the fastest flavour of
+RCU, called "QSBR".
+
+
+QSBR
+----
+
+QSBR stands for "quiescent state based reclamation". A _"quiescent
+state"_ is a fancy name for a point when a thread is not accessing a
+lock-free data structure, and does not retain any root pointers or
+interior pointers.
+
+When a thread has passed through a quiescent state, it no longer has
+access to older versions of the data structures. When _all_ threads
+have passed through quiescent states, then nothing in the program has
+access to old versions. This is how QSBR detects grace periods: after
+a writer commits a new version, it waits for all threads to pass
+through quiescent states, and therefore a grace period has definitely
+elapsed, and so it is then safe to reclaim the old version's memory.
+
+QSBR is fast because readers do not need to explicitly mark the
+critical section surrounding the atomic load that I mentioned earlier.
+Threads just need to pass through a quiescent state frequently enough
+that there isn't a huge build-up of unreclaimed memory.
+
+Inside an operating system kernel (RCU's native environment), a
+context switch provides a natural quiescent state. In a userland
+application, you need to find a good place to call
+`rcu_quiescent_state()`. You could call it every time you have
+finished using a root pointer, but marking a quiescent state is not
+completely free, so there are probably more efficient ways.
+
+
+`libuv`
+-------
+
+BIND is multithreaded, and (basically) each thread runs an event loop.
+Recent versions of BIND use [`libuv`][uv] for the event loops.
+
+A lot of things started falling into place when I realised that the
+`libuv` event loop gives BIND a [natural quiescent state][uv-loop]:
+when the event callbacks have finished running, and `libuv` is about
+to call `select()` or `poll()` or whatever, we can mark a quiescent
+state. We can require that event-handling functions do not stash root
+pointers in the heap, but only use them via local variables, so we
+know that old versions are inaccessible after the callback returns.
+
+My design marks a quiescent state once per loop, so on a busy server
+where each loop has lots to do, the cost of marking a quiescent state
+is amortized across several I/O events.
+
+[uv]: http://libuv.org/
+[uv-loop]: http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop
+
+
+fuzzy barrier
+-------------
+
+So, how do we mark a quiescent state? Using a _"fuzzy barrier"_.
+
+When a thread reaches a normal barrier, it blocks until all the other
+threads have reached the barrier, after which exactly one of the
+threads can enter a protected section of code, and the others are
+unblocked and can proceed as normal.
+
+When a thread encounters a fuzzy barrier, it never blocks. It either
+proceeds immediately as normal, or if it is the last thread to reach
+the barrier, it enters the protected code.
+
+RCU does not actually use a fuzzy barrier as I have described it. Like
+a fuzzy barrier, each thread keeps track of whether it has passed
+through a quiescent state in the current grace period, without
+blocking; but unlike a fuzzy barrier, no thread is diverted to the
+protected code. Instead, code that wants to enter a protected section
+uses the blocking `synchronize_rcu()` function.
+
+
+EBR-ish
+-------
+
+As in the paper ["performance of memory reclamation for lockless
+synchronization"][HMBW], my implementation of QSBR uses a fuzzy
+barrier designed for another safe memory reclamation algorithm, EBR,
+epoch based reclamation. (EBR was invented here in Cambridge by [Keir
+Fraser][tr579].)
+
+[HMBW]: http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf
+[tr579]: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.html
+
+Actually, my fuzzy barrier is slightly different to EBR's. In EBR, the
+fuzzy barrier is used every time the program enters a critical
+section. (In qp-trie terms, that would be every time a reader fetches
+a root pointer.) So it is vital that EBR's barrier avoids mutating
+shared state, because that would wreck multithreaded performance.
+
+Because BIND will only pass through the fuzzy barrier when it is about
+to use a blocking system call, my version mutates shared state more
+frequently (typically, once per CPU per grace period, instead of once
+per grace period). If this turns out to be a problem, it won't be too
+hard to make it work more like EBR.
+
+More trivially, I'm using the term "phase" instead of "epoch", because
+it's nothing to do with the unix epoch, because there are three
+phases, and because I can talk about phase transitions and threads
+being out of phase with each other.
+
+
+coda
+----
+
+While reading various RCU-related papers, I was amused by ["user-level
+implementations of read-copy update"][DMSDW], which says:
+
+> BIND, a major domain-name server used for Internet domain-name
+> resolution, is facing scalability issues. Since domain names
+> are read often but rarely updated, using user-level RCU might be
+> beneficial.
+
+Yes, I think it might :-)
+
+[DMSDW]: https://www.efficios.com/publications/
diff --git a/lib/isc/Makefile.am b/lib/isc/Makefile.am
index 6689e45c08e..8e88e693110 100644
--- a/lib/isc/Makefile.am
+++ b/lib/isc/Makefile.am
@@ -65,6 +65,7 @@ libisc_la_HEADERS =			\
 	include/isc/pause.h		\
 	include/isc/portset.h		\
 	include/isc/quota.h		\
+	include/isc/qsbr.h		\
 	include/isc/radix.h		\
 	include/isc/random.h		\
 	include/isc/ratelimiter.h	\
@@ -170,6 +171,7 @@ libisc_la_SOURCES =		\
 	picohttpparser.h	\
 	portset.c		\
 	quota.c			\
+	qsbr.c			\
 	radix.c			\
 	random.c		\
 	random_p.h		\
diff --git a/lib/isc/include/isc/loop.h b/lib/isc/include/isc/loop.h
index e0729f52759..2a85702c2ad 100644
--- a/lib/isc/include/isc/loop.h
+++ b/lib/isc/include/isc/loop.h
@@ -68,6 +68,17 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr);
  *\li	'loopmgr' is a valid loop manager.
  */
 
+void
+isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr);
+/*%<
+ * Send no-op events to wake up all running loops in 'loopmgr' except
+ * the current one. (See <isc/qsbr.h>.)
+ *
+ * Requires:
+ *\li  'loopmgr' is a valid loop manager.
+ *\li  We are in a running loop.
+ */
+
 void
 isc_loopmgr_pause(isc_loopmgr_t *loopmgr);
 /*%<
diff --git a/lib/isc/include/isc/qsbr.h b/lib/isc/include/isc/qsbr.h
new file mode 100644
index 00000000000..242c6ac45bb
--- /dev/null
+++ b/lib/isc/include/isc/qsbr.h
@@ -0,0 +1,282 @@
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+#pragma once
+
+#include <isc/atomic.h>
+#include <isc/stack.h>
+#include <isc/types.h>
+#include <isc/uv.h>
+
+/*
+ * Quiescent state based reclamation
+ * =================================
+ *
+ * QSBR is a safe memory reclamation algorithm for lock-free data
+ * structures such as a qp-trie.
+ *
+ * When an object is unlinked from a lock-free data structure, it
+ * cannot be free()d immediately, because there can still be readers
+ * accessing the object via an old version of the data structure. SMR
+ * algorithms determine when it is safe to reclaim memory after it has
+ * been unlinked.
+ *
+ * With QSBR, reading a data structure is wait-free. All that is
+ * required is an atomic load to get the data structure's current
+ * root; there is no need to explicitly mark any read-side critical
+ * section.
+ *
+ * QSBR is used by RCU (read-copy-update) in the Linux kernel. BIND's
+ * implementation also uses some ideas from EBR (epoch-based reclamation).
+ * The following summary is based on the overview in the paper
+ * "performance of memory reclamation for lockless synchronization",
+ * (http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf).
+ *
+ * Aside: This QSBR implementation is somewhat different from the one
+ * in liburcu, described in the paper "user-level implementations of
+ * read-copy update", (https://www.efficios.com/publications/), which
+ * contains the amusing comment:
+ *
+ *	BIND, a major domain-name server used for Internet domain-name
+ *	resolution, is facing scalability issues. Since domain names
+ *	are read often but rarely updated, using user-level RCU might
+ *	be beneficial.
+ *
+ * A "quiescent state" is a point when a thread is not accessing any
+ * lock-free data structure. After passing through a quiescent state,
+ * a thread can no longer access versions of a data structure that
+ * were replaced before that point. In BIND, we use a point in the
+ * event loop (a uv_prepare_t callback) to identify a quiescent state.
+ *
+ * Aside: a prepare handle runs its callbacks before the loop sleeps,
+ * which reduces reclaim latency (unlike a check handle) and it does
+ * not affect timeout calculations (unlike an idle handle).
+ *
+ * A "grace period" is any time interval such that after the end of
+ * the grace period, all objects removed before the start of the grace
+ * period can safely be reclaimed. Different SMR algorithms detect
+ * grace periods with varying degrees of tightness or looseness.
+ *
+ * QSBR uses quiescent states to detect grace periods: a grace period
+ * is a time interval in which every thread passes through a quiescent
+ * state. (This is a safe over-estimate.) A "fuzzy barrier" is used to
+ * find out when all threads have passed through a quiescent state.
+ *
+ * NOTE: In BIND this means that code which is not running in an event
+ * loop thread (such as an isc_work / uv_work_t callback) must use
+ * locking (not lock-free) data structure accessors.
+ *
+ * Because a quiescent state happens once per event loop, a grace
+ * period takes roughly the same amount of time as the slowest event
+ * loop in each cycle.
+ *
+ * Similar to the paper linked above, this QSBR implementation uses a
+ * variant of the EBR fuzzy barrier. Like EBR, each grace period is
+ * numbered with a "phase", which cycles round 1,2,3,1,2,3,... (Phases
+ * are called epochs in EBR, but I think "phase" is a better metaphor.)
+ * When entering the fuzzy barrier, each thread updates its local phase
+ * to match the global phase, keeping a global count of the number of
+ * threads still to pass. When this count reaches zero, it is the end of
+ * the grace period; the global phase is updated and reclamation is
+ * triggered.
+ *
+ * Note that threads are usually slightly out-of-phase wrt the global
+ * grace period. At any particular point in time, there will be some
+ * threads in the current global phase, and some in the previous
+ * global phase. EBR has three phases because that is the minimum
+ * number that leaves one phase unoccupied by readers. Any objects that
+ * were detached from the data structure in the third phase can be
+ * reclaimed after the start of the current phase, because a grace
+ * period (the previous phase) has elapsed since the objects were
+ * detached.
+ *
+ * A phase number can be used by a lock-free data structure (such as a
+ * qp-trie) to record when an object was detached. QSBR calls the data
+ * structure's reclaimer function, passing a phase number indicating
+ * that objects detached in that phase can now be reclaimed.
+ *
+ * In general, there will be several (maybe many) write operations
+ * during a grace period. The lock-free data structures that use QSBR
+ * will collect their reclamation work from all these writes into a
+ * batch per phase, i.e. per grace period.
+ *
+ * There is some example code in `doc/dev/qsbr.md`, with pointers to
+ * less terse introductions to QSBR and other overview material.
+ */
+
+#define ISC_QSBR_PHASE_BITS 2
+
+typedef unsigned int isc_qsbr_phase_t;
+/*%<
+ * A grace period phase number. It can be stored in a bitfield of size
+ * ISC_QSBR_PHASE_BITS. You can use zero to indicate "no phase".
+ * (Don't assume the maximum is three: We might want to increase the
+ * number of phases so that there is more than one unoccupied phase.
+ * This would allow concurrent reclamation of objects released in
+ * multiple unoccupied phases.)
+ */
+
+typedef void
+isc_qsbreclaimer_t(isc_qsbr_phase_t phase);
+/*%<
+ * The type of memory reclaimer callback functions.
+ *
+ * The `phase` identifies which objects are to be reclaimed.
+ *
+ * An isc_qsbreclaimer_t can call isc_qsbr_activate() if it could not
+ * reclaim everything and needs to be called again.
+ */
+
+typedef struct isc_qsbr_registered {
+	ISC_SLINK(struct isc_qsbr_registered) link;
+	isc_qsbreclaimer_t *func;
+} isc_qsbr_registered_t;
+/*%<
+ * Each reclaimer callback has a static `isc_qsbr_registered_t` object
+ * so that QSBR can find it.
+ */
+
+void
+isc__qsbr_register(isc_qsbr_registered_t *reg);
+/*%<
+ * Requires:
+ * \li	reg->link is not linked
+ * \li	reg->func is not NULL
+ */
+
+#define isc_qsbr_register(cb)                                 \
+	do {                                                  \
+		static isc_qsbr_registered_t registration = { \
+			.link = ISC_SLINK_INITIALIZER,        \
+			.func = cb,                           \
+		};                                            \
+		isc__qsbr_register(&registration);            \
+	} while (0)
+/*%<
+ * Register a callback function with QSBR. This macro should be used
+ * inside an `ISC_CONSTRUCTOR` function. There should be one callback
+ * for each lock-free data structure implementation, which is able to
+ * reclaim all the unused memory across all instances of its data
+ * structure.
+ */
+
+isc_qsbr_phase_t
+isc_qsbr_phase(isc_loopmgr_t *loopmgr);
+/*%<
+ * Get the current phase, to use for marking detached objects.
+ *
+ * To commit a write that requires cleanup, the ordering must be:
+ *
+ * - Use atomic_store_release() to commit the data structure's new
+ *   root pointer; release ordering ensures that the interior changes
+ *   are written before the root pointer.
+ *
+ * - Call isc_qsbr_phase() to get the phase to be used for marking
+ *   objects to reclaim. This must happen after the commit, to ensure
+ *   there is at least one grace period between commit and cleanup.
+ *
+ * - Pass the same phase to isc_qsbr_activate() so that the reclaimer
+ *   will be called after a grace period has passed.
+ */
+
+void
+isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase);
+/*%<
+ * Tell QSBR that objects have been detached and will need reclaiming
+ * after a grace period.
+ */
+
+/***********************************************************************
+ *
+ *  private parts
+ */
+
+/*
+ * Accessors and constructors for the `grace` variable.
+ * It contains two bit fields:
+ *
+ *   - the global phase in the lower ISC_QSBR_PHASE_BITS
+ *
+ *   - a thread counter in the upper bits
+ */
+
+#define ISC_QSBR_ONE_THREAD (1 << ISC_QSBR_PHASE_BITS)
+#define ISC_QSBR_PHASE_MAX  (ISC_QSBR_ONE_THREAD - 1)
+
+#define ISC_QSBR_GRACE_PHASE(grace)   ((grace) & ISC_QSBR_PHASE_MAX)
+#define ISC_QSBR_GRACE_THREADS(grace) ((grace) >> ISC_QSBR_PHASE_BITS)
+#define ISC_QSBR_GRACE(threads, phase) \
+	(((threads) << ISC_QSBR_PHASE_BITS) | (phase))
+
+typedef struct isc_qsbr {
+	/*
+	 * The `grace` variable keeps track of the current grace period.
+	 * When the phase changes, the thread counter is set to the number of
+	 * threads that need to observe the new phase before the grace period
+	 * can end.
+	 *
+	 * The thread counter is an add-on to the usual EBR fuzzy barrier.
+	 * Counting threads through the barrier adds multi-thread update
+	 * contention, and in EBR the fuzzy barrier runs frequently enough
+	 * (on every access) that it's important to minimize its cost. With
+	 * QSBR, the fuzzy barrier runs less frequently (roughly, per loop,
+	 * instead of per-callback) so contention is less of a concern. The
+	 * thread counter helps to reduce reclaim latency, because unlike EBR
+	 * we don't probabilistically check, we know deterministically when
+	 * all threads have changed phase.
+	 */
+	atomic_uint_fast32_t grace;
+
+	/*
+	 * A flag for each phase indicating that there will be work to
+	 * do, so we don't invoke the reclaim machinery unnecessarily.
+	 * Set by `isc_qsbr_activate()` and cleared before the reclaimer
+	 * functions are invoked (so they can re-set their flag if
+	 * necessary).
+	 */
+	atomic_uint_fast32_t activated;
+
+	/*
+	 * The time of the last phase transition (isc_nanosecs_t). Used
+	 * to ensure that grace periods do not last forever. We use
+	 * `isc_time_monotonic()` because we need the same time in all
+	 * threads. (`uv_now()` is different in different threads.)
+	 */
+	atomic_uint_fast64_t transition_time;
+
+} isc_qsbr_t;
+
+/*
+ * When we start there is no worker thread yet, so the thread
+ * count is equal to the number of loops. The global phase starts
+ * off at one (it must always be nonzero).
+ */
+#define ISC_QSBR_INITIALIZER(nloops)                     \
+	(isc_qsbr_t) {                                   \
+		.grace = ISC_QSBR_GRACE(nloops, 1),      \
+		.transition_time = isc_time_monotonic(), \
+	}
+
+/*
+ * For use by tests that need to explicitly drive QSBR phase transitions.
+ */
+void
+isc__qsbr_quiescent_state(isc_loop_t *loop);
+
+/*
+ * Used by the loopmgr
+ */
+void
+isc__qsbr_quiescent_cb(uv_prepare_t *handle);
+void
+isc__qsbr_destroy(isc_loopmgr_t *loopmgr);
diff --git a/lib/isc/loop.c b/lib/isc/loop.c
index 81578b8dd80..61d28c3ba16 100644
--- a/lib/isc/loop.c
+++ b/lib/isc/loop.c
@@ -26,12 +26,14 @@
 #include <isc/magic.h>
 #include <isc/mem.h>
 #include <isc/mutex.h>
+#include <isc/qsbr.h>
 #include <isc/refcount.h>
 #include <isc/result.h>
 #include <isc/signal.h>
 #include <isc/strerr.h>
 #include <isc/thread.h>
 #include <isc/tid.h>
+#include <isc/time.h>
 #include <isc/util.h>
 #include <isc/uv.h>
 #include <isc/work.h>
@@ -64,8 +66,6 @@ isc_loopmgr_shutdown(isc_loopmgr_t *loopmgr) {
 		isc_loop_t *loop = &loopmgr->loops[i];
 		int r;
 
-		REQUIRE(!atomic_load(&loop->finished));
-
 		r = uv_async_send(&loop->shutdown_trigger);
 		UV_RUNTIME_CHECK(uv_async_send, r);
 	}
@@ -143,6 +143,8 @@ destroy_cb(uv_async_t *handle) {
 	uv_close(&loop->destroy_trigger, NULL);
 	uv_close(&loop->queue_trigger, NULL);
 	uv_close(&loop->pause_trigger, NULL);
+	uv_close(&loop->wakeup_trigger, NULL);
+	uv_close(&loop->quiescent, NULL);
 
 	uv_walk(&loop->loop, loop_walk_cb, (char *)"destroy_cb");
 }
@@ -153,6 +155,8 @@ shutdown_cb(uv_async_t *handle) {
 	isc_loop_t *loop = uv_handle_get_data(handle);
 	isc_loopmgr_t *loopmgr = loop->loopmgr;
 
+	loop->shuttingdown = true;
+
 	/* Make sure, we can't be called again */
 	uv_close(&loop->shutdown_trigger, shutdown_trigger_close_cb);
 
@@ -194,6 +198,12 @@ queue_cb(uv_async_t *handle) {
 	}
 }
 
+static void
+wakeup_cb(uv_async_t *handle) {
+	/* we only woke up to make the loop take a spin */
+	UNUSED(handle);
+}
+
 static void
 loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) {
 	*loop = (isc_loop_t){
@@ -223,6 +233,13 @@ loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) {
 	UV_RUNTIME_CHECK(uv_async_init, r);
 	uv_handle_set_data(&loop->destroy_trigger, loop);
 
+	r = uv_async_init(&loop->loop, &loop->wakeup_trigger, wakeup_cb);
+	UV_RUNTIME_CHECK(uv_async_init, r);
+
+	r = uv_prepare_init(&loop->loop, &loop->quiescent);
+	UV_RUNTIME_CHECK(uv_prepare_init, r);
+	uv_handle_set_data(&loop->quiescent, loop);
+
 	char name[16];
 	snprintf(name, sizeof(name), "loop-%08" PRIx32, tid);
 	isc_mem_create(&loop->mctx);
@@ -248,6 +265,9 @@ loop_run(isc_loop_t *loop) {
 		job = next;
 	}
 
+	r = uv_prepare_start(&loop->quiescent, isc__qsbr_quiescent_cb);
+	UV_RUNTIME_CHECK(uv_prepare_start, r);
+
 	isc_barrier_wait(&loop->loopmgr->starting);
 
 	r = uv_run(&loop->loop, UV_RUN_DEFAULT);
@@ -330,6 +350,7 @@ isc_loopmgr_create(isc_mem_t *mctx, uint32_t nloops, isc_loopmgr_t **loopmgrp) {
 	loopmgr = isc_mem_get(mctx, sizeof(*loopmgr));
 	*loopmgr = (isc_loopmgr_t){
 		.nloops = nloops,
+		.qsbr = ISC_QSBR_INITIALIZER(nloops),
 	};
 
 	isc_mem_attach(mctx, &loopmgr->mctx);
@@ -463,6 +484,22 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr) {
 	loop_thread(&loopmgr->loops[0]);
 }
 
+void
+isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr) {
+	REQUIRE(VALID_LOOPMGR(loopmgr));
+
+	for (size_t i = 0; i < loopmgr->nloops; i++) {
+		isc_loop_t *loop = &loopmgr->loops[i];
+
+		/* Skip current loop */
+		if (i == isc_tid()) {
+			continue;
+		}
+
+		uv_async_send(&loop->wakeup_trigger);
+	}
+}
+
 void
 isc_loopmgr_pause(isc_loopmgr_t *loopmgr) {
 	REQUIRE(VALID_LOOPMGR(loopmgr));
@@ -481,7 +518,6 @@ isc_loopmgr_pause(isc_loopmgr_t *loopmgr) {
 			continue;
 		}
 
-		REQUIRE(!atomic_load(&loop->finished));
 		uv_async_send(&loop->pause_trigger);
 	}
 
diff --git a/lib/isc/loop_p.h b/lib/isc/loop_p.h
index e9e2fb58839..667f8510974 100644
--- a/lib/isc/loop_p.h
+++ b/lib/isc/loop_p.h
@@ -20,6 +20,7 @@
 #include <isc/loop.h>
 #include <isc/magic.h>
 #include <isc/mem.h>
+#include <isc/qsbr.h>
 #include <isc/refcount.h>
 #include <isc/result.h>
 #include <isc/signal.h>
@@ -52,7 +53,6 @@ struct isc_loop {
 
 	/* states */
 	bool paused;
-	atomic_bool finished;
 	bool shuttingdown;
 
 	/* Async queue */
@@ -69,6 +69,11 @@ struct isc_loop {
 
 	/* Destroy */
 	uv_async_t destroy_trigger;
+
+	/* safe memory reclamation */
+	uv_async_t wakeup_trigger;
+	uv_prepare_t quiescent;
+	isc_qsbr_phase_t qsbr_phase;
 };
 
 /*
@@ -103,6 +108,9 @@ struct isc_loopmgr {
 
 	/* per-thread objects */
 	isc_loop_t *loops;
+
+	/* safe memory reclamation */
+	isc_qsbr_t qsbr;
 };
 
 /*
diff --git a/lib/isc/qsbr.c b/lib/isc/qsbr.c
new file mode 100644
index 00000000000..c122770c143
--- /dev/null
+++ b/lib/isc/qsbr.c
@@ -0,0 +1,393 @@
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+#include <isc/atomic.h>
+#include <isc/log.h>
+#include <isc/loop.h>
+#include <isc/qsbr.h>
+#include <isc/stack.h>
+#include <isc/tid.h>
+#include <isc/time.h>
+#include <isc/types.h>
+#include <isc/uv.h>
+
+#include "loop_p.h"
+
+#define MAX_GRACE_PERIOD_NS (53 * NS_PER_MS)
+
+#if 0
+#define TRACE(fmt, ...)                                                       \
+	isc_log_write(isc_lctx, ISC_LOGCATEGORY_GENERAL, ISC_LOGMODULE_OTHER, \
+		      ISC_LOG_DEBUG(7), "%s:%u:%s():t%u: " fmt, __FILE__,     \
+		      __LINE__, __func__, isc_tid(), ##__VA_ARGS__)
+#else
+#define TRACE(...)
+#endif
+
+static ISC_STACK(isc_qsbr_registered_t) qsbreclaimers = ISC_STACK_INITIALIZER;
+
+static void
+reclaim_cb(void *arg);
+static void
+reclaimed_cb(void *arg);
+
+/**********************************************************************/
+
+/*
+ * 3,2,1,3,2,1,...
+ */
+static isc_qsbr_phase_t
+change_phase(isc_qsbr_phase_t phase) {
+	return (--phase > 0 ? phase : ISC_QSBR_PHASE_MAX);
+}
+
+/*
+ * For marking or checking that a phase has cleanup work to do.
+ */
+static unsigned int
+active_bit(isc_qsbr_phase_t phase) {
+	return (1U << phase);
+}
+
+/*
+ * Extract the global phase from the grace period state.
+ */
+static isc_qsbr_phase_t
+global_phase(isc_qsbr_t *qsbr, memory_order m_o) {
+	uint32_t grace = atomic_load_explicit(&qsbr->grace, m_o);
+	return (ISC_QSBR_GRACE_PHASE(grace));
+}
+
+/*
+ * Record that the current thread has passed the barrier.
+ * Returns true if more threads still need to pass.
+ *
+ * ATOMIC: acquire-release, to ensure that this is not reordered wrt
+ * read-only accesses to lock-free data structures. This implements the
+ * ordering requirements of a quiescent state.
+ */
+static bool
+fuzzy_barrier_not_yet(isc_qsbr_t *qsbr) {
+	uint32_t grace = atomic_fetch_sub_acq_rel(&qsbr->grace,
+						  ISC_QSBR_ONE_THREAD);
+	uint32_t threads = ISC_QSBR_GRACE_THREADS(grace);
+	return (threads > 1);
+}
+
+/*
+ * Ungracefully drive all cleanup work to completion.
+ *
+ * ATOMIC: everything is relaxed, because we assume that concurrent
+ * readers have already finished. `reclaim_cb()` uses the `activated`
+ * flags to ensure it is OK that threads will race to complete the
+ * cleanup.
+ */
+static void
+qsbr_shutdown(isc_loopmgr_t *loopmgr) {
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
+	uint32_t threads = isc_loopmgr_nloops(loopmgr);
+	uint32_t grace;
+
+	while (atomic_load_relaxed(&qsbr->activated) != 0) {
+		reclaim_cb(loopmgr);
+		phase = change_phase(phase);
+		grace = ISC_QSBR_GRACE(threads, phase);
+		atomic_store_relaxed(&qsbr->grace, grace);
+	}
+}
+
+/*
+ * On a quiet server that does not have enough network traffic to keep
+ * all its threads spinning, grace periods might extend indefinitely.
+ * So check if we have been waiting an unreasonably long time since
+ * the last phase change. If so, send a no-op async request to every
+ * thread to make them all cycle through a quiescent state.
+ */
+static void
+maybe_wakeup(isc_loop_t *loop) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * ATOMIC: relaxed is OK here because we don't use any values guarded
+	 * by the `activated` flags.
+	 */
+	if (atomic_load_relaxed(&qsbr->activated) == 0) {
+		return;
+	}
+	if (loop->shuttingdown) {
+		qsbr_shutdown(loopmgr);
+		return;
+	}
+
+	/*
+	 * ATOMIC: relaxed, because the `transition_time` doesn't guard any
+	 * other values, just the isc_loopmgr_wakeup() call below.
+	 */
+	atomic_uint_fast64_t *qsbr_ttp = &qsbr->transition_time;
+	isc_nanosecs_t now = isc_time_monotonic();
+	isc_nanosecs_t start = atomic_load_relaxed(qsbr_ttp);
+	if (now < start + MAX_GRACE_PERIOD_NS) {
+		return;
+	}
+
+	/*
+	 * To stop other threads from also invoking `isc_loopmgr_wakeup()`,
+	 * we try to push the timer into the future (expecting that it will
+	 * not trigger again), and quit if someone else got there first.
+	 * ATOMIC: relaxed, as before; strong, because there is no retry loop.
+	 */
+	if (!atomic_compare_exchange_strong_relaxed(qsbr_ttp, &start, now)) {
+		return;
+	}
+
+	TRACE("long grace period of %llu ns, waking up other threads",
+	      (unsigned long long)(now - start));
+
+	isc_loopmgr_wakeup(loopmgr);
+}
+
+/*
+ * Callers use the fuzzy barrier to ensure only one thread can enter
+ * this function at a time.
+ *
+ * Phase transitions happen at roughly the same frequency that IO
+ * event loops cycle, limited by the slowest loop in each cycle.
+ */
+static void
+phase_transition(isc_loop_t *loop, isc_qsbr_phase_t current_phase) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	if (loop->shuttingdown) {
+		qsbr_shutdown(loopmgr);
+		return;
+	}
+
+	/*
+	 * After we change phase, threads will be in either the `current_phase`
+	 * or the `next_phase`. We will reclaim memory from the `third_phase`.
+	 *
+	 * ATOMIC: relaxed is OK here because the necessary synchronization
+	 * happens in `reclaim_cb()`.
+	 */
+	isc_qsbr_phase_t next_phase = change_phase(current_phase);
+	isc_qsbr_phase_t third_phase = change_phase(next_phase);
+	bool activated = atomic_load_relaxed(&qsbr->activated) &
+			 active_bit(third_phase);
+
+	/*
+	 * Reset the wakeup timer, and log the length of the grace period.
+	 * ATOMIC: relaxed, per the commentary in `maybe_wakeup()`.
+	 */
+	atomic_uint_fast64_t *qsbr_tt = &qsbr->transition_time;
+	isc_nanosecs_t now = isc_time_monotonic();
+	isc_nanosecs_t start = atomic_exchange_relaxed(qsbr_tt, now);
+	TRACE("phase %u -> %u after grace period of %f ms", current_phase,
+	      next_phase, (double)(now - start) / NS_PER_MS);
+	UNUSED(start); /* only used when TRACE() is enabled */
+
+	/*
+	 * Work out the threads counter for this grace period.
+	 *
+	 * We need to add one for any reclamation worker thread, to
+	 * prevent us from changing phase before the work is done. If
+	 * we change too early, any newly detached objects will be
+	 * marked with the same phase as the running reclaimer, which
+	 * might lead to them being free()d too soon.
+	 */
+	uint32_t threads = isc_loopmgr_nloops(loopmgr) + (activated ? 1 : 0);
+
+	/*
+	 * Start the new grace period.
+	 *
+	 * ATOMIC: release, to pair with the load-acquire in `reclaim_cb()`
+	 * which is spawned in a separate worker thread.
+	 */
+	uint32_t grace = ISC_QSBR_GRACE(threads, next_phase);
+	atomic_store_release(&qsbr->grace, grace);
+
+	if (activated) {
+		isc_work_enqueue(loop, reclaim_cb, reclaimed_cb, loopmgr);
+	}
+}
+
+/*
+ * This function is called once per cycle of each IO event loop by the
+ * `uv_prepare` callback below.
+ */
+void
+isc__qsbr_quiescent_state(isc_loop_t *loop) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * ATOMIC: relaxed. If our local phase already matches the
+	 * global phase we don't need to synchronize; if it does not,
+	 * this thread's presence in the thread counter prevents the
+	 * phase from changing before we get to the fuzzy barrier.
+	 */
+	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
+	if (loop->qsbr_phase == phase) {
+		maybe_wakeup(loop);
+		return;
+	}
+
+	/*
+	 * Enter the current phase and count us out of the previous phase.
+	 */
+	loop->qsbr_phase = phase;
+	if (fuzzy_barrier_not_yet(qsbr)) {
+		maybe_wakeup(loop);
+		return;
+	}
+
+	/*
+	 * We were the last thread to enter the current phase so the
+	 * grace period is up. No other thread can reach this point.
+	 */
+	phase_transition(loop, phase);
+}
+
+void
+isc__qsbr_quiescent_cb(uv_prepare_t *handle) {
+	isc_loop_t *loop = uv_handle_get_data((uv_handle_t *)handle);
+	isc__qsbr_quiescent_state(loop);
+}
+
+static void
+reclaimed_cb(void *arg) {
+	/* we are back on a loop thread */
+	isc_loopmgr_t *loopmgr = arg;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	isc_loop_t *loop = CURRENT_LOOP(loopmgr);
+
+	/*
+	 * Remove the reclaimers from the thread count, so that the
+	 * next grace period can start.
+	 */
+	if (fuzzy_barrier_not_yet(qsbr)) {
+		return;
+	}
+
+	/*
+	 * The reclaimers' worker was the last thread to be counted out:
+	 * every other thread has already passed through a quiescent state.
+	 *
+	 * We expect loop->qsbr_phase == global_phase() at this point,
+	 * except during shutdown when the phase shifts rapidly. Also,
+	 * the current loop might not have received the shutdown
+	 * message yet, so it seems easiest to omit the assertion.
+	 *
+	 * ATOMIC: relaxed, the fuzzy barrier already synchronized.
+	 */
+	TRACE("reclaimers overran");
+	phase_transition(loop, global_phase(qsbr, memory_order_relaxed));
+}
+
+static void
+reclaim_cb(void *arg) {
+	/* we are on a work thread not a loop thread */
+	isc_loopmgr_t *loopmgr = arg;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * The global phase has just been bumped by a `phase_transition()`
+	 * and it cannot change again until the grace period is up, which
+	 * cannot happen until we have finished working.
+	 *
+	 * ATOMIC: acquire, to pair with the release in `phase_transition()`.
+	 *
+	 * The phase we are to clean up is 2 before the current phase,
+	 * which is the same as the one after the current phase (mod 3).
+	 */
+	isc_qsbr_phase_t cur_phase = global_phase(qsbr, memory_order_acquire);
+	isc_qsbr_phase_t third_phase = change_phase(cur_phase);
+	unsigned int third_bit = active_bit(third_phase);
+
+	/*
+	 * If any reclaimers need to be called again later, they can use
+	 * `isc_qsbr_activate()`, so we need to clear the bit first.
+	 *
+	 * ATOMIC: acquire, so that `isc_qsbr_activate()` happens before
+	 * the callbacks are invoked.
+	 */
+	uint32_t activated = atomic_fetch_and_explicit(
+		&qsbr->activated, ~third_bit, memory_order_acquire);
+
+	/* this can happen when we are racing to clean up on shutdown */
+	if ((activated & third_bit) == 0) {
+		return;
+	}
+
+	isc_qsbr_registered_t *reclaimer = ISC_STACK_TOP(qsbreclaimers);
+	while (reclaimer != NULL) {
+		reclaimer->func(third_phase);
+		reclaimer = ISC_SLINK_NEXT(reclaimer, link);
+	}
+}
+
+void
+isc__qsbr_register(isc_qsbr_registered_t *reclaimer) {
+	REQUIRE(reclaimer->func != NULL);
+	ISC_STACK_PUSH(qsbreclaimers, reclaimer, link);
+}
+
+/*
+ * ATOMIC: This function needs to ensure that the global phase is read
+ * after a write has committed. Acquire/release ordering is not sufficient
+ * for ordering between separate atomics (the data structure's root pointer
+ * and the global phase), so it must be sequentially consistent.
+ *
+ * In general, the phases up to and including the next phase transition
+ * look like:
+ *
+ * 1. local phase
+ * 2. global phase
+ * 3. next phase
+ * 1. third phase
+ *
+ * i.e. some threads are still one behind the global phase, on the same
+ * phase that will be cleaned up immediately after the phase transition.
+ *
+ * This function is called just after a write commits. It's likely that
+ * some threads on the global phase (2) are using a version of the data
+ * structure from before the write, and they can continue using it while
+ * the straggler threads (1) catch up and cause a phase transition.
+ *
+ * The writer can be one of the straggler threads. If it incorrectly marks
+ * cleanup work with its local phase (1), memory will be reclaimed
+ * immediately after the next phase transition (when the third phase is
+ * also 1), which could be almost immediately when the writer returns to
+ * the event loop. This will cause a use-after-free for existing readers
+ * (in phase 2).
+ *
+ * More straightforwardly, we need to be able to queue up reclaim work from
+ * a thread that isn't running a loop, which also means this function has
+ * to return the global phase.
+ */
+isc_qsbr_phase_t
+isc_qsbr_phase(isc_loopmgr_t *loopmgr) {
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	return (global_phase(qsbr, memory_order_seq_cst));
+}
+
+void
+isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase) {
+	/*
+	 * ATOMIC: release ordering ensures that writing the cleanup lists
+	 * happens before the callback is invoked from a worker thread.
+	 */
+	atomic_fetch_or_release(&loopmgr->qsbr.activated, active_bit(phase));
+}
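
The phase arithmetic used in qsbr.c can be checked in isolation. This sketch mirrors `change_phase()` and `active_bit()` under the assumption that there are three phases numbered 1 to 3 (as the "3,2,1,3,2,1" comment suggests). The `sketch_` names and the `SKETCH_PHASE_MAX` constant are hypothetical stand-ins for the real types and macros.

```c
#include <assert.h>

/* ASSUMPTION: three phases, numbered 1..3; zero is never a valid phase. */
#define SKETCH_PHASE_MAX 3
typedef unsigned int sketch_phase_t;

/* Step to the next phase in the rotation 3,2,1,3,2,1,... */
static inline sketch_phase_t
sketch_change_phase(sketch_phase_t phase) {
	return (--phase > 0 ? phase : SKETCH_PHASE_MAX);
}

/* One flag bit per phase, marking that the phase has cleanup work. */
static inline unsigned int
sketch_active_bit(sketch_phase_t phase) {
	return (1U << phase);
}

/*
 * The phase whose memory is reclaimed when the global phase moves on
 * from `current`: two steps ahead, which with three phases is also
 * the phase one step behind `current`.
 */
static inline sketch_phase_t
sketch_third_phase(sketch_phase_t current) {
	return (sketch_change_phase(sketch_change_phase(current)));
}
```

With three phases, stepping once more from the third phase returns to the starting phase. That is why `phase_transition()` (two steps from the old phase) and `reclaim_cb()` (one step from the new phase) agree on which phase to clean up.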