From: Tony Finch Date: Thu, 29 Dec 2022 19:18:00 +0000 (+0000) Subject: QSBR: safe memory reclamation for lock-free data structures X-Git-Tag: v9.19.11~32^2 X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=9b7aa536ba7219d4d113d9cc4030371a88ce35c6;p=thirdparty%2Fbind9.git QSBR: safe memory reclamation for lock-free data structures This "quiescent state based reclamation" module provides support for the qp-trie module in dns/qp. It is a replacement for liburcu, written without reference to the urcu source code, and in fact it works in a significantly different way. A few specifics of BIND make this variant of QSBR somewhat simpler: * We can require that wait-free access to a qp-trie only happens in an isc_loop callback. The loop provides a natural quiescent state, after the callbacks are done, when no qp-trie access occurs. * We can dispense with any API like rcu_synchronize(). In practice, it takes far too long to wait for a grace period to elapse for each write to a data structure. * We use the idea of "phases" (aka epochs or eras) from EBR to reduce the amount of bookkeeping needed to track memory that is no longer needed, knowing that the qp-trie does most of that work already. I considered hazard pointers for safe memory reclamation. They have more read-side overhead (updating the hazard pointers) and it wasn't clear to me how to nicely schedule the cleanup work. Another alternative, epoch-based reclamation, is designed for fine-grained lock-free updates, so it needs some rethinking to work well with the heavily read-biased design of the qp-trie. QSBR has the fastest read side of the basic SMR algorithms (with no barriers), and fits well into a libuv loop. More recent hybrid SMR algorithms do not appear to have enough benefits to justify the extra complexity. --- diff --git a/CHANGES b/CHANGES index 2f23b7a7d09..620b8b886d6 100644 --- a/CHANGES +++ b/CHANGES @@ -1,3 +1,6 @@ +6109. 
[func] Infrastructure for QSBR, asynchronous safe memory + reclamation for lock-free data structures. [GL !7471] + 6108. [func] Support for simple lock-free singly-linked stacks. [GL !7470] diff --git a/doc/dev/qsbr.md b/doc/dev/qsbr.md new file mode 100644 index 00000000000..7880a56bd41 --- /dev/null +++ b/doc/dev/qsbr.md @@ -0,0 +1,397 @@ + + +QSBR: quiescent state based reclamation +======================================= + +QSBR is a safe memory reclamation (SMR) algorithm for lock-free data +structures such as a qp-trie. (See `doc/dev/qp.md`.) + +When an object is unlinked from a lock-free data structure, it +cannot be `free()`ed immediately, because there can still be readers +accessing the object via an old version of the data structure. SMR +algorithms determine when it is safe to reclaim memory after it has +been unlinked. + + +Introductions and overviews +--------------------------- + +There is a terse overview in `include/isc/qsbr.h`. + +Jeff Preshing has a nice introduction to QSBR, +__ + +At the end of this note is a copy of a blog post about writing BIND's +`isc_qsbr`, __ + +[Paul McKenney's web page][paulmck] has links to his book on +concurrent programming, the [Userspace RCU library][urcu], and more. +McKenney invented RCU and QSBR. RCU is the Linux kernel's machinery +for lock-free data structures and safe memory reclamation, based on +QSBR. + +[paulmck]: http://www.rdrop.com/~paulmck/ +[urcu]: https://liburcu.org/ + + +Example code +------------ + +If you are implementing a lock-free data structure that needs safe +memory reclamation, here's a guide to using `isc_qsbr`, based on how +QSBR is used by `dns_qp`. + +### registration + +When the program starts up you need to register a global callback +function that will reclaim unused memory. You can do so using an +ISC_CONSTRUCTOR function that runs automatically at startup. 
+ + static void + qp_qsbr_register(void) ISC_CONSTRUCTOR; + static void + qp_qsbr_register(void) { + isc_qsbr_register(qp_qsbr_reclaimer); + } + +### work list + +Your module will need somewhere that your callback can find the work +it needs to do. The qp-trie has an atomic list of `dns_qpmulti_t` +objects for this purpose. + + /* a global variable */ + static ISC_ASTACK(dns_qpmulti_t) qsbr_work; + +The reason for using global variables is so that we don't need to +allocate a thunk every time we have memory reclamation work to do. + +### read-only access + +You should design your data structure so that it has a single atomic +root pointer referring to its current version. A lock-free reader +_must_ run in an `isc_loop` callback. It gains access to the data +structure by taking a copy of this pointer: + + qp_node_t *reader = atomic_load_acquire(&multi->reader); + +During an `isc_loop` callback, a reader should keep using the same +pointer to get a consistent view of the data structure. If it reloads +the pointer it can get a different version changed by concurrent +writers. + +A reader _must_ stop using the root pointer and any interior pointers +obtained via the root pointer before it returns to the `isc_loop`. + +### modifications and writes + +All changes to the data structure must be copy-on-write (aka +read-copy-update) so that concurrent readers are not disturbed. + +When a new version of the data structure has been prepared, it is +committed by overwriting the atomic root pointer: + + atomic_store_release(&multi->reader, reader); /* COMMIT */ + +### scheduling cleanup + +After committing a change, your data structure may have memory that +will become free, after concurrent readers have stopped accessing it. 
+To reclaim the memory when it is safe, use code like: + + isc_qsbr_phase_t phase = isc_qsbr_phase(multi->loopmgr); + if (defer_chunk_reclamation(qp, phase)) { + ISC_ASTACK_ADD(qsbr_work, multi, cleanup); + isc_qsbr_activate(multi->loopmgr, phase); + } + + * First, get the current QSBR phase. + + * Second, mark free memory with the phase number. The qp-trie scans + its chunks and marks those that will become free, and returns + `true` if there is cleanup work to do. + + * If so, the qp-trie is added to the work list. (`ISC_ASTACK_ADD()` + is idempotent.) + + * Finally, QSBR is informed that there is work to do. + +In other cases it might not make sense to scan the data structure +after committing, and instead you might make note of which memory to +clean up while making changes before you know what the phase will be. +You can then have per-phase work lists, like: + + static ISC_ASTACK(my_work_t) qsbr_work[ISC_QSBR_PHASES]; + + isc_qsbr_phase_t phase = isc_qsbr_phase(loopmgr); + ISC_ASTACK_ADD(qsbr_work[phase], cleanup_work, link); + isc_qsbr_activate(loopmgr, phase); + +In general, there will be several (maybe many) write operations during +a grace period. Your lock-free data structure should collect its +reclamation work from all these writes into a batch per phase, i.e. +per grace period. + +### reclaiming + +Inside the reclaimer callback, we iterate over the work list and clean +up each item on it. If there is more cleanup work to do in another +phase, we put the qp-trie back on the work list for another go. 
+ + static void + qp_qsbr_reclaimer(isc_qsbr_phase_t phase) { + ISC_STACK(dns_qpmulti_t) drain = ISC_ASTACK_TO_STACK(qsbr_work); + while (!ISC_STACK_EMPTY(drain)) { + dns_qpmulti_t *multi = ISC_STACK_POP(drain, cleanup); + INSIST(QPMULTI_VALID(multi)); + LOCK(&multi->mutex); + if (reclaim_chunks(&multi->writer, phase)) { + /* more to do next time */ + ISC_ASTACK_PUSH(qsbr_work, multi, cleanup); + } + UNLOCK(&multi->mutex); + } + } + +### reclaim marks + +In the qp-trie data structure, each chunk has some metadata which +includes a bitfield for the reclaim phase: + + isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS; + +We use a bitfield so that all the metadata fits in a single word. + + +------------------------------------------------------------------------ + +Safe memory reclamation for BIND +================================ + +At the end of October 2022, I _finally_ got [my multithreaded +qp-trie][qp-gc] working! It could be built with two different +concurrency control mechanisms: + + * A reader/writer lock + + This has poor read-side scalability, because every thread is + hammering on the same shared location. But its write performance + is reasonably good: concurrent readers don't slow it down too much. + + * [`liburcu`, userland read-copy-update][urcu] + + RCU has a fast and scalable read side, nice! But on the write side + I used `synchronize_rcu()`, which is blocking and rather slow, so + my write performance was terrible. + +OK, but I want the best of both worlds! To fix it, I needed to change +the qp-trie code to use safe memory reclamation more effectively: +instead of blocking inside `synchronize_rcu()` before cleaning up, use +`call_rcu()` to clean up asynchronously. I expect I'll write about the +qp-trie changes another time. + +Another issue is that I want the best of both worlds _by default_, +but `liburcu` is [LGPL][] and we don't want BIND to depend on +code whose licence demands more from our users than the [MPL][]. 
+ +[qp-gc]: https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html +[LGPL]: https://opensource.org/licenses/LGPL-2.1 +[MPL]: https://opensource.org/licenses/MPL-2.0 + +So I set out to write my own safe memory reclamation support code. + + +lock freedom +------------ + +In a [multithreaded qp-trie][qp-gc], there can be many concurrent +readers, but there can be only one writer at a time and modifications +are strictly serialized. When I have got it working properly, readers +are completely wait-free, unaffected by other readers, and almost +unaffected by writers. Writers need to get a mutex to ensure there is +only one at a time, but once the mutex is acquired, a writer is not +obstructed by readers. + +The way this works is that readers use an atomic load to get a pointer +to the root of the current version of the trie. Readers can make +multiple queries using this root pointer and the results will be +consistent wrt that particular version, regardless of what changes +writers might be making concurrently. Writers do not affect readers +because all changes are made by copy-on-write. When a writer is ready +to commit a new version of the trie, it uses an atomic store to flip +the root pointer. + + +safe memory reclamation +----------------------- + +We can't copy-on-write indefinitely: we need to reclaim the memory +used by old versions of the trie. And we must do so "safely", i.e. +without `free()`ing memory that readers are still using. + +So, before `free()`ing memory, a writer must wait for a _"grace +period"_, which is a jargon term meaning "until readers are not using +the old version". There are a bunch of algorithms for determining when +a grace period is over, with varying amounts of over-approximation, +CPU overhead, and memory backlog. + +The [RCU][urcu] function `synchronize_rcu()` is slow because it blocks +waiting for a grace period; the `call_rcu()` function runs a callback +asynchronously after a grace period has passed. 
I wanted to avoid +blocking my writers, so I needed to implement something like +`call_rcu()`. + + +aversions +--------- + +When I started trying to work out how to do safe memory reclamation, +it all seemed quite intimidating. But as I learned more, I found that +my circumstances make it easier than it appeared at first. + +The [`liburcu`][urcu] homepage has a long list of supported CPU +architectures and operating systems. Do I have to care about those +details too? No! The RCU code dates back to before the age of +standardized concurrent memory models, so the RCU developers had to +invent their own atomic primitives and correctness rules. Twenty-ish +years later the state of the art has advanced, so I can use +`` without having to re-do it like `liburcu`. + +You can also choose between several algorithms implemented by +[`liburcu`][urcu], involving questions about kernel support, specially +reserved signals, and intrusiveness in application code. But while I +was working out how to schedule asynchronous memory reclamation work, +I realised that BIND is already well-suited to the fastest flavour of +RCU, called "QSBR". + + +QSBR +---- + +QSBR stands for "quiescent state based reclamation". A _"quiescent +state"_ is a fancy name for a point when a thread is not accessing a +lock-free data structure, and does not retain any root pointers or +interior pointers. + +When a thread has passed through a quiescent state, it no longer has +access to older versions of the data structures. When _all_ threads +have passed through quiescent states, then nothing in the program has +access to old versions. This is how QSBR detects grace periods: after +a writer commits a new version, it waits for all threads to pass +through quiescent states, and therefore a grace period has definitely +elapsed, and so it is then safe to reclaim the old version's memory. 
+ +QSBR is fast because readers do not need to explicitly mark the +critical section surrounding the atomic load that I mentioned earlier. +Threads just need to pass through a quiescent state frequently enough +that there isn't a huge build-up of unreclaimed memory. + +Inside an operating system kernel (RCU's native environment), a +context switch provides a natural quiescent state. In a userland +application, you need to find a good place to call +`rcu_quiescent_state()`. You could call it every time you have +finished using a root pointer, but marking a quiescent state is not +completely free, so there are probably more efficient ways. + + +`libuv` +------- + +BIND is multithreaded, and (basically) each thread runs an event loop. +Recent versions of BIND use [`libuv`][uv] for the event loops. + +A lot of things started falling into place when I realised that the +`libuv` event loop gives BIND a [natural quiescent state][uv-loop]: +when the event callbacks have finished running, and `libuv` is about +to call `select()` or `poll()` or whatever, we can mark a quiescent +state. We can require that event-handling functions do not stash root +pointers in the heap, but only use them via local variables, so we +know that old versions are inaccessible after the callback returns. + +My design marks a quiescent state once per loop, so on a busy server +where each loop has lots to do, the cost of marking a quiescent state +is amortized across several I/O events. + +[uv]: http://libuv.org/ +[uv-loop]: http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop + + +fuzzy barrier +------------- + +So, how do we mark a quiescent state? Using a _"fuzzy barrier"_. + +When a thread reaches a normal barrier, it blocks until all the other +threads have reached the barrier, after which exactly one of the +threads can enter a protected section of code, and the others are +unblocked and can proceed as normal. + +When a thread encounters a fuzzy barrier, it never blocks. 
It either +proceeds immediately as normal, or if it is the last thread to reach +the barrier, it enters the protected code. + +RCU does not actually use a fuzzy barrier as I have described it. Like +a fuzzy barrier, each thread keeps track of whether it has passed +through a quiescent state in the current grace period, without +blocking; but unlike a fuzzy barrier, no thread is diverted to the +protected code. Instead, code that wants to enter a protected section +uses the blocking `synchronize_rcu()` function. + + +EBR-ish +------- + +As in the paper ["performance of memory reclamation for lockless +synchronization"][HMBW], my implementation of QSBR uses a fuzzy +barrier designed for another safe memory reclamation algorithm, EBR, +epoch based reclamation. (EBR was invented here in Cambridge by [Keir +Fraser][tr579].) + +[HMBW]: http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf +[tr579]: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.html + +Actually, my fuzzy barrier is slightly different to EBR's. In EBR, the +fuzzy barrier is used every time the program enters a critical +section. (In qp-trie terms, that would be every time a reader fetches +a root pointer.) So it is vital that EBR's barrier avoids mutating +shared state, because that would wreck multithreaded performance. + +Because BIND will only pass through the fuzzy barrier when it is about +to use a blocking system call, my version mutates shared state more +frequently (typically, once per CPU per grace period, instead of once +per grace period). If this turns out to be a problem, it won't be too +hard to make it work more like EBR. + +More trivially, I'm using the term "phase" instead of "epoch", because +it's nothing to do with the unix epoch, because there are three +phases, and because I can talk about phase transitions and threads +being out of phase with each other. 
+ + +coda +---- + +While reading various RCU-related papers, I was amused by ["user-level +implementations of read-copy update"][DMSDW], which says: + +> BIND, a major domain-name server used for Internet domain-name +> resolution, is facing scalability issues. Since domain names +> are read often but rarely updated, using user-level RCU might be +> beneficial. + +Yes, I think it might :-) + +[DMSDW]: https://www.efficios.com/publications/ diff --git a/lib/isc/Makefile.am b/lib/isc/Makefile.am index 6689e45c08e..8e88e693110 100644 --- a/lib/isc/Makefile.am +++ b/lib/isc/Makefile.am @@ -65,6 +65,7 @@ libisc_la_HEADERS = \ include/isc/pause.h \ include/isc/portset.h \ include/isc/quota.h \ + include/isc/qsbr.h \ include/isc/radix.h \ include/isc/random.h \ include/isc/ratelimiter.h \ @@ -170,6 +171,7 @@ libisc_la_SOURCES = \ picohttpparser.h \ portset.c \ quota.c \ + qsbr.c \ radix.c \ random.c \ random_p.h \ diff --git a/lib/isc/include/isc/loop.h b/lib/isc/include/isc/loop.h index e0729f52759..2a85702c2ad 100644 --- a/lib/isc/include/isc/loop.h +++ b/lib/isc/include/isc/loop.h @@ -68,6 +68,17 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr); *\li 'loopmgr' is a valid loop manager. */ +void +isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr); +/*%< + * Send no-op events to wake up all running loops in 'loopmgr' except + * the current one. (See .) + * + * Requires: + *\li 'loopmgr' is a valid loop manager. + *\li We are in a running loop. + */ + void isc_loopmgr_pause(isc_loopmgr_t *loopmgr); /*%< diff --git a/lib/isc/include/isc/qsbr.h b/lib/isc/include/isc/qsbr.h new file mode 100644 index 00000000000..242c6ac45bb --- /dev/null +++ b/lib/isc/include/isc/qsbr.h @@ -0,0 +1,282 @@ +/* + * Copyright (C) Internet Systems Consortium, Inc. ("ISC") + * + * SPDX-License-Identifier: MPL-2.0 + * + * This Source Code Form is subject to the terms of the Mozilla Public + * License, v. 2.0. 
If a copy of the MPL was not distributed with this + * file, you can obtain one at https://mozilla.org/MPL/2.0/. + * + * See the COPYRIGHT file distributed with this work for additional + * information regarding copyright ownership. + */ + +#pragma once + +#include +#include +#include +#include + +/* + * Quiescent state based reclamation + * ================================= + * + * QSBR is a safe memory reclamation algorithm for lock-free data + * structures such as a qp-trie. + * + * When an object is unlinked from a lock-free data structure, it + * cannot be free()d immediately, because there can still be readers + * accessing the object via an old version of the data structure. SMR + * algorithms determine when it is safe to reclaim memory after it has + * been unlinked. + * + * With QSBR, reading a data structure is wait-free. All that is + * required is an atomic load to get the data structure's current + * root; there is no need to explicitly mark any read-side critical + * section. + * + * QSBR is used by RCU (read-copy-update) in the Linux kernel. BIND's + * implementation also uses some ideas from EBR (epoch-based reclamation). + * The following summary is based on the overview in the paper + * "performance of memory reclamation for lockless synchronization", + * (http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf). + * + * Aside: This QSBR implementation is somewhat different from the one + * in liburcu, described in the paper "user-level implementations of + * read-copy update", (https://www.efficios.com/publications/), which + * contains the amusing comment: + * + * BIND, a major domain-name server used for Internet domain-name + * resolution, is facing scalability issues. Since domain names + * are read often but rarely updated, using user-level RCU might + * be beneficial. + * + * A "quiescent state" is a point when a thread is not accessing any + * lock-free data structure. 
After passing through a quiescent state, + * a thread can no longer access versions of a data structure that + * were replaced before that point. In BIND, we use a point in the + * event loop (a uv_prepare_t callback) to identify a quiescent state. + * + * Aside: a prepare handle runs its callbacks before the loop sleeps, + * which reduces reclaim latency (unlike a check handle) and it does + * not affect timeout calculations (unlike an idle handle). + * + * A "grace period" is any time interval such that after the end of + * the grace period, all objects removed before the start of the grace + * period can safely be reclaimed. Different SMR algorithms detect + * grace periods with varying degrees of tightness or looseness. + * + * QSBR uses quiescent states to detect grace periods: a grace period + * is a time interval in which every thread passes through a quiescent + * state. (This is a safe over-estimate.) A "fuzzy barrier" is used to + * find out when all threads have passed through a quiescent state. + * + * NOTE: In BIND this means that code which is not running in an event + * loop thread (such as an isc_work / uv_work_t callback) must use + * locking (not lock-free) data structure accessors. + * + * Because a quiescent state happens once per event loop, a grace + * period takes roughly the same amount of time as the slowest event + * loop in each cycle. + * + * Similar to the paper linked above, this QSBR implementation uses a + * variant of the EBR fuzzy barrier. Like EBR, each grace period is + * numbered with a "phase", which cycles round 1,2,3,1,2,3,... (Phases + * are called epochs in EBR, but I think "phase" is a better metaphor.) + * When entering the fuzzy barrier, each thread updates its local phase + * to match the global phase, keeping a global count of the number of + * threads still to pass. When this count reaches zero, it is the end of + * the grace period; the global phase is updated and reclamation is + * triggered. 
+ * + * Note that threads are usually slightly out-of-phase wrt the global + * grace period. At any particular point in time, there will be some + * threads in the current global phase, and some in the previous + * global phase. EBR has three phases because that is the minimum + * number that leaves one phase unoccupied by readers. Any objects that + * were detached from the data structure in the third phase can be + * reclaimed after the start of the current phase, because a grace + * period (the previous phase) has elapsed since the objects were + * detached. + * + * A phase number can be used by a lock-free data structure (such as a + * qp-trie) to record when an object was detached. QSBR calls the data + * structure's reclaimer function, passing a phase number indicating + * that objects detached in that phase can now be reclaimed + * + * In general, there will be several (maybe many) write operations + * during a grace period. The lock-free data structures that use QSBR + * will collect their reclamation work from all these writes into a + * batch per phase, i.e. per grace period. + * + * There is some example code in `doc/dev/qsbr.md`, with pointers to + * less terse introductions to QSBR and other overview material. + */ + +#define ISC_QSBR_PHASE_BITS 2 + +typedef unsigned int isc_qsbr_phase_t; +/*%< + * A grace period phase number. It can be stored in a bitfield of size + * ISC_QSBR_PHASE_BITS. You can use zero to indicate "no phase". + * (Don't assume the maximum is three: We might want to increase the + * number of phases so that there is more than one unoccupied phase. + * This would allow concurrent reclamation of objects released in + * multiple unoccupied phases.) + */ + +typedef void +isc_qsbreclaimer_t(isc_qsbr_phase_t phase); +/*%< + * The type of memory reclaimer callback functions. + * + * The `phase` identifies which objects are to be reclaimed. 
+ * + * An isc_qsbreclaimer_t can call isc_qsbr_activate() if it could not + * reclaim everything and needs to be called again. + */ + +typedef struct isc_qsbr_registered { + ISC_SLINK(struct isc_qsbr_registered) link; + isc_qsbreclaimer_t *func; +} isc_qsbr_registered_t; +/*%< + * Each reclaimer callback has a static `isc_qsbr_registered_t` object + * so that QSBR can find it. + */ + +void +isc__qsbr_register(isc_qsbr_registered_t *reg); +/*%< + * Requires: + * \li reg->link is not linked + * \li reg->func is not NULL + */ + +#define isc_qsbr_register(cb) \ + do { \ + static isc_qsbr_registered_t registration = { \ + .link = ISC_SLINK_INITIALIZER, \ + .func = cb, \ + }; \ + isc__qsbr_register(&registration); \ + } while (0) +/*%< + * Register a callback function with QSBR. This macro should be used + * inside an `ISC_CONSTRUCTOR` function. There should be one callback + * for each lock-free data structure implementation, which is able to + * reclaim all the unused memory across all instances of its data + * structure. + */ + +isc_qsbr_phase_t +isc_qsbr_phase(isc_loopmgr_t *loopmgr); +/*%< + * Get the current phase, to use for marking detached objects. + * + * To commit a write that requires cleanup, the ordering must be: + * + * - Use atomic_store_release() to commit the data structure's new + * root pointer; release ordering ensures that the interior changes + * are written before the root pointer. + * + * - Call isc_qsbr_phase() to get the phase to be used for marking + * objects to reclaim. This must happen after the commit, to ensure + * there is at least one grace period between commit and cleanup. + * + * - Pass the same phase to isc_qsbr_activate() so that the reclaimer + * will be called after a grace period has passed. + */ + +void +isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase); +/*%< + * Tell QSBR that objects have been detached and will need reclaiming + * after a grace period. 
+ */ + +/*********************************************************************** + * + * private parts + */ + +/* + * Accessors and constructors for the `grace` variable. + * It contains two bit fields: + * + * - the global phase in the lower ISC_QSBR_PHASE_BITS + * + * - a thread counter in the upper bits + */ + +#define ISC_QSBR_ONE_THREAD (1 << ISC_QSBR_PHASE_BITS) +#define ISC_QSBR_PHASE_MAX (ISC_QSBR_ONE_THREAD - 1) + +#define ISC_QSBR_GRACE_PHASE(grace) (grace & ISC_QSBR_PHASE_MAX) +#define ISC_QSBR_GRACE_THREADS(grace) (grace >> ISC_QSBR_PHASE_BITS) +#define ISC_QSBR_GRACE(threads, phase) \ + ((threads << ISC_QSBR_PHASE_BITS) | phase) + +typedef struct isc_qsbr { + /* + * The `grace` variable keeps track of the current grace period. + * When the phase changes, the thread counter is set to the number of + * threads that need to observe the new phase before the grace period + * can end. + * + * The thread counter is an add-on to the usual EBR fuzzy barrier. + * Counting threads through the barrier adds multi-thread update + * contention, and in EBR the fuzzy barrier runs frequently enough + * (on every access) that it's important to minimize its cost. With + * QSBR, the fuzzy barrier runs less frequently (roughly, per loop, + * instead of per-callback) so contention is less of a concern. The + * thread counter helps to reduce reclaim latency, because unlike EBR + * we don't probabilistically check, we know deterministically when + * all threads have changed phase. + */ + atomic_uint_fast32_t grace; + + /* + * A flag for each phase indicating that there will be work to + * do, so we don't invoke the reclaim machinery unnecessarily. + * Set by `isc_qsbr_activate()` and cleared before the reclaimer + * functions are invoked (so they can re-set their flag if + * necessary). + */ + atomic_uint_fast32_t activated; + + /* + * The time of the last phase transition (isc_nanosecs_t). Used + * to ensure that grace periods do not last forever. 
We use + * `isc_time_monotonic()` because we need the same time in all + * threads. (`uv_now()` is different in different threads.) + */ + atomic_uint_fast64_t transition_time; + +} isc_qsbr_t; + +/* + * When we start there is no worker thread yet, so the thread + * count is equal to the number of loops. The global phase starts + * off at one (it must always be nonzero). + */ +#define ISC_QSBR_INITIALIZER(nloops) \ + (isc_qsbr_t) { \ + .grace = ISC_QSBR_GRACE(nloops, 1), \ + .transition_time = isc_time_monotonic(), \ + } + +/* + * For use by tests that need to explicitly drive QSBR phase transitions. + */ +void +isc__qsbr_quiescent_state(isc_loop_t *loop); + +/* + * Used by the loopmgr + */ +void +isc__qsbr_quiescent_cb(uv_prepare_t *handle); +void +isc__qsbr_destroy(isc_loopmgr_t *loopmgr); diff --git a/lib/isc/loop.c b/lib/isc/loop.c index 81578b8dd80..61d28c3ba16 100644 --- a/lib/isc/loop.c +++ b/lib/isc/loop.c @@ -26,12 +26,14 @@ #include #include #include +#include #include #include #include #include #include #include +#include #include #include #include @@ -64,8 +66,6 @@ isc_loopmgr_shutdown(isc_loopmgr_t *loopmgr) { isc_loop_t *loop = &loopmgr->loops[i]; int r; - REQUIRE(!atomic_load(&loop->finished)); - r = uv_async_send(&loop->shutdown_trigger); UV_RUNTIME_CHECK(uv_async_send, r); } @@ -143,6 +143,8 @@ destroy_cb(uv_async_t *handle) { uv_close(&loop->destroy_trigger, NULL); uv_close(&loop->queue_trigger, NULL); uv_close(&loop->pause_trigger, NULL); + uv_close(&loop->wakeup_trigger, NULL); + uv_close(&loop->quiescent, NULL); uv_walk(&loop->loop, loop_walk_cb, (char *)"destroy_cb"); } @@ -153,6 +155,8 @@ shutdown_cb(uv_async_t *handle) { isc_loop_t *loop = uv_handle_get_data(handle); isc_loopmgr_t *loopmgr = loop->loopmgr; + loop->shuttingdown = true; + /* Make sure, we can't be called again */ uv_close(&loop->shutdown_trigger, shutdown_trigger_close_cb); @@ -194,6 +198,12 @@ queue_cb(uv_async_t *handle) { } } +static void +wakeup_cb(uv_async_t *handle) { + 
/* we only woke up to make the loop take a spin */ + UNUSED(handle); +} + static void loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) { *loop = (isc_loop_t){ @@ -223,6 +233,13 @@ loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) { UV_RUNTIME_CHECK(uv_async_init, r); uv_handle_set_data(&loop->destroy_trigger, loop); + r = uv_async_init(&loop->loop, &loop->wakeup_trigger, wakeup_cb); + UV_RUNTIME_CHECK(uv_async_init, r); + + r = uv_prepare_init(&loop->loop, &loop->quiescent); + UV_RUNTIME_CHECK(uv_prepare_init, r); + uv_handle_set_data(&loop->quiescent, loop); + char name[16]; snprintf(name, sizeof(name), "loop-%08" PRIx32, tid); isc_mem_create(&loop->mctx); @@ -248,6 +265,9 @@ loop_run(isc_loop_t *loop) { job = next; } + r = uv_prepare_start(&loop->quiescent, isc__qsbr_quiescent_cb); + UV_RUNTIME_CHECK(uv_prepare_start, r); + isc_barrier_wait(&loop->loopmgr->starting); r = uv_run(&loop->loop, UV_RUN_DEFAULT); @@ -330,6 +350,7 @@ isc_loopmgr_create(isc_mem_t *mctx, uint32_t nloops, isc_loopmgr_t **loopmgrp) { loopmgr = isc_mem_get(mctx, sizeof(*loopmgr)); *loopmgr = (isc_loopmgr_t){ .nloops = nloops, + .qsbr = ISC_QSBR_INITIALIZER(nloops), }; isc_mem_attach(mctx, &loopmgr->mctx); @@ -463,6 +484,22 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr) { loop_thread(&loopmgr->loops[0]); } +void +isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr) { + REQUIRE(VALID_LOOPMGR(loopmgr)); + + for (size_t i = 0; i < loopmgr->nloops; i++) { + isc_loop_t *loop = &loopmgr->loops[i]; + + /* Skip current loop */ + if (i == isc_tid()) { + continue; + } + + uv_async_send(&loop->wakeup_trigger); + } +} + void isc_loopmgr_pause(isc_loopmgr_t *loopmgr) { REQUIRE(VALID_LOOPMGR(loopmgr)); @@ -481,7 +518,6 @@ isc_loopmgr_pause(isc_loopmgr_t *loopmgr) { continue; } - REQUIRE(!atomic_load(&loop->finished)); uv_async_send(&loop->pause_trigger); } diff --git a/lib/isc/loop_p.h b/lib/isc/loop_p.h index e9e2fb58839..667f8510974 100644 --- a/lib/isc/loop_p.h +++ b/lib/isc/loop_p.h 
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -52,7 +53,6 @@ struct isc_loop {

 	/* states */
 	bool paused;
-	atomic_bool finished;
 	bool shuttingdown;

 	/* Async queue */
@@ -69,6 +69,11 @@ struct isc_loop {

 	/* Destroy */
 	uv_async_t destroy_trigger;
+
+	/* safe memory reclamation */
+	uv_async_t wakeup_trigger;
+	uv_prepare_t quiescent;
+	isc_qsbr_phase_t qsbr_phase;
 };

 /*
@@ -103,6 +108,9 @@ struct isc_loopmgr {

 	/* per-thread objects */
 	isc_loop_t *loops;
+
+	/* safe memory reclamation */
+	isc_qsbr_t qsbr;
 };

 /*
diff --git a/lib/isc/qsbr.c b/lib/isc/qsbr.c
new file mode 100644
index 00000000000..c122770c143
--- /dev/null
+++ b/lib/isc/qsbr.c
@@ -0,0 +1,393 @@
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "loop_p.h"
+
+#define MAX_GRACE_PERIOD_NS 53 * NS_PER_MS
+
+#if 0
+#define TRACE(fmt, ...) \
+	isc_log_write(isc_lctx, ISC_LOGCATEGORY_GENERAL, ISC_LOGMODULE_OTHER, \
+		      ISC_LOG_DEBUG(7), "%s:%u:%s():t%u: " fmt, __FILE__, \
+		      __LINE__, __func__, isc_tid(), ##__VA_ARGS__)
+#else
+#define TRACE(...)
+#endif
+
+static ISC_STACK(isc_qsbr_registered_t) qsbreclaimers = ISC_STACK_INITIALIZER;
+
+static void
+reclaim_cb(void *arg);
+static void
+reclaimed_cb(void *arg);
+
+/**********************************************************************/
+
+/*
+ * 3,2,1,3,2,1,...
+ */
+static isc_qsbr_phase_t
+change_phase(isc_qsbr_phase_t phase) {
+	return (--phase > 0 ? phase : ISC_QSBR_PHASE_MAX);
+}
+
+/*
+ * For marking or checking that a phase has cleanup work to do.
+ */
+static unsigned int
+active_bit(isc_qsbr_phase_t phase) {
+	return (1 << phase);
+}
+
+/*
+ * Extract the global phase from the grace period state.
+ */
+static isc_qsbr_phase_t
+global_phase(isc_qsbr_t *qsbr, memory_order m_o) {
+	uint32_t grace = atomic_load_explicit(&qsbr->grace, m_o);
+	return (ISC_QSBR_GRACE_PHASE(grace));
+}
+
+/*
+ * Record that the current thread has passed the barrier.
+ * Returns true if more threads still need to pass.
+ *
+ * ATOMIC: acquire-release, to ensure that this is not reordered wrt
+ * read-only accesses to lock-free data structures. This implements the
+ * ordering requirements of a quiescent state.
+ */
+static bool
+fuzzy_barrier_not_yet(isc_qsbr_t *qsbr) {
+	uint32_t grace = atomic_fetch_sub_acq_rel(&qsbr->grace,
+						  ISC_QSBR_ONE_THREAD);
+	uint32_t threads = ISC_QSBR_GRACE_THREADS(grace);
+	return (threads > 1);
+}
+
+/*
+ * Ungracefully drive all cleanup work to completion.
+ *
+ * ATOMIC: everything is relaxed, because we assume that concurrent
+ * readers have already finished. `reclaim_cb()` uses the `activated`
+ * flags to ensure it is OK that threads will race to complete the
+ * cleanup.
+ */
+static void
+qsbr_shutdown(isc_loopmgr_t *loopmgr) {
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
+	uint32_t threads = isc_loopmgr_nloops(loopmgr);
+	uint32_t grace;
+
+	while (atomic_load_relaxed(&qsbr->activated) != 0) {
+		reclaim_cb(loopmgr);
+		phase = change_phase(phase);
+		grace = ISC_QSBR_GRACE(threads, phase);
+		atomic_store_relaxed(&qsbr->grace, grace);
+	}
+}
+
+/*
+ * On a quiet server that does not have enough network traffic to keep
+ * all its threads spinning, grace periods might extend indefinitely.
+ * So check if we have been waiting an unreasonably long time since
+ * the last phase change. If so, send a no-op async request to every
+ * thread to make them all cycle through a quiescent state.
+ */
+static void
+maybe_wakeup(isc_loop_t *loop) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * ATOMIC: relaxed is OK here because we don't use any values guarded
+	 * by the `activated` flags.
+	 */
+	if (atomic_load_relaxed(&qsbr->activated) == 0) {
+		return;
+	}
+	if (loop->shuttingdown) {
+		qsbr_shutdown(loopmgr);
+		return;
+	}
+
+	/*
+	 * ATOMIC: relaxed, because the `transition_time` doesn't guard any
+	 * other values, just the isc_loopmgr_wakeup() call below.
+	 */
+	atomic_uint_fast64_t *qsbr_ttp = &qsbr->transition_time;
+	isc_nanosecs_t now = isc_time_monotonic();
+	isc_nanosecs_t start = atomic_load_relaxed(qsbr_ttp);
+	if (now < start + MAX_GRACE_PERIOD_NS) {
+		return;
+	}
+
+	/*
+	 * To stop other threads from also invoking `isc_loopmgr_wakeup()`,
+	 * we try to push the timer into the future (expecting that it will
+	 * not trigger again), and quit if someone else got there first.
+	 * ATOMIC: relaxed, as before; strong, because there is no retry loop.
+	 */
+	if (!atomic_compare_exchange_strong_relaxed(qsbr_ttp, &start, now)) {
+		return;
+	}
+
+	TRACE("long grace period of %llu ns, waking up other threads",
+	      (unsigned long long)(now - start));
+
+	isc_loopmgr_wakeup(loopmgr);
+}
+
+/*
+ * Callers use the fuzzy barrier to ensure only one thread can enter
+ * this function at a time.
+ *
+ * Phase transitions happen at roughly the same frequency that IO
+ * event loops cycle, limited by the slowest loop in each cycle.
+ */
+static void
+phase_transition(isc_loop_t *loop, isc_qsbr_phase_t current_phase) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	if (loop->shuttingdown) {
+		qsbr_shutdown(loopmgr);
+		return;
+	}
+
+	/*
+	 * After we change phase, threads will be in either the `current_phase`
+	 * or the `next_phase`. We will reclaim memory from the `third_phase`.
+	 *
+	 * ATOMIC: relaxed is OK here because the necessary synchronization
+	 * happens in `reclaim_cb()`.
+	 */
+	isc_qsbr_phase_t next_phase = change_phase(current_phase);
+	isc_qsbr_phase_t third_phase = change_phase(next_phase);
+	bool activated = atomic_load_relaxed(&qsbr->activated) &
+			 active_bit(third_phase);
+
+	/*
+	 * Reset the wakeup timer, and log the length of the grace period.
+	 * ATOMIC: relaxed, per the commentary in `maybe_wakeup()`.
+	 */
+	atomic_uint_fast64_t *qsbr_tt = &qsbr->transition_time;
+	isc_nanosecs_t now = isc_time_monotonic();
+	isc_nanosecs_t start = atomic_exchange_relaxed(qsbr_tt, now);
+	TRACE("phase %u -> %u after grace period of %f ms", current_phase,
+	      next_phase, (double)(now - start) / NS_PER_MS);
+	UNUSED(start); /* ifndef TRACE() */
+
+	/*
+	 * Work out the threads counter for this grace period.
+	 *
+	 * We need to add one for any reclamation worker thread, to
+	 * prevent us from changing phase before the work is done. If
+	 * we change too early, any newly detached objects will be
+	 * marked with the same phase as the running reclaimer, which
+	 * might lead to them being free()d too soon.
+	 */
+	uint32_t threads = isc_loopmgr_nloops(loopmgr) + (activated ? 1 : 0);
+
+	/*
+	 * Start the new grace period.
+	 *
+	 * ATOMIC: release, to pair with the load-acquire in `reclaim_cb()`
+	 * which is spawned in a separate worker thread.
+	 */
+	uint32_t grace = ISC_QSBR_GRACE(threads, next_phase);
+	atomic_store_release(&qsbr->grace, grace);
+
+	if (activated) {
+		isc_work_enqueue(loop, reclaim_cb, reclaimed_cb, loopmgr);
+	}
+}
+
+/*
+ * This function is called once per cycle of each IO event loop by the
+ * `uv_prepare` callback below.
+ */
+void
+isc__qsbr_quiescent_state(isc_loop_t *loop) {
+	isc_loopmgr_t *loopmgr = loop->loopmgr;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * ATOMIC: relaxed. If we are in phase then we don't need to
+	 * synchronize; if we are not then this thread's presence in
+	 * the thread counter will prevent the phase from changing
+	 * before we get to the fuzzy barrier.
+	 */
+	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
+	if (loop->qsbr_phase == phase) {
+		maybe_wakeup(loop);
+		return;
+	}
+
+	/*
+	 * Enter the current phase and count us out of the previous phase.
+	 */
+	loop->qsbr_phase = phase;
+	if (fuzzy_barrier_not_yet(qsbr)) {
+		maybe_wakeup(loop);
+		return;
+	}
+
+	/*
+	 * We were the last thread to enter the current phase so the
+	 * grace period is up. No other thread can reach this point.
+	 */
+	phase_transition(loop, phase);
+}
+
+void
+isc__qsbr_quiescent_cb(uv_prepare_t *handle) {
+	isc_loop_t *loop = uv_handle_get_data((uv_handle_t *)handle);
+	isc__qsbr_quiescent_state(loop);
+}
+
+static void
+reclaimed_cb(void *arg) {
+	/* we are back on a loop thread */
+	isc_loopmgr_t *loopmgr = arg;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	isc_loop_t *loop = CURRENT_LOOP(loopmgr);
+
+	/*
+	 * Remove the reclaimers from the thread count, so that the
+	 * next grace period can start.
+	 */
+	if (fuzzy_barrier_not_yet(qsbr)) {
+		return;
+	}
+
+	/*
+	 * The reclaimers were the last thread to be counted out: every
+	 * other thread already passed through a quiescent state.
+	 *
+	 * We expect loop->qsbr_phase == global_phase() at this point,
+	 * except during shutdown when the phase shifts rapidly. Also,
+	 * the current loop might not have received the shutdown
+	 * message yet, so it seems easiest to omit the assertion.
+	 *
+	 * ATOMIC: relaxed, the fuzzy barrier already synchronized.
+	 */
+	TRACE("reclaimers overran");
+	phase_transition(loop, global_phase(qsbr, memory_order_relaxed));
+}
+
+static void
+reclaim_cb(void *arg) {
+	/* we are on a work thread not a loop thread */
+	isc_loopmgr_t *loopmgr = arg;
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+
+	/*
+	 * The global phase has just been bumped by a `phase_transition()`
+	 * and it cannot change again until the grace period is up, which
+	 * cannot happen until we have finished working.
+	 *
+	 * ATOMIC: acquire, to pair with the release in `phase_transition()`.
+	 *
+	 * The phase we are to clean up is 2 before the current phase,
+	 * which is the same as the one after the current phase (mod 3).
+	 */
+	isc_qsbr_phase_t cur_phase = global_phase(qsbr, memory_order_acquire);
+	isc_qsbr_phase_t third_phase = change_phase(cur_phase);
+	unsigned int third_bit = active_bit(third_phase);
+
+	/*
+	 * If any reclaimers need to be called again later, they can use
+	 * `isc_qsbr_activate()`, so we need to clear the bit first.
+	 *
+	 * ATOMIC: acquire, so that `isc_qsbr_activate()` happens before
+	 * the callbacks are invoked.
+	 */
+	uint32_t activated = atomic_fetch_and_explicit(
+		&qsbr->activated, ~third_bit, memory_order_acquire);
+
+	/* this can happen when we are racing to clean up on shutdown */
+	if ((activated & third_bit) == 0) {
+		return;
+	}
+
+	isc_qsbr_registered_t *reclaimer = ISC_STACK_TOP(qsbreclaimers);
+	while (reclaimer != NULL) {
+		reclaimer->func(third_phase);
+		reclaimer = ISC_SLINK_NEXT(reclaimer, link);
+	}
+}
+
+void
+isc__qsbr_register(isc_qsbr_registered_t *reclaimer) {
+	REQUIRE(reclaimer->func != NULL);
+	ISC_STACK_PUSH(qsbreclaimers, reclaimer, link);
+}
+
+/*
+ * ATOMIC: This function needs to ensure that the global phase is read
+ * after a write has committed. Acquire/release ordering is not sufficient
+ * for ordering between separate atomics (the data structure's root pointer
+ * and the global phase), so it must be sequentially consistent.
+ *
+ * In general, the phases up to and including the next phase transition
+ * look like:
+ *
+ *	1. local phase
+ *	2. global phase
+ *	3. next phase
+ *	1. third phase
+ *
+ * i.e. some threads are still one behind the global phase, on the same
+ * phase that will be cleaned up immediately after the phase transition.
+ *
+ * This function is called just after a write commits. It's likely that
+ * some threads on the global phase (2) are using a version of the data
+ * structure from before the write, and they can continue using it while
+ * the straggler threads (1) catch up and cause a phase transition.
+ *
+ * The writer can be one of the straggler threads. If it incorrectly marks
+ * cleanup work with its local phase (1), memory will be reclaimed
+ * immediately after the next phase transition (when the third phase is
+ * also 1), which could be almost immediately when the writer returns to
+ * the event loop. This will cause a use-after-free for existing readers
+ * (in phase 2).
+ *
+ * More straightforwardly, we need to be able to queue up reclaim work from
+ * a thread that isn't running a loop, which also means this function has
+ * to return the global phase.
+ */
+isc_qsbr_phase_t
+isc_qsbr_phase(isc_loopmgr_t *loopmgr) {
+	isc_qsbr_t *qsbr = &loopmgr->qsbr;
+	return (global_phase(qsbr, memory_order_seq_cst));
+}
+
+void
+isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase) {
+	/*
+	 * ATOMIC: release ordering ensures that writing the cleanup lists
+	 * happens before the callback is invoked from a worker thread.
+	 */
+	atomic_fetch_or_release(&loopmgr->qsbr.activated, active_bit(phase));
+}
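
Reviewer's note: the interplay between `isc__qsbr_quiescent_state()`, `fuzzy_barrier_not_yet()` and `phase_transition()` in `lib/isc/qsbr.c` can be easier to follow in isolation. The following is a minimal standalone sketch, not the code from this patch. All `sketch_*` and `SKETCH_*` names are hypothetical, and the layout of the grace word (phase in the low two bits, thread count above it) is an assumption made for illustration; the real `ISC_QSBR_GRACE*` macros are not shown in this part of the diff. The sketch leaves out the `activated` bits, the extra count for a running reclaimer, the wakeup timer and shutdown handling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed packing: low 2 bits hold the phase (1..3), the rest the thread count. */
#define SKETCH_PHASE_BITS 2u
#define SKETCH_PHASE_MAX 3u
#define SKETCH_ONE_THREAD (1u << SKETCH_PHASE_BITS)
#define SKETCH_GRACE(threads, phase) \
	(((uint32_t)(threads) << SKETCH_PHASE_BITS) | (uint32_t)(phase))
#define SKETCH_PHASE(grace) ((unsigned int)((grace) & (SKETCH_ONE_THREAD - 1)))
#define SKETCH_THREADS(grace) ((uint32_t)(grace) >> SKETCH_PHASE_BITS)

struct sketch_qsbr {
	_Atomic uint32_t grace; /* thread counter and global phase in one word */
};

/* Phases cycle 3,2,1,3,2,1,... as in change_phase() in the patch. */
static unsigned int
sketch_change_phase(unsigned int phase) {
	return (--phase > 0 ? phase : SKETCH_PHASE_MAX);
}

/*
 * One pass through a quiescent state for the thread whose local phase is
 * *local. Returns 0 when no phase transition happened, otherwise the phase
 * whose detached memory would now be safe to reclaim.
 */
static unsigned int
sketch_quiescent(struct sketch_qsbr *q, unsigned int *local,
		 unsigned int nthreads) {
	uint32_t grace = atomic_load_explicit(&q->grace, memory_order_relaxed);
	unsigned int phase = SKETCH_PHASE(grace);

	if (*local == phase) {
		return (0); /* already counted out for this grace period */
	}

	/* Enter the global phase and pass the fuzzy barrier. */
	*local = phase;
	uint32_t old = atomic_fetch_sub_explicit(&q->grace, SKETCH_ONE_THREAD,
						 memory_order_acq_rel);
	if (SKETCH_THREADS(old) > 1) {
		return (0); /* other threads still have to pass */
	}

	/* Last thread through: start the next grace period. */
	unsigned int next = sketch_change_phase(phase);
	unsigned int third = sketch_change_phase(next);
	atomic_store_explicit(&q->grace, SKETCH_GRACE(nthreads, next),
			      memory_order_release);
	return (third);
}
```

With two simulated loops starting in phase 3, only the second loop to pass the barrier triggers a transition. Memory marked with phase 3 is reported as reclaimable after the second transition, which matches the "2 before the current phase" comment in `reclaim_cb()`.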