--- /dev/null
+<!--
+Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+
+SPDX-License-Identifier: MPL-2.0
+
+This Source Code Form is subject to the terms of the Mozilla Public
+License, v. 2.0. If a copy of the MPL was not distributed with this
+file, you can obtain one at https://mozilla.org/MPL/2.0/.
+
+See the COPYRIGHT file distributed with this work for additional
+information regarding copyright ownership.
+-->
+
+A qp-trie for the DNS
+=====================
+
+A qp-trie is a data structure that supports lookups in a sorted
+collection of keys. It is efficient both in terms of fast lookups and
+using little memory. It is particularly well-suited for use in DNS
+servers.
+
+These notes outline how BIND's `dns_qp` implementation works, how it
+is optimized for lookups keyed by DNS names, and how it supports
+multi-version concurrency.
+
+
+data structure zoo
+------------------
+
+Chasing a pointer indirection is very slow, up to 100ns, whereas a
+sequential memory access takes less than 10ns. So, to make a data
+structure fast, we need to minimize indirections.
+
+There is a tradeoff between speed and flexibility in standard data
+structures:
+
+ * Arrays are very simple and fast (a lookup goes straight to the
+ right address), but the key can only be a small integer.
+
+ * Hash tables allow you to use arbitrary lookup keys (such as
+ strings), but may require probing multiple addresses to find the
+ right element.
+
+ * Radix trees allow you to do lookups based on the sorting order of
+ the keys, provided it is lexical like `memcmp()`; however, lookups
+ require multiple indirections.
+
+ * Comparison search trees (binary trees and B-trees) allow you to
+ use an arbitrary ordering predicate, but each indirection during
+ a lookup also requires a comparison.
+
+In the DNS, we need to use some kind of tree to support the kinds of
+lookup required for DNSSEC: find longest match, find nearest
+predecessor or successor, and so forth. So what kind of tree is best?
+
+
+in theory
+---------
+
+In a tree where the average length of a key is `k`, and the number of
+elements in the tree is `n`, the theoretical performance bounds are,
+for a comparison tree:
+
+ * `Ω(k * log n)`
+ * `Ο(k * n)`
+
+And for a radix tree:
+
+ * `Ω(k + log n)`
+ * `Ο(k + k)`
+
+Here, `Ω()` is the lower bound and `Ο()` is the upper bound; we
+expect typical performance to be close to the lower bound.
+
+The multiplications in the comparison tree expressions mean that each
+indirection requires a comparison `Ο(k)`, whereas they are additions
+in the radix tree expressions because a radix tree traversal only
+needs one key comparison.
+
+The upper bounds say that (in the absence of balancing) a comparison
+tree can devolve into a linked list of nodes, whereas the shape of a
+radix tree is determined by the set of keys independent of the order
+of insertion or the number of keys.
+
+The logarithms hide some interesting constant factors. In a binary
+tree, the log is base 2. In a radix tree, the radix is the base of the
+logarithm. So, if we increase the radix, the constant factor gets
+smaller. The rough equivalent for a binary tree would be to use a
+B-tree instead, but although B-trees have fewer indirections they do
+not reduce the number of comparisons.
+
+In implementation terms, a larger radix means tree nodes get wider
+and the tree becomes shallower. A shallower tree requires fewer
+indirections, so it should be faster. The trick is to increase the
+radix without blowing up the tree's memory usage, which can lose
+more performance than we win.
+
+This analysis suggests that a radix tree is better than a comparison
+tree, provided keys can be compared lexically - which is true for DNS
+names, with some rearrangement (described below). When using big-O
+notation, we also need to be wary of the constant factors; but in this
+case they also favour a radix tree, especially with the optimization
+tricks used by BIND's qp-trie.
+
+Note: "radix" comes from the Latin for "root", so "radix tree" is a
+pun, which is geekily amusing especially when talking about logs.
+
+
+what is a trie?
+---------------
+
+A trie is another name for a radix tree (or "digital tree" according
+to Knuth). It is short for information reTRIEval, and I pronounce it
+exactly like "tree" (though Knuth pronounces it like "try").
+
+In a trie, keys are divided into digits depending on some radix, e.g.
+base 2 for binary tries, base 256 for byte-indexed tries. When
+searching the trie, successive digits in the key, from most to least
+significant, are used to select branches from successive nodes in
+the trie, roughly like:
+
+    for (offset = 0; isbranch(node); offset++)
+        node = node->child[key[offset]];
+
+All of the keys in a subtrie have identical prefixes. Tries do not
+need to store keys since they are implicit in the structure.
+
+
+binary crit-bit trees
+---------------------
+
+A patricia trie is a binary trie which omits nodes that have only one
+child. Dan Bernstein calls his tightly space-optimized version a
+"crit-bit tree".
+https://cr.yp.to/critbit.html
+https://github.com/agl/critbit/
+
+Unlike a basic trie, a crit-bit tree skips parts of the key when
+every element in a subtree shares the same sequence of bits.
+Each node is annotated with the offset of the bit that is used to
+select the branch; offsets always increase as you go deeper into
+the tree.
+
+    while (isbranch(node))
+        node = node->child[key[node->offset]];
+
+In a crit-bit tree the keys are not implicit in the structure
+because parts of them are skipped. Therefore, each leaf refers to a
+copy of its key so that when you find a leaf you can verify that the
+skipped bits match.
+
+
+prefetching
+-----------
+
+Observe that in the loop above, the current node has only one child
+pointer, and the child nodes are adjacent in memory. This means it
+is possible to tell the CPU to prefetch the child nodes before
+extracting the critical bit from the key and choosing which child is
+next. A qp-trie has a similar layout, but it has more child nodes
+(still adjacent in memory) and it does more computation to choose
+which one is next.
+
+When I originally invented the qp-trie code, I found that explicit
+prefetch hints made the qp-trie substantially faster and the crit-bit
+tree slightly faster. The hints help the CPU to do useful work at the
+same time as the memory subsystem. (This is unusual for linked data
+structures, which tend to alternate between CPU waiting for memory,
+and memory waiting for CPU.)
+
+Large modern CPUs (after about 2015) are better at prefetching
+automatically, so the explicit hint is less important than it used to
+be, but `lib/dns/qp.c` still has `__builtin_prefetch()` hints in its
+inner traversal loops.
+
+
+packed sparse vectors with popcount
+-----------------------------------
+
+The `popcount` instruction counts the number of bits that are set
+in a word. It's also known as the Hamming weight; Knuth calls it
+"sideways add". https://en.wikipedia.org/wiki/popcount
+
+You can use `popcount` to implement a sparse vector of length `N`
+containing `M <= N` members using a bitmap of length `N` and a packed
+vector of `M` elements. A member `b` is present in the vector if bit
+`b` is set, so `M == popcount(bitmap)`. The index of member `b` in
+the packed vector is the popcount of the bits preceding `b`.
+
+    // size of vector
+    size = popcount(bitmap);
+    // bit position
+    bit = 1 << b;
+    // is element present?
+    if (bitmap & bit) {
+        // mask covers the preceding elements
+        mask = bit - 1;
+        // position of element in packed vector
+        pos = popcount(bitmap & mask);
+        // fetch element
+        elem = vector[pos];
+    }
+
+See "Hacker's Delight" by Hank Warren, section 5-1 "Counting 1
+bits", subsection "applications". http://www.hackersdelight.org
+
+See under _"bitmap popcount shenanigans"_ in `lib/dns/qp.c` for how
+this is implemented in BIND.
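
As a hedged illustration (this is not the code in `lib/dns/qp.c`), the
fragment above can be fleshed out into a small self-contained C sketch.
The names `sparse_vec`, `popcount64()`, and `sparse_get()` are made up
for this example, and the portable bit-clearing loop stands in for a
hardware `popcount` instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sparse vector with up to 64 slots (not BIND's layout). */
struct sparse_vec {
	uint64_t bitmap;  /* bit b is set when member b is present */
	const int *elems; /* packed array of popcount(bitmap) elements */
};

/* Portable popcount: each pass clears the lowest set bit. */
static unsigned
popcount64(uint64_t w) {
	unsigned n = 0;
	while (w != 0) {
		w &= w - 1;
		n++;
	}
	return n;
}

/* Look up member b; returns true and stores it in *out if present. */
static bool
sparse_get(const struct sparse_vec *v, unsigned b, int *out) {
	assert(b < 64);
	uint64_t bit = UINT64_C(1) << b;
	if ((v->bitmap & bit) == 0) {
		return false;
	}
	/* bit - 1 masks the members that sort before b */
	*out = v->elems[popcount64(v->bitmap & (bit - 1))];
	return true;
}
```

With members 1, 5, and 40 present, the packed array has three elements
and member 5 lives at packed position 1, because exactly one set bit
precedes it.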
+
+
+popcount for trie nodes
+-----------------------
+
+Phil Bagwell's hashed array-mapped tries (HAMT) use popcount for
+compact trie nodes. In a HAMT, string keys are hashed, and the hash is
+used as the index to the trie, with radix 2^32 or 2^64.
+http://infoscience.epfl.ch/record/64394/files/triesearches.pdf
+http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf
+
+As discussed above, increasing the radix makes the tree shallower, so
+it should be faster. The downside is usually much greater memory
+overhead. Child vectors are often sparsely populated, so we can
+greatly reduce the overhead by packing them with popcount.
+
+The HAMT relies on hashing, which keeps keys dense. This means it
+can be laid out like a basic trie with implicit keys (i.e. hash
+values). The disadvantage of hashing is that strings are stored
+out of order.
+
+
+qp-trie
+-------
+
+A qp-trie is a mash-up of Bernstein's crit-bit tree with Bagwell's
+HAMT. Like a crit-bit tree, a qp-trie omits nodes with one child;
+nodes include a key offset; and keys are referenced from leaves instead
+of being implicit in the trie structure. Like a HAMT, nodes have a
+popcount packed vector of children, but unlike a HAMT, keys are not
+hashed.
+
+A qp-trie is faster than a crit-bit tree and uses less memory, because
+its wider fan-out requires fewer nodes and popcount packs them very
+efficiently. Like a crit-bit tree but unlike a HAMT, a qp-trie stores
+keys in lexical order.
+
+As in a HAMT, the original layout of a qp-trie node is a pair of
+words, which are used as key and value pointers in leaf nodes, and
+index word and pointer in branch nodes. The index word contains the
+popcount bitmap (as in a HAMT) and the offset into the key (as in a
+crit-bit tree), as well as a leaf/branch tag bit. The pointer refers
+to the branch node's "twigs", which is what we call the packed sparse
+vector of child nodes.
+
+The fan-out of a qp-trie is limited by the need to fit the bitmap and
+the nybble offset into a 64-bit word; a radix of 16 or 32 works well,
+and 32 is slightly faster (though 5-bit nybbles are fiddly). But radix
+64 requires an extra word per node, and the extra memory overhead
+makes it slower as well as bulkier.
+
+Early qp-trie implementations used a node layout like the
+following. However, in practice C bitfields have too many
+portability gotchas to work well. It is better to use hand-written
+shifting and masking to access the parts of the index word.
+
+    #define NYBBLE 4 // or 5
+    #define RADIX (1 << NYBBLE)
+
+    union qp_node {
+        struct {
+            unsigned tag : 1;
+            unsigned bitmap : RADIX;
+            unsigned offset : (64 - 1 - RADIX);
+            union qp_node *twigs;
+        } branch;
+        struct {
+            void *value;
+            const char *key;
+        } leaf;
+    };
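
To show what hand-written shifting and masking might look like, here is
a minimal sketch for the radix-16 case. The field order (tag in bit 0,
then the bitmap, then the offset) and all of the `index_*` names are
assumptions made for illustration; they are not taken from BIND.

```c
#include <assert.h>
#include <stdint.h>

#define NYBBLE 4
#define RADIX  (1u << NYBBLE)

/* Assumed layout: bit 0 = tag, bits 1..RADIX = bitmap, the rest = offset. */
#define TAG_MASK     UINT64_C(1)
#define BITMAP_SHIFT 1
#define BITMAP_MASK  ((UINT64_C(1) << RADIX) - 1)
#define OFFSET_SHIFT (1 + RADIX)

/* Pack the three fields into one 64-bit index word. */
static uint64_t
index_make(unsigned tag, uint64_t bitmap, uint64_t offset) {
	assert(tag <= 1);
	assert(bitmap <= BITMAP_MASK);
	assert(offset < (UINT64_C(1) << (64 - OFFSET_SHIFT)));
	return (uint64_t)tag | (bitmap << BITMAP_SHIFT) |
	       (offset << OFFSET_SHIFT);
}

/* Accessors: shift the field down, then mask off its neighbours. */
static unsigned
index_tag(uint64_t w) {
	return (unsigned)(w & TAG_MASK);
}

static uint64_t
index_bitmap(uint64_t w) {
	return (w >> BITMAP_SHIFT) & BITMAP_MASK;
}

static uint64_t
index_offset(uint64_t w) {
	return w >> OFFSET_SHIFT;
}
```

Unlike bitfields, the position of every field here is fixed by the
source code rather than by the compiler's ABI.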
+
+
+DNS qp-trie
+-----------
+
+BIND uses a variant of a qp-trie optimized for DNS names. DNS names
+almost always use the usual hostname alphabet of (case-insensitive)
+letters, digits, hyphen, plus underscore (which is often used in the DNS
+for non-hostname purposes), and finally the label separator (which is
+written as '.' in presentation-format domain names, and is the label
+length in wire format). This adds up to 39 common characters.
+
+A bitmap for 39 common characters is small enough to fit into a
+qp-trie index word, so we can (in principle) walk down the trie one
+character at a time, as if the radix were 256, but without needing a
+multi-word bitmap.
+
+However, DNS names can contain arbitrary bytes. To support the 200-ish
+unusual characters we use an escaping scheme, described in more detail
+below. This requires a few more bits in the bitmap to represent the
+escape characters, so our radix ends up being 47. This still fits into
+the 64-bit index word, so we get the compactness of a qp-trie but with
+faster byte-at-a-time lookups for DNS names that use common hostname
+characters.
+
+You can also use other kinds of keys with BIND's DNS qp-trie, provided
+they are not too long. You must provide your own key preparation
+function, e.g. for uniform binary keys you might extract 5-bit nybbles
+to get a radix-32 trie.
+
+
+preparing a lookup key
+----------------------
+
+A DNS name needs to be rearranged to use it as a qp-trie key, so that
+the lexical order of rearranged keys matches the canonical DNS name
+order specified in RFC 4034 section 6.1:
+
+ * reverse the order of the labels so that they run from most
+ significant to least significant, left to right (but the
+ characters in each label remain in the same order)
+
+ * convert uppercase ASCII letters to lowercase ASCII
+
+ * change the label separators to a non-byte value that sorts before
+ the zero byte
+
+For qp-trie lookups there are a couple of extra steps:
+
+ * There is an escaping mechanism to support DNS names that use
+ unusual characters. Common characters use one byte in the lookup
+ key, but unusual characters are expanded to two bytes. To preserve
+ the correct lexical order, there are different escape bytes
+ depending on how the unusual character sorts relative to the
+ common hostname characters.
+
+ * Characters in the DNS name need to be converted to bitmap
+ positions. This is done at the same time as preparing the lookup
+ key, to move work out of the inner trie traversal loop.
+
+These 5 transformations can be done in a single pass over a DNS name
+using a single lookup table. The transformed name is usually the
+same length (up to 2x longer if it contains unusual characters).
+
+You can use absolute or relative DNS names as keys, without ambiguity
+(provided you have some way of knowing what names are relative to).
+When converted to a lookup key, absolute names start with a non-byte
+value representing the root, and relative names do not.
+
+Lookup keys are ephemeral, allocated on the stack during a lookup.
+
+See under _"converting DNS names to trie keys"_ in `lib/dns/qp.c`
+for how this is implemented in BIND.
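
The first three transformations can be sketched in a few lines of C.
This is a deliberately simplified illustration, not BIND's conversion
code: it takes a well-formed absolute name in presentation format,
assumes there are no escapes or unusual characters, and does not map
characters to bitmap positions. The names `KEY_SEP`, `KEY_BYTE()`, and
`name_to_key()` are hypothetical, as is the choice of `uint16_t` key
elements, which is just one way to get a separator that sorts before
the zero byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_SEP 0 /* hypothetical non-byte separator, sorts before 0x00 */
#define KEY_BYTE(c) ((uint16_t)((unsigned char)(c) + 1))

/*
 * Convert an absolute name such as "www.Example." into a key with the
 * labels reversed and lowercased. Returns the key length, or 0 if the
 * key does not fit in `max` elements.
 */
static size_t
name_to_key(const char *name, uint16_t *key, size_t max) {
	size_t end = strlen(name);
	size_t len = 0;

	assert(end > 0 && name[end - 1] == '.');
	if (max == 0) {
		return 0;
	}
	key[len++] = KEY_SEP; /* the root label comes first */
	end--;                /* drop the trailing dot */

	while (end > 0) {
		/* find the start of the rightmost remaining label */
		size_t start = end;
		while (start > 0 && name[start - 1] != '.') {
			start--;
		}
		if (len + (end - start) + 1 > max) {
			return 0;
		}
		for (size_t i = start; i < end; i++) {
			key[len++] = KEY_BYTE(tolower((unsigned char)name[i]));
		}
		key[len++] = KEY_SEP;
		end = (start > 0) ? start - 1 : 0;
	}
	return len;
}
```

For `"www.Example."` this produces 13 elements: a separator for the
root, the seven characters of `example`, a separator, `www`, and a
final separator.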
+
+
+node layout
+-----------
+
+Earlier I said that the original qp-trie node layout consists of two
+words: one 64-bit word for the branch index, and one pointer-sized
+word. BIND's qp-trie uses a layout that is smaller on 64-bit systems:
+one 64-bit word and one 32-bit word.
+
+A branch node contains
+
+ * a branch/leaf tag bit
+
+ * a 47-wide bitmap, with a bit for each common hostname character
+ and each escape character
+
+ * a 9-bit key offset, enough to count twice the length of a DNS
+ name
+
+ * a 32-bit "twigs" reference to the packed vector of child nodes;
+ these references are described in more detail below
+
+A leaf node contains a pointer value (which we assume to be 64 bits)
+and a 32-bit integer value. The branch/leaf tag is smuggled into the
+low-order bit of the pointer value, so the pointer value must have
+large enough alignment. (This requirement is checked when a leaf is
+added to the trie.) Apart from that, the meaning of leaf values
+is entirely under control of the qp-trie user.
+
+When constructing a qp-trie the user provides a collection of method
+pointers. The qp-trie code calls these methods when it needs to do
+anything that needs to look into a leaf value, such as extracting the
+key.
+
+See under _"interior node basics"_ and _"interior node constructors
+and accessors"_ in `lib/dns/qp_p.h` for the implementation.
+
+
+example
+-------
+
+Consider a small zone:
+
+    example. ; apex
+    mail.example. ; IMAP server
+    mx.example. ; incoming mail
+    www.example. ; web load balancer
+    www1.example. ; back-end web servers
+    www2.example.
+
+It becomes a qp-trie as follows. I am writing bitmaps as lists of
+characters representing the bits that are set, with `'.'` for label
+separators. I have used arbitrary names for the addresses of the twigs
+vectors.
+
+    root = (qp_node){
+        tag: BRANCH,
+        offset: 9,
+        bitmap: [ '.', 'm', 'w' ],
+        twigs: &one,
+    };
+
+Note that the offset skips the root zone, the zone name, and the apex
+label separator. If the offset is beyond the end of the key, the byte
+value is the label separator.
+
+    one = (qp_node[3]){
+        {
+            tag: LEAF,
+            key: "example.",
+        },
+        {
+            tag: BRANCH,
+            offset: 10,
+            bitmap: [ 'a', 'x' ],
+            twigs: &two,
+        },
+        {
+            tag: BRANCH,
+            offset: 12,
+            bitmap: [ '.', '1', '2' ],
+            twigs: &three,
+        },
+    };
+
+This twigs vector has an element for the zone apex, and the two
+different initial characters of the subdomains.
+
+The mail servers differ in the next character, so the offset bumps from
+9 to 10 without skipping any characters. The web servers all start with
+www, so the offset bumps from 9 to 12, skipping the common prefix.
+
+    two = (qp_node[2]){
+        {
+            tag: LEAF,
+            key: "mail.example.",
+        },
+        {
+            tag: LEAF,
+            key: "mx.example.",
+        },
+    };
+
+The different lengths of `mail` and `mx` don't matter: we implicitly
+skip to the end of the key when we reach a leaf node.
+
+    three = (qp_node[3]){
+        {
+            tag: LEAF,
+            key: "www.example.",
+        },
+        {
+            tag: LEAF,
+            key: "www1.example.",
+        },
+        {
+            tag: LEAF,
+            key: "www2.example.",
+        },
+    };
+
+When the trie includes labels of differing lengths, we can have a node
+that chooses between a label separator and characters from the longer
+labels. This is slightly different from the root node, which tested the
+first character of the label; here we are testing the last character.
+
+
+memory management for concurrency
+---------------------------------
+
+The following sections discuss how the qp-trie supports concurrency.
+
+The requirement is to support many concurrent read threads, and
+allow updates to occur without blocking readers (or blocking readers
+as little as possible).
+
+The strategy is to use "copy-on-write", that is, when an update
+needs to alter the trie it makes a copy of the parts that it needs
+to change, so that concurrent readers can continue to use the
+original. (It is analogous to multiversion concurrency in databases
+such as PostgreSQL, where copy-on-write uses a write-ahead log.)
+
+Software that uses copy-on-write needs some mechanism for clearing
+away old versions that are no longer in use. (For example, VACUUM in
+PostgreSQL.) The qp-trie code uses a custom allocator with a simple
+garbage collector; as well as supporting concurrency, the qp-trie's
+memory manager makes tries smaller and faster.
+
+
+allocation
+----------
+
+A qp-trie is relatively demanding on its allocator. Twigs vectors
+can be lots of different sizes, and every mutation of the trie
+requires an alloc and/or a free.
+
+Older versions of the qp-trie code used the system allocator. Many
+allocators (such as `jemalloc`) segregate the heap into different
+size classes, so that each chunk of memory is dedicated to
+allocations of the same size. While this memory layout provides good
+locality when objects of the same type have the same size, it tends
+to scatter the interior nodes of a qp-trie all over the address space.
+
+BIND's qp-trie code uses a "bump allocator" for its interior nodes,
+which is one of the simplest and fastest possible: an allocation
+usually only requires incrementing a pointer and checking if it has
+reached a limit. (If the check fails the allocator goes into its
+slow path.) Allocations have good locality because they write
+sequentially into memory. (A bit like a write-ahead log.)
+
+Bump allocators need reasonably large contiguous chunks of empty
+memory to make the most of their efficiency, so they are often
+coupled with some kind of compacting garbage collector, which
+defragments the heap to recover free space.
+
+See `alloc_twigs()` in `lib/dns/qp.c` for the bump allocator fast
+path.
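
For readers who have not met one before, a bump allocator fast path
might look roughly like the following. This is a sketch under stated
assumptions, not a copy of `alloc_twigs()`: the chunk capacity, the
`struct chunk` layout, and the names `bump_alloc()` and `ALLOC_FAIL`
are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_NODES 1024      /* hypothetical chunk capacity, in cells */
#define ALLOC_FAIL UINT32_MAX /* hypothetical "take the slow path" result */

struct chunk {
	uint32_t used;               /* index of the next free cell */
	uint64_t cells[CHUNK_NODES]; /* storage handed out sequentially */
};

/*
 * Reserve `count` adjacent cells and return the index of the first one.
 * The fast path is one comparison and one addition.
 */
static uint32_t
bump_alloc(struct chunk *c, uint32_t count) {
	assert(count > 0 && count <= CHUNK_NODES);
	if (CHUNK_NODES - c->used < count) {
		return ALLOC_FAIL; /* the caller must obtain a fresh chunk */
	}
	uint32_t cell = c->used;
	c->used += count;
	return cell;
}
```

Successive allocations land next to each other in memory, which is
where the good locality comes from. Nothing is ever freed
individually; that is the garbage collector's job.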
+
+
+garbage collection
+------------------
+
+[The Garbage Collection Handbook](https://gchandbook.org/) says
+there are four basic kinds of automatic memory management.
+
+Reference counting is used by scripting languages such as Perl and
+Python, and also for manual memory management such as in operating
+system kernels and BIND.
+
+To avoid writing a custom allocator, I previously tried adapting the
+qp-trie code to use refcounting to support copy-on-write, but I was
+not very happy with the complexity of the implementation, and I
+thought it was ugly that I needed to modify refcounts in nodes that
+were logically read-only.
+
+(Two other kinds of GC are mark-sweep and mark-compact. Both of them
+have a similar disadvantage to refcounting: a simple GC mark phase
+modifies nodes that are logically read-only. And mark-sweep leaves
+memory fragmented so it does not support a bump allocator.)
+
+The fourth kind is copying garbage collection. It works well with a
+bump allocator, because copying the data structure using a bump
+allocator in the most obvious way naturally compacts the data. And
+the copying phase of the GC can run concurrently with readers
+without interference.
+
+BIND's qp-trie code uses a copying garbage collector only for its
+interior nodes. The value objects that are attached to the leaves of
+the trie are allocated by `isc_mem` and use reference counting like
+the rest of BIND.
+
+See `compact()` in `lib/dns/qp.c` for the copying phase of the
+garbage collector. Reference counting for value objects is handled
+by the `attach()` and `detach()` qp-trie methods.
+
+
+memory layout
+-------------
+
+BIND's qp-trie code organizes its memory as a collection of "chunks",
+each of which is a few pages in size and large enough to hold a few
+thousand nodes.
+
+Most memory management is per-chunk: obtaining memory from the
+system allocator and returning it; keeping track of which chunks are
+in use by readers, and which chunks can be mutated; and counting
+whether chunks are fragmented enough to need garbage collection.
+
+As noted above, we also use the chunk-based layout to reduce the size
+of interior nodes. Instead of using a native pointer (typically 64
+bits) to refer to a node, we use a 32-bit integer containing the chunk
+number and the position of the node in the chunk. This reduces the
+memory used by interior nodes by 25%.
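
A reference of this kind could be packed and unpacked as in the sketch
below. The split of the 32 bits (10 bits for the position, 22 for the
chunk number) is an assumption chosen for the example, and so are the
names `node_ref`, `make_ref()`, `ref_chunk()`, `ref_cell()`, and
`ref_ptr()`. The sketch also stores the 64-bit index word as two 32-bit
halves so that the struct has no padding; check `lib/dns/qp_p.h` for
what BIND really does.

```c
#include <assert.h>
#include <stdint.h>

#define CELL_BITS 10 /* assumed: 1024 cells per chunk */
#define CELL_MASK ((UINT32_C(1) << CELL_BITS) - 1)

typedef uint32_t node_ref;

/* 12 bytes per node, versus 16 for a 64-bit word plus a native pointer. */
struct node {
	uint32_t lo, hi; /* the 64-bit index word, in two halves */
	node_ref twigs;  /* compact reference instead of a pointer */
};

/* Combine a chunk number and a position within the chunk. */
static node_ref
make_ref(uint32_t chunk, uint32_t cell) {
	assert(chunk < (UINT32_C(1) << (32 - CELL_BITS)));
	assert(cell <= CELL_MASK);
	return (chunk << CELL_BITS) | cell;
}

static uint32_t
ref_chunk(node_ref r) {
	return r >> CELL_BITS;
}

static uint32_t
ref_cell(node_ref r) {
	return r & CELL_MASK;
}

/* Turn a reference back into a pointer via the array of chunk bases. */
static struct node *
ref_ptr(struct node *const *chunks, node_ref r) {
	return &chunks[ref_chunk(r)][ref_cell(r)];
}
```

The price of the smaller node is one extra array lookup whenever a
reference is followed.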
+
+In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information
+about a trie's chunks. Most of the chunk handling code is in the
+_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`.
+
+
+lifecycle of value objects
+--------------------------
+
+A leaf node contains a pointer to a value object that is not managed
+by the qp-trie garbage collector. Instead, the user provides
+`attach` and `detach` methods that the qp-trie code calls to update
+the reference counts in the value objects.
+
+Value object reference counts do not indicate whether the object is
+mutable: its refcount can be 1 while it is only in use by readers
+(and must be left unchanged), or newly created by a writer (and
+therefore mutable).
+
+So, callers must themselves keep track of whether leaf objects are newly
+inserted (and therefore mutable) or not. XXXFANF this might change, by
+adding special lookup functions that return whether leaf objects are
+mutable - see the "todo" in `include/dns/qp.h`.
+
+
+locking and RCU
+---------------
+
+The Linux kernel has a collection of copy-on-write schemes collectively
+called read-copy-update; there is also https://liburcu.org/ for RCU in
+userspace. RCU is attractively speedy: readers can proceed without
+blocking at all; writers can proceed concurrently with readers, and
+updates can be committed without blocking. A commit is just a single
+atomic pointer update. RCU only requires writers to block when waiting
+for a "grace period" while older readers complete their critical
+sections, after which the writer can free memory that is no longer in
+use. Writers must also block on a mutex to ensure there is only one
+writer at a time.
+
+The qp-trie concurrency strategy is designed to be able to use RCU, but
+RCU is not required. Instead of RCU we can use a reader-writer lock.
+This requires readers to block when a writer commits, which (in RCU
+style) just requires an atomic pointer swap. The rwlock also changes
+when writers must block: commits must wait for readers to exit their
+critical sections, but there is no further waiting to be able to release
+memory.
+
+In BIND, there are two kinds of reader: queries, which are relatively
+quick, and zone transfers, which are relatively slow. BIND's dbversion
+machinery allows updates to proceed while there are long-running zone
+transfers. RCU supports this without further machinery, but a
+reader-writer lock needs some help so that long-running readers can
+avoid blocking writers.
+
+To avoid blocking updates, long-running readers can take a snapshot of a
+qp-trie, which only requires copying the allocator's chunk array. After
+a writer commits, it does not release memory if there are any
+snapshots. Instead, chunks that are no longer needed by the latest
+version of the trie are stashed on a list to be released later,
+analogous to RCU waiting for a grace period.
+
+The locking occurs only in the functions under _"read-write
+transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`.
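
The "single atomic pointer update" at the heart of a commit can be
sketched with C11 atomics. This only illustrates the pointer flip: the
types `trie_version` and `multi` and the functions `reader_get()` and
`writer_commit()` are hypothetical, and the sketch leaves out the hard
part, which is knowing when the old version can be reclaimed (an RCU
grace period, or waiting on the rwlock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one immutable version of the trie. */
struct trie_version {
	int generation;
};

struct multi {
	_Atomic(struct trie_version *) current; /* what readers see */
};

/* Readers load the current version once and use it consistently. */
static const struct trie_version *
reader_get(struct multi *m) {
	return atomic_load_explicit(&m->current, memory_order_acquire);
}

/*
 * The writer publishes a fully built version with one atomic exchange.
 * The previous version is returned; it must not be freed until no
 * reader can still be using it (not shown here).
 */
static struct trie_version *
writer_commit(struct multi *m, struct trie_version *next) {
	assert(next != NULL);
	return atomic_exchange_explicit(&m->current, next,
					memory_order_acq_rel);
}
```

A reader that called `reader_get()` before the commit carries on with
the old version, which is why copy-on-write never modifies nodes that
an earlier version can reach.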
+
+
+immutability and copy-on-write
+------------------------------
+
+A qp-trie has a `generation` counter which is incremented by each
+write transaction. We keep track of which generation each chunk was
+created in; only chunks created in the current generation are
+mutable, because older chunks may be in use by concurrent readers.
+
+This logic is implemented by `chunk_alloc()` and `chunk_mutable()`
+in `lib/dns/qp.c`.
+
+The `make_twigs_mutable()` function ensures that a node is mutable,
+copying it if necessary.
+
+The chunk arrays are a mixture of mutable and immutable. Pointers to
+immutable chunks are immutable; new chunks can be assigned to unused
+entries; and entries are cleared when it is safe to reclaim the chunks
+they refer to. If the chunk arrays need to be expanded, the existing
+arrays are retained for use by readers, and the writer uses the
+expanded arrays (see `alloc_slow()`). The old arrays are cleaned up
+after the writer commits.
+
+
+update transactions
+-------------------
+
+A typical heavy-weight `update` transaction comprises:
+
+ * make a copy of the chunk arrays in case we need to roll back
+
+ * get a freshly allocated chunk where new nodes or copied nodes
+ can be written
+
+ * make any changes that are required; nodes in old chunks are
+ copied to the new space first; new nodes are modified in place
+ to avoid creating unnecessary garbage
+
+ * when the updates are finished, and before committing, run the
+ garbage collector to clear out chunks that were fragmented by the
+ update
+
+ * shrink the allocation chunk to eliminate unused space
+
+ * commit the update by flipping the root pointer of the trie; this
+ is the only point that needs a multithreading interlock
+
+ * free any chunks that were emptied by the garbage collector
+
+A lightweight `write` transaction is similar, except that:
+
+ * rollback is not supported
+
+ * any existing allocation chunk is reused if possible
+
+ * the garbage collector is not run before committing
+
+ * the allocation chunk is not shrunk
+
+
+testing strategies
+------------------
+
+The main qp-trie test is in `tests/dns/qpmulti_test.c`. This uses
+randomized testing of the transactional API, with a lot of consistency
+checking to detect bugs.
+
+There are also a couple of fuzzers, which aim to benefit from
+coverage-guided exploration of the test space and test minimization.
+In `fuzz/dns_qp.c` we treat the fuzzer input as a bytecode to exercise
+the single-threaded API, and `fuzz/dns_qpkey_name.c` checks conversion
+from DNS names to lookup keys.
+
+In `tests/bench` there are a few benchmarks. `load-names` does a very
+basic comparison between BIND's hash table, red-black tree, and
+qp-trie. `qpmulti` checks multicore performance of the transactional
+API (similar to `qpmulti_test` but without the consistency checking).
+And `qp-dump` is a utility for printing out the contents of a qp-trie.
+
+John Regehr has some nice essays about testing data structures:
+
+ * Levels of fuzzing: https://blog.regehr.org/archives/1039
+
+ (how much semantic knowledge does your fuzzer have?)
+
+ * Testing with small capacities: https://blog.regehr.org/archives/1138
+
+ (I need to be able to change the chunk size)
+
+ * Write fuzzable code: https://blog.regehr.org/archives/1687
+
+ * Oracles for random testing: https://blog.regehr.org/archives/856
+
+
+warning: generational collection
+--------------------------------
+
+The "generational hypothesis" is that most allocations have a short
+lifetime, so it is profitable for a garbage collector to split its
+heap into a number of generations. The youngest generation is where
+allocations happen; it typically uses a bump allocator, and when the
+allocation pointer reaches its limit, the youngest generation's
+contents are copied to the second generation. The hypothesis is that
+only a small fraction of the youngest generation will still be live
+when the GC runs, so this copy will not take much time or space.
+
+For a qp-trie the truth of this hypothesis depends on the order in
+which keys are added or removed. It may be true if there is good
+locality, for example, adding keys in lexicographic order, but not in
+general.
+
+When a qp-trie is mutated, only one node needs to be altered, near the
+leaf that is added or removed. Nodes near the root of the trie tend to
+be more stable and long-lived. However, during a copy-on-write
+transaction, the path from the root to an altered leaf must be copied,
+so nodes near the root are no longer stable and long-lived. They may
+become stable in a long transaction, but that isn't guaranteed.
+
+So the idea of generational garbage collection seems to be unhelpful
+for a qp-trie.
include/dns/order.h \
include/dns/peer.h \
include/dns/private.h \
+ include/dns/qp.h \
include/dns/rbt.h \
include/dns/rcode.h \
include/dns/rdata.h \
cache.c \
callbacks.c \
catz.c \
+ client.c \
clientinfo.c \
compress.c \
db.c \
order.c \
peer.c \
private.c \
+ qp.c \
+ qp_p.h \
rbt.c \
rbtdb.h \
rbtdb.c \
transport.c \
tkey.c \
tsig.c \
+ tsig_p.h \
ttl.c \
update.c \
validator.c \
view.c \
xfrin.c \
zone.c \
+ zone_p.h \
zoneverify.c \
zonekey.c \
- zt.c \
- client.c \
- tsig_p.h \
- zone_p.h
+ zt.c
if HAVE_GSSAPI
libdns_la_SOURCES += \
#define DNS_LOGMODULE_DYNDB (&dns_modules[30])
#define DNS_LOGMODULE_DNSTAP (&dns_modules[31])
#define DNS_LOGMODULE_SSU (&dns_modules[32])
+#define DNS_LOGMODULE_QP (&dns_modules[33])
ISC_LANG_BEGINDECLS
--- /dev/null
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+#pragma once
+
+/*
+ * A qp-trie is a kind of key -> value map, supporting lookups that are
+ * aware of the lexicographic order of keys.
+ *
+ * Keys are `dns_qpkey_t`, which is a string-like thing, usually created
+ * from a DNS name. You can use both relative and absolute DNS names as
+ * keys.
+ *
+ * Leaf values are a pair of a `void *` pointer and a `uint32_t`
+ * (because that is what fits inside an internal qp-trie leaf node).
+ *
+ * The trie does not store keys; instead keys are derived from leaf values
+ * by calling a method provided by the user.
+ *
+ * There are a few flavours of qp-trie.
+ *
+ * The basic `dns_qp_t` supports single-threaded read/write access.
+ *
+ * A `dns_qpmulti_t` is a wrapper that supports multithreaded access.
+ * There can be many concurrent readers and a single writer. Writes are
+ * transactional, and support multi-version concurrency.
+ *
+ * The concurrency strategy uses copy-on-write. When making changes during
+ * a transaction, the caller must not modify leaf values in place, but
+ * instead delete the old leaf from the trie and insert a replacement. Leaf
+ * values have reference counts, which will indicate when the old leaf
+ * value can be freed after it is no longer needed by readers using an old
+ * version of the trie.
+ *
+ * For fast concurrent reads, call `dns_qpmulti_query()` to get a
+ * `dns_qpread_t`. Readers can access a single version of the trie between
+ * write commits. Most write activity is not blocked by readers, but reads
+ * must finish before a write can commit (a read-write lock blocks
+ * commits).
+ *
+ * For long-running reads that need a stable view of the trie, while still
+ * allowing commits to proceed, call `dns_qpmulti_snapshot()` to get a
+ * `dns_qpsnap_t`. It briefly gets the write mutex while creating the
+ * snapshot, which requires allocating a copy of some of the trie's
+ * metadata. A snapshot is for relatively heavy long-running read-only
+ * operations such as zone transfers.
+ *
+ * While snapshots exist, a qp-trie cannot reclaim memory: it does not
+ * retain detailed information about which memory is used by which
+ * snapshots, so it pessimistically retains all memory that might be
+ * used by old versions of the trie.
+ *
+ * You can start one read-write transaction at a time using
+ * `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you
+ * get a `dns_qp_t` that can be modified like a single-threaded trie,
+ * without affecting other read-only query or snapshot users of the
+ * `dns_qpmulti_t`. Committing a transaction only blocks readers
+ * briefly when flipping the active readonly `dns_qp_t` pointer.
+ *
+ * "Update" transactions are heavyweight. They allocate working memory to
+ * hold modifications to the trie, and compact the trie before committing.
+ * For extra space savings, a partially-used allocation chunk is shrunk to
+ * the smallest size possible. Unlike "write" transactions, an "update"
+ * transaction can be rolled back instead of committed. (Update
+ * transactions are intended for things like authoritative zones, where it
+ * is important to keep the per-trie memory overhead low because there can
+ * be a very large number of them.)
+ *
+ * "Write" transactions are more lightweight: they skip the allocation and
+ * compaction at the start and end of the transaction. (Write transactions
+ * are intended for frequent small changes, as in the DNS cache.)
+ */
+
+/***********************************************************************
+ *
+ * types
+ */
+
+#include <isc/attributes.h>
+
+#include <dns/types.h>
+
+/*%
+ * A `dns_qp_t` supports single-threaded read/write access.
+ */
+typedef struct dns_qp dns_qp_t;
+
+/*%
+ * A `dns_qpmulti_t` supports multi-version concurrent reads and transactional
+ * modification.
+ */
+typedef struct dns_qpmulti dns_qpmulti_t;
+
+/*%
+ * A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`.
+ */
+typedef struct dns_qpread dns_qpread_t;
+
+/*%
+ * A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`.
+ */
+typedef struct dns_qpsnap dns_qpsnap_t;
+
+/*
+ * The read-only qp-trie functions can work on either of the read-only
+ * qp-trie types or the general-purpose read-write `dns_qp_t`. They
+ * rely on the fact that all the `dns_qpreadable_t` structures start
+ * with a `dns_qpread_t`.
+ */
+typedef union dns_qpreadable {
+ dns_qpread_t *qpr;
+ dns_qpsnap_t *qps;
+ dns_qp_t *qpt;
+} dns_qpreadable_t __attribute__((__transparent_union__));
+
+#define dns_qpreadable_cast(qp) ((qp).qpr)
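The transparent-union trick described in the comment above can be illustrated in isolation. This is a minimal sketch, not BIND code: the `reader`, `snap`, and `trie` structs, `readable_t`, and `get_magic()` are hypothetical stand-ins, and it assumes a compiler that supports the GCC/Clang `__transparent_union__` attribute. It also assumes, as the comment does, that every readable structure begins with the common reader structure.

```c
#include <assert.h>

/* Hypothetical stand-ins: each readable type starts with a `struct reader`. */
struct reader {
	int magic;
};
struct snap {
	struct reader base; /* must be the first member */
	int extra;
};
struct trie {
	struct reader base; /* must be the first member */
	int more;
};

/*
 * With __transparent_union__, a function taking `readable_t` can be
 * called with any of the member pointer types, without a cast.
 */
typedef union readable {
	struct reader *r;
	struct snap *s;
	struct trie *t;
} readable_t __attribute__((__transparent_union__));

/* View whichever pointer was passed as a pointer to the common prefix. */
#define readable_cast(x) ((x).r)

static int
get_magic(readable_t x) {
	return (readable_cast(x)->magic);
}
```

A caller can then write `get_magic(&some_snap)` or `get_magic(&some_trie)` directly; the compiler accepts either pointer type for the `readable_t` parameter.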
+
+/*%
+ * A trie lookup key is a small array, allocated on the stack during trie
+ * searches. Keys are usually created on demand from DNS names using
+ * `dns_qpkey_fromname()`, but in principle you can define your own
+ * functions to convert other types to trie lookup keys.
+ *
+ * A domain name can be up to 255 bytes. When converted to a key, each
+ * character in the name corresponds to one byte in the key if it is a
+ * common hostname character; otherwise unusual characters are escaped,
+ * using two bytes in the key. So we allow keys to be up to 512 bytes.
+ * (The actual max is (255 - 5) * 2 + 6 == 506)
+ *
+ * Every byte of a key must be greater than 0 and less than 48. Elements
+ * after the end of the key are treated as having the value 1.
+ */
+typedef uint8_t dns_qpkey_t[512];
+
+/*%
+ * These leaf methods allow the qp-trie code to call back to the code
+ * responsible for the leaf values that are stored in the trie. The
+ * methods are provided for a whole trie when the trie is created.
+ *
+ * The qp-trie is also given a context pointer that is passed to the
+ * methods, so the methods know about the trie's context as well as a
+ * particular leaf value.
+ *
+ * The `attach` and `detach` methods adjust reference counts on value
+ * objects. They support copy-on-write and safe memory reclamation
+ * needed for multi-version concurrency.
+ *
+ * Note: When a value object reference count is greater than one, the
+ * object is in use by concurrent readers so it must not be modified. A
+ * refcount equal to one does not indicate whether or not the object is
+ * mutable: its refcount can be 1 while it is only in use by readers (and
+ * must be left unchanged), or newly created by a writer (and therefore
+ * mutable).
+ *
+ * The `makekey` method fills in a `dns_qpkey_t` corresponding to a
+ * value object stored in the qp-trie. It returns the length of the
+ * key. This method will typically call dns_qpkey_fromname() with a
+ * name stored in the value object.
+ *
+ * For logging and tracing, the `triename` method copies a human-
+ * readable identifier into `buf` which has max length `size`.
+ */
+typedef struct dns_qpmethods {
+ void (*attach)(void *ctx, void *pval, uint32_t ival);
+ void (*detach)(void *ctx, void *pval, uint32_t ival);
+ size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval,
+ uint32_t ival);
+ void (*triename)(void *ctx, char *buf, size_t size);
+} dns_qpmethods_t;
+
+/*%
+ * Buffers for use by the `triename()` method need to be large enough
+ * to hold a zone name and a few descriptive words.
+ */
+#define DNS_QP_TRIENAME_MAX 300
+
+/*%
+ * A container for the counters returned by `dns_qp_memusage()`
+ */
+typedef struct dns_qp_memusage {
+ void *ctx; /*%< qp-trie method context */
+ size_t leaves; /*%< values in the trie */
+ size_t live; /*%< nodes in use */
+ size_t used; /*%< allocated nodes */
+ size_t hold; /*%< nodes retained for readers */
+ size_t free; /*%< nodes to be reclaimed */
+ size_t node_size; /*%< in bytes */
+ size_t chunk_size; /*%< nodes per chunk */
+ size_t chunk_count; /*%< allocated chunks */
+ size_t bytes; /*%< total memory in chunks and metadata */
+} dns_qp_memusage_t;
+
+/***********************************************************************
+ *
+ * functions - create, destroy, enquire
+ */
+
+void
+dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+ dns_qp_t **qptp);
+/*%<
+ * Create a single-threaded qp-trie.
+ *
+ * Requires:
+ * \li `mctx` is a pointer to a valid memory context.
+ * \li all the methods are non-NULL
+ * \li `qptp != NULL && *qptp == NULL`
+ *
+ * Ensures:
+ * \li `*qptp` is a pointer to a valid single-threaded qp-trie
+ */
+
+void
+dns_qp_destroy(dns_qp_t **qptp);
+/*%<
+ * Destroy a single-threaded qp-trie.
+ *
+ * Requires:
+ * \li `qptp != NULL`
+ * \li `*qptp` is a pointer to a valid single-threaded qp-trie
+ *
+ * Ensures:
+ * \li all memory allocated by the qp-trie has been released
+ * \li `*qptp` is NULL
+ */
+
+void
+dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+ dns_qpmulti_t **qpmp);
+/*%<
+ * Create a multi-threaded qp-trie.
+ *
+ * Requires:
+ * \li `mctx` is a pointer to a valid memory context.
+ * \li all the methods are non-NULL
+ * \li `qpmp != NULL && *qpmp == NULL`
+ *
+ * Ensures:
+ * \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
+ */
+
+void
+dns_qpmulti_destroy(dns_qpmulti_t **qpmp);
+/*%<
+ * Destroy a multi-threaded qp-trie.
+ *
+ * Requires:
+ * \li `qpmp != NULL`
+ * \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
+ * \li there are no write or update transactions in progress
+ * \li no snapshots exist
+ *
+ * Ensures:
+ * \li all memory allocated by the qp-trie has been released
+ * \li `*qpmp` is NULL
+ */
+
+void
+dns_qp_compact(dns_qp_t *qp);
+/*%<
+ * Defragment the entire qp-trie and release unused memory.
+ *
+ * When modifications make a trie too fragmented, it is automatically
+ * compacted. Automatic compaction avoids compacting chunks that are not
+ * fragmented to save time, but this function compacts the entire trie to
+ * defragment it as much as possible.
+ *
+ * This function can be used with a single-threaded qp-trie and during a
+ * transaction on a multi-threaded trie.
+ *
+ * Requires:
+ * \li `qp` is a pointer to a valid qp-trie
+ */
+
+void
+dns_qp_gctime(uint64_t *compact_us, uint64_t *recover_us,
+ uint64_t *rollback_us);
+/*%<
+ * Get the total times spent on garbage collection in microseconds.
+ *
+ * These counters are global, covering every qp-trie in the program.
+ *
+ * XXXFANF This is a placeholder until we can record times in histograms.
+ */
+
+dns_qp_memusage_t
+dns_qp_memusage(dns_qp_t *qp);
+/*%<
+ * Get the memory counters from a qp-trie
+ *
+ * Requires:
+ * \li `qp` is a pointer to a valid qp-trie
+ *
+ * Returns:
+ * \li a `dns_qp_memusage_t` structure described above
+ */
+
+/***********************************************************************
+ *
+ * functions - search, modify
+ */
+
+/*
+ * XXXFANF todo, based on what we discover BIND needs
+ *
+ * fancy searches: longest match, lexicographic predecessor,
+ * etc.
+ *
+ * do we need specific lookup functions to find out if the
+ * returned value is readonly or mutable?
+ *
+ * richer modification such as dns_qp_replace{key,name}
+ *
+ * iteration - probably best to put an explicit stack in the iterator,
+ * cf. rbtnodechain
+ */
+
+size_t
+dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name);
+/*%<
+ * Convert a DNS name into a trie lookup key.
+ *
+ * Requires:
+ * \li `name` is a pointer to a valid `dns_name_t`
+ *
+ * Returns:
+ * \li the length of the key
+ */
+
+isc_result_t
+dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t searchk, size_t searchl,
+ void **pval_r, uint32_t *ival_r);
+/*%<
+ * Find a leaf in a qp-trie that matches the given key
+ *
+ * The leaf values are assigned to `*pval_r` and `*ival_r`
+ *
+ * Requires:
+ * \li `qpr` is a pointer to a readable qp-trie
+ * \li `pval_r != NULL`
+ * \li `ival_r != NULL`
+ *
+ * Returns:
+ * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li ISC_R_SUCCESS if the leaf was found
+ */
+
+isc_result_t
+dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r,
+ uint32_t *ival_r);
+/*%<
+ * Find a leaf in a qp-trie that matches the given DNS name
+ *
+ * The leaf values are assigned to `*pval_r` and `*ival_r`
+ *
+ * Requires:
+ * \li `qpr` is a pointer to a readable qp-trie
+ * \li `name` is a pointer to a valid `dns_name_t`
+ * \li `pval_r != NULL`
+ * \li `ival_r != NULL`
+ *
+ * Returns:
+ * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li ISC_R_SUCCESS if the leaf was found
+ */
+
+isc_result_t
+dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival);
+/*%<
+ * Insert a leaf into a qp-trie
+ *
+ * Requires:
+ * \li `qp` is a pointer to a valid qp-trie
+ * \li `pval != NULL`
+ * \li `alignof(pval) > 1`
+ *
+ * Returns:
+ * \li ISC_R_EXISTS if the trie already has a leaf with the same key
+ * \li ISC_R_SUCCESS if the leaf was added to the trie
+ */
+
+isc_result_t
+dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t key, size_t len);
+/*%<
+ * Delete a leaf from a qp-trie that matches the given key
+ *
+ * Requires:
+ * \li `qp` is a pointer to a valid qp-trie
+ *
+ * Returns:
+ * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li ISC_R_SUCCESS if the leaf was deleted from the trie
+ */
+
+isc_result_t
+dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name);
+/*%<
+ * Delete a leaf from a qp-trie that matches the given DNS name
+ *
+ * Requires:
+ * \li `qp` is a pointer to a valid qp-trie
+ * \li `name` is a pointer to a valid `dns_name_t`
+ *
+ * Returns:
+ * \li ISC_R_NOTFOUND if the trie has no leaf with a matching name
+ * \li ISC_R_SUCCESS if the leaf was deleted from the trie
+ */
+
+/***********************************************************************
+ *
+ * functions - transactions
+ */
+
+void
+dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+/*%<
+ * Start a lightweight (brief) read-only transaction
+ *
+ * This takes a read lock on `multi`'s rwlock that prevents
+ * transactions from committing.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qprp != NULL`
+ * \li `*qprp == NULL`
+ *
+ * Returns:
+ * \li `*qprp` is a pointer to a valid read-only qp-trie handle
+ */
+
+void
+dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+/*%<
+ * End a lightweight read transaction, i.e. release read lock
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qprp != NULL`
+ * \li `*qprp` is a read-only qp-trie handle obtained from `multi`
+ *
+ * Returns:
+ * \li `*qprp == NULL`
+ */
+
+void
+dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
+/*%<
+ * Start a heavyweight (long) read-only transaction
+ *
+ * This function briefly takes and releases the modification mutex
+ * while allocating a copy of the trie's metadata. While the snapshot
+ * exists it does not interfere with other read-only or read-write
+ * transactions on the trie, except that memory cannot be reclaimed.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qpsp != NULL`
+ * \li `*qpsp == NULL`
+ *
+ * Returns:
+ * \li `*qpsp` is a pointer to a snapshot obtained from `multi`
+ */
+
+void
+dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
+/*%<
+ * End a heavyweight read transaction
+ *
+ * If this is the last remaining snapshot belonging to `multi` then
+ * this function takes the modification mutex in order to free() any
+ * memory that is no longer in use.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qpsp != NULL`
+ * \li `*qpsp` is a pointer to a snapshot obtained from `multi`
+ *
+ * Returns:
+ * \li `*qpsp == NULL`
+ */
+
+void
+dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Start a heavyweight write transaction
+ *
+ * This style of transaction allocates a copy of the trie's metadata to
+ * support rollback, and it aims to minimize the memory usage of the
+ * trie between transactions. The trie is compacted when the transaction
+ * commits, and any partly-used chunk is shrunk to fit.
+ *
+ * During the transaction, the modification mutex is held.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qptp != NULL`
+ * \li `*qptp == NULL`
+ *
+ * Returns:
+ * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ */
+
+void
+dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Start a lightweight write transaction
+ *
+ * This style of transaction does not need extra allocations in addition
+ * to the ones required by insert and delete operations. It is intended
+ * for a large trie that gets frequent small writes, such as a DNS
+ * cache.
+ *
+ * During the transaction, the modification mutex is held.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qptp != NULL`
+ * \li `*qptp == NULL`
+ *
+ * Returns:
+ * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ */
+
+void
+dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Complete a modification transaction
+ *
+ * The commit itself only requires flipping the read pointer inside
+ * `multi` from the old version of the trie to the new version. This
+ * function takes a write lock on `multi`'s rwlock just long enough to
+ * flip the pointer. This briefly blocks `query` readers.
+ *
+ * This function releases the modification mutex after the post-commit
+ * memory reclamation is completed.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qptp != NULL`
+ * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ *
+ * Returns:
+ * \li `*qptp == NULL`
+ */
+
+void
+dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Abandon an update transaction
+ *
+ * This function reclaims the memory allocated during the transaction
+ * and releases the modification mutex.
+ *
+ * Requires:
+ * \li `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li `qptp != NULL`
+ * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ * \li `*qptp` was obtained from `dns_qpmulti_update()`
+ *
+ * Returns:
+ * \li `*qptp == NULL`
+ */
+
+/**********************************************************************/
* \#define to <dns/log.h>.
*/
isc_logmodule_t dns_modules[] = {
- { "dns/db", 0 }, { "dns/rbtdb", 0 },
- { "dns/rbt", 0 }, { "dns/rdata", 0 },
- { "dns/master", 0 }, { "dns/message", 0 },
- { "dns/cache", 0 }, { "dns/config", 0 },
- { "dns/resolver", 0 }, { "dns/zone", 0 },
- { "dns/journal", 0 }, { "dns/adb", 0 },
- { "dns/xfrin", 0 }, { "dns/xfrout", 0 },
- { "dns/acl", 0 }, { "dns/validator", 0 },
- { "dns/dispatch", 0 }, { "dns/request", 0 },
- { "dns/masterdump", 0 }, { "dns/tsig", 0 },
- { "dns/tkey", 0 }, { "dns/sdb", 0 },
- { "dns/diff", 0 }, { "dns/hints", 0 },
- { "dns/unused1", 0 }, { "dns/dlz", 0 },
- { "dns/dnssec", 0 }, { "dns/crypto", 0 },
- { "dns/packets", 0 }, { "dns/nta", 0 },
- { "dns/dyndb", 0 }, { "dns/dnstap", 0 },
- { "dns/ssu", 0 }, { NULL, 0 }
+ { "dns/db", 0 }, { "dns/rbtdb", 0 }, { "dns/rbt", 0 },
+ { "dns/rdata", 0 }, { "dns/master", 0 }, { "dns/message", 0 },
+ { "dns/cache", 0 }, { "dns/config", 0 }, { "dns/resolver", 0 },
+ { "dns/zone", 0 }, { "dns/journal", 0 }, { "dns/adb", 0 },
+ { "dns/xfrin", 0 }, { "dns/xfrout", 0 }, { "dns/acl", 0 },
+ { "dns/validator", 0 }, { "dns/dispatch", 0 }, { "dns/request", 0 },
+ { "dns/masterdump", 0 }, { "dns/tsig", 0 }, { "dns/tkey", 0 },
+ { "dns/sdb", 0 }, { "dns/diff", 0 }, { "dns/hints", 0 },
+ { "dns/unused1", 0 }, { "dns/dlz", 0 }, { "dns/dnssec", 0 },
+ { "dns/crypto", 0 }, { "dns/packets", 0 }, { "dns/nta", 0 },
+ { "dns/dyndb", 0 }, { "dns/dnstap", 0 }, { "dns/ssu", 0 },
+ { "dns/qp", 0 }, { NULL, 0 },
};
isc_log_t *dns_lctx = NULL;
--- /dev/null
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+/*
+ * For an overview, see doc/design/qp-trie.md
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#if FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
+#include <sys/mman.h>
+#include <unistd.h>
+#endif
+
+#include <isc/atomic.h>
+#include <isc/buffer.h>
+#include <isc/magic.h>
+#include <isc/mem.h>
+#include <isc/mutex.h>
+#include <isc/refcount.h>
+#include <isc/result.h>
+#include <isc/rwlock.h>
+#include <isc/time.h>
+#include <isc/types.h>
+#include <isc/util.h>
+
+#include <dns/log.h>
+#include <dns/name.h>
+#include <dns/qp.h>
+#include <dns/types.h>
+
+#include "qp_p.h"
+
+/*
+ * very basic garbage collector statistics
+ *
+ * XXXFANF for now we're logging GC times, but ideally we should
+ * accumulate stats more quietly and report via the statschannel
+ */
+static atomic_uint_fast64_t compact_time;
+static atomic_uint_fast64_t recycle_time;
+static atomic_uint_fast64_t rollback_time;
+
+#if 1
+#define QP_LOG_STATS(...) \
+ isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, DNS_LOGMODULE_QP, \
+ ISC_LOG_DEBUG(1), __VA_ARGS__)
+#else
+#define QP_LOG_STATS(...)
+#endif
+
+#define PRItime " %" PRIu64 " us "
+
+#if 0
+/*
+ * QP_TRACE is generally used in allocation-related functions so it doesn't
+ * trace very high-frequency ops
+ */
+#define QP_TRACE(fmt, ...) \
+ if (isc_log_wouldlog(dns_lctx, ISC_LOG_DEBUG(7))) { \
+ isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, \
+ DNS_LOGMODULE_QP, ISC_LOG_DEBUG(7), \
+ "%s:%d:%s(qp %p ctx \"%s\" gen %u): " fmt, \
+ __FILE__, __LINE__, __func__, qp, TRIENAME(qp), \
+ qp->generation, ##__VA_ARGS__); \
+ } else \
+ do { \
+ } while (0)
+#else
+#define QP_TRACE(...)
+#endif
+
+/***********************************************************************
+ *
+ * converting DNS names to trie keys
+ */
+
+/*
+ * Number of distinct byte values, i.e. 256
+ */
+#define BYTE_VALUES (UINT8_MAX + 1)
+
+/*
+ * Lookup table mapping bytes in DNS names to bit positions, used
+ * by dns_qpkey_fromname() to convert DNS names to qp-trie keys.
+ *
+ * Each element holds one or two bit positions, bit_one in the
+ * lower half and bit_two in the upper half.
+ *
+ * For common hostname characters, bit_two is zero (which cannot
+ * be a valid bit position).
+ *
+ * For others, bit_one is the escape bit, and bit_two is the
+ * position of the character within the escaped range.
+ */
+uint16_t dns_qp_bits_for_byte[BYTE_VALUES] = { 0 };
+
+/*
+ * And the reverse, mapping bit positions to characters, so the tests
+ * can print diagnostics involving qp-trie keys.
+ *
+ * This table only handles the first bit in an escape sequence; we
+ * arrange that we can calculate the byte value for both bits by
+ * adding the second bit to the first bit's byte value.
+ */
+uint8_t dns_qp_byte_for_bit[SHIFT_OFFSET] = { 0 };
+
+/*
+ * Fill in the lookup tables at program startup. (It doesn't matter
+ * when this is initialized relative to other startup code.)
+ */
+static void
+initialize_bits_for_byte(void) ISC_CONSTRUCTOR;
+
+/*
+ * The bit positions have to be between SHIFT_BITMAP and SHIFT_OFFSET.
+ *
+ * Each byte range in between common hostname characters has a different
+ * escape character, to preserve the correct lexical order.
+ *
+ * Escaped byte ranges mostly fit into the space available in the
+ * bitmap, except for those above 'z' (which is mostly bytes with the
+ * top bit set). So, when we reach the end of the bitmap we roll over
+ * to the next escape character.
+ *
+ * After filling the table we ensure that the bit positions for
+ * hostname characters and escape characters all fit.
+ */
+static void
+initialize_bits_for_byte(void) {
+ /* zero marks a common character, so it is not a valid shift position */
+ INSIST(0 < SHIFT_BITMAP);
+ /* first bit is common byte or escape byte */
+ qp_shift_t bit_one = SHIFT_BITMAP;
+ /* second bit is position in escaped range */
+ qp_shift_t bit_two = SHIFT_BITMAP;
+ bool escaping = true;
+
+ for (unsigned int byte = 0; byte < BYTE_VALUES; byte++) {
+ if (qp_common_character(byte)) {
+ escaping = false;
+ bit_one++;
+ dns_qp_byte_for_bit[bit_one] = byte;
+ dns_qp_bits_for_byte[byte] = bit_one;
+ } else if ('A' <= byte && byte <= 'Z') {
+ /* map upper case to lower case */
+ qp_shift_t after_esc = bit_one + 1;
+ qp_shift_t skip_punct = 'a' - '_';
+ qp_shift_t letter = byte - 'A';
+ qp_shift_t bit = after_esc + skip_punct + letter;
+ dns_qp_bits_for_byte[byte] = bit;
+ /* to simplify reverse conversion in the tests */
+ bit_two++;
+ } else {
+ /* non-hostname characters need to be escaped */
+ if (!escaping || bit_two >= SHIFT_OFFSET) {
+ escaping = true;
+ bit_one++;
+ dns_qp_byte_for_bit[bit_one] = byte;
+ bit_two = SHIFT_BITMAP;
+ }
+ dns_qp_bits_for_byte[byte] = bit_two << 8 | bit_one;
+ bit_two++;
+ }
+ }
+ ENSURE(bit_one < SHIFT_OFFSET);
+}
+
+/*
+ * Convert a DNS name into a trie lookup key.
+ *
+ * Returns the length of the key.
+ *
+ * For performance we get our hands dirty in the guts of the name.
+ *
+ * We don't worry about the distinction between absolute and relative
+ * names. When the trie is only used with absolute names, the first byte
+ * of the key will always be SHIFT_NOBYTE and it will always be skipped
+ * when traversing the trie. So keeping the root label costs little, and
+ * it allows us to support tries of relative names too. In fact absolute
+ * and relative names can be mixed in the same trie without causing
+ * confusion, because the presence or absence of the initial
+ * SHIFT_NOBYTE in the key disambiguates them (exactly like a trailing
+ * dot in a zone file).
+ */
+size_t
+dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name) {
+ size_t len, label;
+
+ REQUIRE(ISC_MAGIC_VALID(name, DNS_NAME_MAGIC));
+ REQUIRE(name->offsets != NULL);
+ REQUIRE(name->labels > 0);
+
+ len = 0;
+ label = name->labels;
+ while (label-- > 0) {
+ const uint8_t *ldata = name->ndata + name->offsets[label];
+ size_t label_len = *ldata++;
+ while (label_len-- > 0) {
+ uint16_t bits = dns_qp_bits_for_byte[*ldata++];
+ key[len++] = bits & 0xFF; /* bit_one */
+ if ((bits >> 8) != 0) { /* escape? */
+ key[len++] = bits >> 8; /* bit_two */
+ }
+ }
+ /* label terminator */
+ key[len++] = SHIFT_NOBYTE;
+ }
+ /* mark end with a double NOBYTE */
+ key[len] = SHIFT_NOBYTE;
+ return (len);
+}
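To make the key layout concrete, here is a self-contained sketch of the same conversion working on plain C strings instead of a `dns_name_t`. Several things are assumptions, because `qp_p.h` is not shown in this diff: the values `SHIFT_NOBYTE = 1`, `SHIFT_BITMAP = 2`, `SHIFT_OFFSET = 48` (chosen to agree with the header comment that key bytes lie between 0 and 48 and that missing elements read as 1), and the definition of `common_character()`. The helper names `fill_table()` and `key_from_labels()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; the real constants are defined in qp_p.h. */
enum { SHIFT_NOBYTE = 1, SHIFT_BITMAP = 2, SHIFT_OFFSET = 48 };

static uint16_t bits_for_byte[256];

/* Assumed set of common hostname characters: '-' to '9' and '_' to 'z'. */
static int
common_character(int byte) {
	return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
}

/* Same loop as initialize_bits_for_byte(), without the reverse table. */
static void
fill_table(void) {
	int bit_one = SHIFT_BITMAP, bit_two = SHIFT_BITMAP;
	int escaping = 1;

	for (int byte = 0; byte < 256; byte++) {
		if (common_character(byte)) {
			escaping = 0;
			bits_for_byte[byte] = (uint16_t)++bit_one;
		} else if ('A' <= byte && byte <= 'Z') {
			/* upper case shares the bit of its lower case letter */
			bits_for_byte[byte] = (uint16_t)(bit_one + 1 +
							 ('a' - '_') +
							 (byte - 'A'));
			bit_two++;
		} else {
			if (!escaping || bit_two >= SHIFT_OFFSET) {
				escaping = 1;
				bit_one++;
				bit_two = SHIFT_BITMAP;
			}
			bits_for_byte[byte] = (uint16_t)(bit_two << 8 | bit_one);
			bit_two++;
		}
	}
}

/*
 * Build a key from labels listed left to right. An empty final label
 * stands for the root, so the name is treated as absolute.
 * The key buffer needs room for one element beyond the returned length.
 */
static size_t
key_from_labels(uint8_t *key, const char *const *labels, size_t count) {
	size_t len = 0;

	while (count-- > 0) { /* most significant label first */
		for (const char *p = labels[count]; *p != '\0'; p++) {
			uint16_t bits = bits_for_byte[(uint8_t)*p];
			key[len++] = bits & 0xFF; /* bit_one */
			if ((bits >> 8) != 0) {	  /* escaped byte */
				key[len++] = (uint8_t)(bits >> 8);
			}
		}
		key[len++] = SHIFT_NOBYTE; /* label terminator */
	}
	key[len] = SHIFT_NOBYTE; /* double NOBYTE marks the end */
	return (len);
}
```

Under these assumptions, the labels `{"www", "example", "com", ""}` give a 17-element key that starts with `SHIFT_NOBYTE` for the root label, followed by the elements for `com`, `example`, and `www`, each closed by its own `SHIFT_NOBYTE`. Upper-case input yields the same key, and a `*` label takes two key elements because it is escaped.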
+
+/*
+ * Sentinel value for equal keys
+ */
+#define QPKEY_EQUAL (~(size_t)0)
+
+/*
+ * Compare two keys and return the offset where they differ.
+ *
+ * This offset is used to work out where a trie search diverged: when one
+ * of the keys is in the trie and one is not, the common prefix (up to the
+ * offset) is the part of the unknown key that exists in the trie. This
+ * matters for adding new keys or finding neighbours of missing keys.
+ *
+ * When the keys are different lengths it is possible (but unwise) for
+ * the longer key to be the same as the shorter key but with superfluous
+ * trailing SHIFT_NOBYTE elements. This makes the keys equal for the
+ * purpose of traversing the trie.
+ */
+static size_t
+qpkey_compare(const dns_qpkey_t key_a, const size_t keylen_a,
+ const dns_qpkey_t key_b, const size_t keylen_b) {
+ size_t keylen = ISC_MAX(keylen_a, keylen_b);
+ for (size_t offset = 0; offset < keylen; offset++) {
+ if (qpkey_bit(key_a, keylen_a, offset) !=
+ qpkey_bit(key_b, keylen_b, offset))
+ {
+ return (offset);
+ }
+ }
+ return (QPKEY_EQUAL);
+}
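The comparison above depends on how reads past the end of a key behave. The following sketch restates it in standalone form, assuming that `qpkey_bit()` (defined in `qp_p.h`, not shown here) returns `SHIFT_NOBYTE` for offsets at or beyond the key length, as the header comment on `dns_qpkey_t` suggests. The names `key_bit()`, `key_compare()`, and `KEY_EQUAL`, and the value of `SHIFT_NOBYTE`, are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SHIFT_NOBYTE = 1 }; /* assumed value */

#define KEY_EQUAL (~(size_t)0)

/* Hypothetical stand-in for qpkey_bit(): reads past the end yield NOBYTE. */
static uint8_t
key_bit(const uint8_t *key, size_t len, size_t offset) {
	return (offset < len ? key[offset] : SHIFT_NOBYTE);
}

/* Return the first offset where the keys differ, or KEY_EQUAL. */
static size_t
key_compare(const uint8_t *a, size_t alen, const uint8_t *b, size_t blen) {
	size_t len = alen > blen ? alen : blen;

	for (size_t offset = 0; offset < len; offset++) {
		if (key_bit(a, alen, offset) != key_bit(b, blen, offset)) {
			return (offset);
		}
	}
	return (KEY_EQUAL);
}
```

With this definition, two keys that differ only by trailing `SHIFT_NOBYTE` elements compare as equal, which is the case the comment above calls possible but unwise.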
+
+/***********************************************************************
+ *
+ * allocator wrappers
+ */
+
+#if FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
+
+/*
+ * Optionally (for debugging) during a copy-on-write transaction, use
+ * memory protection to ensure that the shared chunks are not modified.
+ * Once a chunk becomes shared, it remains read-only until it is freed.
+ * POSIX says we have to use mmap() to get an allocation that we can
+ * definitely pass to mprotect().
+ */
+
+static size_t
+chunk_size_raw(void) {
+ size_t size = (size_t)sysconf(_SC_PAGE_SIZE);
+ return (ISC_MAX(size, QP_CHUNK_BYTES));
+}
+
+static void *
+chunk_get_raw(dns_qp_t *qp) {
+ if (qp->write_protect) {
+ size_t size = chunk_size_raw();
+ void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_ANON | MAP_PRIVATE, -1, 0);
+ RUNTIME_CHECK(ptr != MAP_FAILED);
+ return (ptr);
+ } else {
+ return (isc_mem_allocate(qp->mctx, QP_CHUNK_BYTES));
+ }
+}
+
+static void
+chunk_free_raw(dns_qp_t *qp, void *ptr) {
+ if (qp->write_protect) {
+ RUNTIME_CHECK(munmap(ptr, chunk_size_raw()) == 0);
+ } else {
+ isc_mem_free(qp->mctx, ptr);
+ }
+}
+
+static void *
+chunk_shrink_raw(dns_qp_t *qp, void *ptr, size_t bytes) {
+ if (qp->write_protect) {
+ return (ptr);
+ } else {
+ return (isc_mem_reallocate(qp->mctx, ptr, bytes));
+ }
+}
+
+static void
+write_protect(dns_qp_t *qp, void *ptr, bool readonly) {
+ if (qp->write_protect) {
+ int prot = readonly ? PROT_READ : PROT_READ | PROT_WRITE;
+ size_t size = chunk_size_raw();
+ RUNTIME_CHECK(mprotect(ptr, size, prot) >= 0);
+ }
+}
+
+static void
+write_protect_all(dns_qp_t *qp) {
+ for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (chunk != qp->bump && qp->base[chunk] != NULL) {
+ write_protect(qp, qp->base[chunk], true);
+ }
+ }
+}
+
+#else
+
+#define chunk_get_raw(qp) isc_mem_allocate(qp->mctx, QP_CHUNK_BYTES)
+#define chunk_free_raw(qp, ptr) isc_mem_free(qp->mctx, ptr)
+
+#define chunk_shrink_raw(qp, ptr, size) isc_mem_reallocate(qp->mctx, ptr, size)
+
+#define write_protect(qp, chunk, readonly)
+#define write_protect_all(qp)
+
+#endif
+
+static void *
+clone_array(isc_mem_t *mctx, void *oldp, size_t oldsz, size_t newsz,
+ size_t elemsz) {
+ uint8_t *newp = NULL;
+
+ INSIST(oldsz <= newsz);
+ INSIST(newsz < UINT32_MAX);
+ INSIST(elemsz < UINT32_MAX);
+ INSIST(((uint64_t)newsz) * ((uint64_t)elemsz) <= UINT32_MAX);
+
+ /* sometimes we clone an array before it has been populated */
+ if (newsz > 0) {
+ oldsz *= elemsz;
+ newsz *= elemsz;
+ newp = isc_mem_allocate(mctx, newsz);
+ if (oldsz > 0) {
+ memmove(newp, oldp, oldsz);
+ }
+ memset(newp + oldsz, 0, newsz - oldsz);
+ }
+ return (newp);
+}
+
+/***********************************************************************
+ *
+ * allocator
+ */
+
+/*
+ * How many cells are actually in use in a chunk?
+ */
+static inline qp_cell_t
+chunk_usage(dns_qp_t *qp, qp_chunk_t chunk) {
+ return (qp->usage[chunk].used - qp->usage[chunk].free);
+}
+
+/*
+ * Is this chunk wasting space?
+ */
+static inline bool
+chunk_fragmented(dns_qp_t *qp, qp_chunk_t chunk) {
+ return (qp->usage[chunk].free > QP_MAX_FREE);
+}
+
+/*
+ * We can mutate a chunk if it was allocated in the current generation.
+ * This might not be true for the `bump` chunk when it is reused.
+ */
+static inline bool
+chunk_mutable(dns_qp_t *qp, qp_chunk_t chunk) {
+ return (qp->usage[chunk].generation == qp->generation);
+}
+
+/*
+ * When we reuse the bump chunk across multiple write transactions,
+ * it can have an immutable prefix and a mutable suffix.
+ */
+static inline bool
+twigs_mutable(dns_qp_t *qp, qp_ref_t ref) {
+ qp_chunk_t chunk = ref_chunk(ref);
+ qp_cell_t cell = ref_cell(ref);
+ if (chunk == qp->bump) {
+ return (cell >= qp->fender);
+ } else {
+ return (chunk_mutable(qp, chunk));
+ }
+}
+
+/*
+ * Create a fresh bump chunk and allocate some twigs from it.
+ */
+static qp_ref_t
+chunk_alloc(dns_qp_t *qp, qp_chunk_t chunk, qp_weight_t size) {
+ REQUIRE(qp->base[chunk] == NULL);
+ REQUIRE(qp->usage[chunk].generation == 0);
+ REQUIRE(qp->usage[chunk].used == 0);
+ REQUIRE(qp->usage[chunk].free == 0);
+
+ qp->base[chunk] = chunk_get_raw(qp);
+ qp->usage[chunk].generation = qp->generation;
+ qp->usage[chunk].used = size;
+ qp->usage[chunk].free = 0;
+ qp->used_count += size;
+ qp->bump = chunk;
+ qp->fender = 0;
+
+ QP_TRACE("chunk %u gen %u base %p", chunk, qp->usage[chunk].generation,
+ qp->base[chunk]);
+ return (make_ref(chunk, 0));
+}
+
+static void
+free_chunk_arrays(dns_qp_t *qp) {
+ QP_TRACE("base %p usage %p max %u", qp->base, qp->usage, qp->chunk_max);
+ /*
+ * They should both be null or both non-null; if they are out of sync,
+ * this will intentionally trigger an assert in `isc_mem_free()`.
+ */
+ if (qp->base != NULL || qp->usage != NULL) {
+ isc_mem_free(qp->mctx, qp->base);
+ isc_mem_free(qp->mctx, qp->usage);
+ }
+}
+
+/*
+ * This is used both to grow the arrays when they fill up, and to copy them at
+ * the start of an update transaction. We check if the old arrays are in use by
+ * readers, in which case we will do safe memory reclamation later.
+ */
+static void
+clone_chunk_arrays(dns_qp_t *qp, qp_chunk_t newmax) {
+ qp_chunk_t oldmax;
+ void *base, *usage;
+
+ oldmax = qp->chunk_max;
+ qp->chunk_max = newmax;
+
+ base = clone_array(qp->mctx, qp->base, oldmax, newmax,
+ sizeof(*qp->base));
+ usage = clone_array(qp->mctx, qp->usage, oldmax, newmax,
+ sizeof(*qp->usage));
+
+ if (qp->shared_arrays) {
+ qp->shared_arrays = false;
+ } else {
+ free_chunk_arrays(qp);
+ }
+ qp->base = base;
+ qp->usage = usage;
+
+ QP_TRACE("base %p usage %p max %u", qp->base, qp->usage, qp->chunk_max);
+}
+
+/*
+ * There was no space in the bump chunk, so find a place to put a fresh
+ * chunk in the chunk table, then allocate some twigs from it.
+ */
+static qp_ref_t
+alloc_slow(dns_qp_t *qp, qp_weight_t size) {
+ qp_chunk_t chunk;
+
+ for (chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (qp->base[chunk] == NULL) {
+ return (chunk_alloc(qp, chunk, size));
+ }
+ }
+ ENSURE(chunk == qp->chunk_max);
+ clone_chunk_arrays(qp, GROWTH_FACTOR(chunk));
+ return (chunk_alloc(qp, chunk, size));
+}
+
+/*
+ * Ensure we are using a fresh bump chunk.
+ */
+static void
+alloc_reset(dns_qp_t *qp) {
+ (void)alloc_slow(qp, 0);
+}
+
+/*
+ * Allocate some fresh twigs. This is the bump allocator fast path.
+ */
+static inline qp_ref_t
+alloc_twigs(dns_qp_t *qp, qp_weight_t size) {
+ qp_chunk_t chunk = qp->bump;
+ qp_cell_t cell = qp->usage[chunk].used;
+ if (cell + size <= QP_CHUNK_SIZE) {
+ qp->usage[chunk].used += size;
+ qp->used_count += size;
+ return (make_ref(chunk, cell));
+ } else {
+ return (alloc_slow(qp, size));
+ }
+}
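+
+/*
+ * Worked example of the fast path above (illustrative arithmetic only,
+ * assuming the non-fuzzing QP_CHUNK_LOG of 10, so that QP_CHUNK_SIZE
+ * is 1024 cells): if the bump chunk has `used == 1020`, then
+ *
+ *   alloc_twigs(qp, 4)    1020 + 4 <= 1024, so the twigs start at
+ *                         cell 1020 and `used` becomes 1024
+ *   alloc_twigs(qp, 5)    1020 + 5 > 1024, so alloc_slow() sets up
+ *                         a fresh chunk and the twigs start at cell 0
+ *
+ * In the second case the last 4 cells of the old bump chunk remain
+ * unallocated.
+ */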
+
+/*
+ * Record that some twigs are no longer being used, and if possible
+ * zero them to ensure that there isn't a spurious double detach when
+ * the chunk is later recycled.
+ *
+ * NOTE: the caller is responsible for attaching or detaching any
+ * leaves as required.
+ */
+static inline void
+free_twigs(dns_qp_t *qp, qp_ref_t twigs, qp_weight_t size) {
+ qp_chunk_t chunk = ref_chunk(twigs);
+
+ qp->free_count += size;
+ qp->usage[chunk].free += size;
+ ENSURE(qp->free_count <= qp->used_count);
+ ENSURE(qp->usage[chunk].free <= qp->usage[chunk].used);
+
+ if (twigs_mutable(qp, twigs)) {
+ zero_twigs(ref_ptr(qp, twigs), size);
+ } else {
+ qp->hold_count += size;
+ ENSURE(qp->free_count >= qp->hold_count);
+ }
+}
+
+/***********************************************************************
+ *
+ * chunk reclamation
+ */
+
+/*
+ * When a chunk is being recycled after a long-running read transaction,
+ * or after a rollback, we need to detach any leaves that remain.
+ */
+static void
+chunk_free(dns_qp_t *qp, qp_chunk_t chunk) {
+ QP_TRACE("chunk %u gen %u base %p", chunk, qp->usage[chunk].generation,
+ qp->base[chunk]);
+
+ qp_node_t *n = qp->base[chunk];
+ write_protect(qp, n, false);
+
+ for (qp_cell_t count = qp->usage[chunk].used; count > 0; count--, n++) {
+ if (!is_branch(n) && leaf_pval(n) != NULL) {
+ detach_leaf(qp, n);
+ }
+ }
+ chunk_free_raw(qp, qp->base[chunk]);
+
+ INSIST(qp->used_count >= qp->usage[chunk].used);
+ INSIST(qp->free_count >= qp->usage[chunk].free);
+ qp->used_count -= qp->usage[chunk].used;
+ qp->free_count -= qp->usage[chunk].free;
+ qp->usage[chunk].used = 0;
+ qp->usage[chunk].free = 0;
+ qp->usage[chunk].generation = 0;
+ qp->base[chunk] = NULL;
+}
+
+/*
+ * If we have any nodes on hold during a transaction, we must leave
+ * immutable chunks intact. As the last stage of safe memory reclamation,
+ * we can clear the hold counter and recycle all empty chunks (even from a
+ * nominally read-only `dns_qp_t`) because nothing refers to them any more.
+ *
+ * If we are using RCU, this can be called by `defer_rcu()` or `call_rcu()`
+ * to clean up after readers have left their critical sections.
+ */
+static void
+recycle(dns_qp_t *qp) {
+ isc_time_t t0, t1;
+ uint64_t time;
+ unsigned int live = 0;
+ unsigned int keep = 0;
+ unsigned int free = 0;
+
+ QP_TRACE("expect to free %u cells -> %u chunks",
+ (qp->free_count - qp->hold_count),
+ (qp->free_count - qp->hold_count) / QP_CHUNK_SIZE);
+
+ isc_time_now_hires(&t0);
+
+ for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (qp->base[chunk] == NULL) {
+ continue;
+ } else if (chunk == qp->bump || chunk_usage(qp, chunk) > 0) {
+ live++;
+ } else if (chunk_mutable(qp, chunk) || qp->hold_count == 0) {
+ chunk_free(qp, chunk);
+ free++;
+ } else {
+ keep++;
+ }
+ }
+
+ isc_time_now_hires(&t1);
+ time = isc_time_microdiff(&t1, &t0);
+ atomic_fetch_add_relaxed(&recycle_time, time);
+
+ QP_LOG_STATS("qp recycle" PRItime "live %u keep %u free %u chunks",
+ time, live, keep, free);
+ QP_LOG_STATS("qp recycle after leaf %u live %u used %u free %u hold %u",
+ qp->leaf_count, qp->used_count - qp->free_count,
+ qp->used_count, qp->free_count, qp->hold_count);
+}
+
+/***********************************************************************
+ *
+ * garbage collector
+ */
+
+/*
+ * Move a branch node's twigs to the `bump` chunk, for copy-on-write
+ * or for garbage collection. We don't update the node in place
+ * because `compact_recursive()` does not ensure the node itself is
+ * mutable until after it discovers evacuation was necessary.
+ */
+static qp_ref_t
+evacuate_twigs(dns_qp_t *qp, qp_node_t *n) {
+ qp_weight_t size = branch_twigs_size(n);
+ qp_ref_t old_ref = branch_twigs_ref(n);
+ qp_ref_t new_ref = alloc_twigs(qp, size);
+ qp_node_t *old_twigs = ref_ptr(qp, old_ref);
+ qp_node_t *new_twigs = ref_ptr(qp, new_ref);
+
+ move_twigs(new_twigs, old_twigs, size);
+ free_twigs(qp, old_ref, size);
+
+ /*
+ * free_twigs() could not zero out the old twigs,
+ * so we have to re-attach to any leaves
+ */
+ if (!twigs_mutable(qp, old_ref)) {
+ for (qp_weight_t pos = 0; pos < size; pos++) {
+ qp_node_t *twig = &new_twigs[pos];
+ if (!is_branch(twig)) {
+ attach_leaf(qp, twig);
+ }
+ }
+ }
+
+ return (new_ref);
+}
+
+/*
+ * Evacuate the node's twigs and update the node in place.
+ */
+static void
+evacuate(dns_qp_t *qp, qp_node_t *n) {
+ *n = make_node(branch_index(n), evacuate_twigs(qp, n));
+}
+
+/*
+ * Compact the trie by traversing the whole thing recursively, copying
+ * bottom-up as required. The aim is to avoid evacuation as much as
+ * possible, but when parts of the trie are shared, we need to evacuate
+ * the paths from the root to the parts of the trie that occupy
+ * fragmented chunks.
+ *
+ * Without the "should we evacuate?" check, the algorithm will leave
+ * the trie unchanged. If the twigs are all leaves, the loop changes
+ * nothing, so we will return this node's original ref. If all of the
+ * twigs that are branches did not need moving, again, the loop
+ * changes nothing. So the evacuation check is the only place that the
+ * algorithm introduces ref changes, that then bubble up through the
+ * logic inside the loop.
+ */
+static qp_ref_t
+compact_recursive(dns_qp_t *qp, qp_node_t *parent) {
+ qp_ref_t ref = branch_twigs_ref(parent);
+ /* should we evacuate the twigs? */
+ if (chunk_fragmented(qp, ref_chunk(ref)) || qp->compact_all) {
+ ref = evacuate_twigs(qp, parent);
+ }
+ bool mutable = twigs_mutable(qp, ref);
+ qp_weight_t size = branch_twigs_size(parent);
+ for (qp_weight_t pos = 0; pos < size; pos++) {
+ qp_node_t *child = ref_ptr(qp, ref) + pos;
+ if (!is_branch(child)) {
+ continue;
+ }
+ qp_ref_t old_ref = branch_twigs_ref(child);
+ qp_ref_t new_ref = compact_recursive(qp, child);
+ if (old_ref == new_ref) {
+ continue;
+ }
+ if (!mutable) {
+ ref = evacuate_twigs(qp, parent);
+ /* the twigs have moved */
+ child = ref_ptr(qp, ref) + pos;
+ mutable = true;
+ }
+ *child = make_node(branch_index(child), new_ref);
+ }
+ return (ref);
+}
+
+static void
+compact(dns_qp_t *qp) {
+ isc_time_t t0, t1;
+ uint64_t time;
+
+ QP_LOG_STATS(
+ "qp compact before leaf %u live %u used %u free %u hold %u",
+ qp->leaf_count, qp->used_count - qp->free_count, qp->used_count,
+ qp->free_count, qp->hold_count);
+
+ isc_time_now_hires(&t0);
+
+ /*
+ * Reset the bump chunk if it is fragmented.
+ */
+ if (chunk_fragmented(qp, qp->bump)) {
+ alloc_reset(qp);
+ }
+
+ if (is_branch(&qp->root)) {
+ qp->root = make_node(branch_index(&qp->root),
+ compact_recursive(qp, &qp->root));
+ }
+ qp->compact_all = false;
+
+ isc_time_now_hires(&t1);
+ time = isc_time_microdiff(&t1, &t0);
+ atomic_fetch_add_relaxed(&compact_time, time);
+
+ QP_LOG_STATS("qp compact" PRItime
+ "leaf %u live %u used %u free %u hold %u",
+ time, qp->leaf_count, qp->used_count - qp->free_count,
+ qp->used_count, qp->free_count, qp->hold_count);
+}
+
+void
+dns_qp_compact(dns_qp_t *qp) {
+ REQUIRE(VALID_QP(qp));
+ qp->compact_all = true;
+ compact(qp);
+ recycle(qp);
+}
+
+static void
+auto_compact_recycle(dns_qp_t *qp) {
+ compact(qp);
+ recycle(qp);
+ /*
+ * This shouldn't happen if the garbage collector is
+ * working correctly. We can recover at the cost of some
+ * time and space, but recovery should be cheaper than
+ * letting compact+recycle fail repeatedly.
+ */
+ if (QP_MAX_GARBAGE(qp)) {
+ isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE,
+ DNS_LOGMODULE_QP, ISC_LOG_NOTICE,
+ "qp %p ctx \"%s\" compact/recycle "
+ "failed to recover any space, "
+ "scheduling a full compaction",
+ qp, TRIENAME(qp));
+ qp->compact_all = true;
+ }
+}
+
+/*
+ * Free some twigs and compact the trie if necessary; the space
+ * accounting is similar to `evacuate_twigs()` above.
+ *
+ * This is called by the trie modification API entry points. The
+ * free_twigs() function requires the caller to attach or detach any
+ * leaves as necessary. Callers of squash_twigs() satisfy this
+ * requirement by calling make_twigs_mutable().
+ *
+ * Aside: In typical garbage collectors, compaction is triggered when
+ * the allocator runs out of space. But that is because typical garbage
+ * collectors do not know how much memory can be recovered, so they must
+ * find out by scanning the heap. The qp-trie code was originally
+ * designed to use malloc() and free(), so it has more information about
+ * when garbage collection might be worthwhile. Hence we can trigger
+ * collection when garbage passes a threshold.
+ *
+ * XXXFANF: If we need to avoid latency outliers caused by compaction in
+ * write transactions, we can check qp->transaction_mode here.
+ */
+static inline void
+squash_twigs(dns_qp_t *qp, qp_ref_t twigs, qp_weight_t size) {
+ free_twigs(qp, twigs, size);
+ if (twigs_mutable(qp, twigs) && QP_MAX_GARBAGE(qp)) {
+ auto_compact_recycle(qp);
+ }
+}
+
+/*
+ * Shared twigs need copy-on-write. As we walk down the trie finding
+ * the right place to modify, make_twigs_mutable() is called to ensure
+ * that shared nodes on the path from the root are copied to a mutable
+ * chunk.
+ */
+static inline void
+make_twigs_mutable(dns_qp_t *qp, qp_node_t *n) {
+ if (!twigs_mutable(qp, branch_twigs_ref(n))) {
+ evacuate(qp, n);
+ }
+}
+
+/***********************************************************************
+ *
+ * public accessors for memory management internals
+ */
+
+dns_qp_memusage_t
+dns_qp_memusage(dns_qp_t *qp) {
+ REQUIRE(VALID_QP(qp));
+
+ dns_qp_memusage_t memusage = {
+ .ctx = qp->ctx,
+ .leaves = qp->leaf_count,
+ .live = qp->used_count - qp->free_count,
+ .used = qp->used_count,
+ .hold = qp->hold_count,
+ .free = qp->free_count,
+ .node_size = sizeof(qp_node_t),
+ .chunk_size = QP_CHUNK_SIZE,
+ };
+
+ for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (qp->base[chunk] != NULL) {
+ memusage.chunk_count += 1;
+ }
+ }
+
+ /* slight over-estimate if chunks have been shrunk */
+ memusage.bytes = memusage.chunk_count * QP_CHUNK_BYTES +
+ qp->chunk_max * sizeof(*qp->base) +
+ qp->chunk_max * sizeof(*qp->usage);
+
+ return (memusage);
+}
+
+void
+dns_qp_gctime(uint64_t *compact_p, uint64_t *recycle_p, uint64_t *rollback_p) {
+ *compact_p = atomic_load_relaxed(&compact_time);
+ *recycle_p = atomic_load_relaxed(&recycle_time);
+ *rollback_p = atomic_load_relaxed(&rollback_time);
+}
+
+/***********************************************************************
+ *
+ * read-write transactions
+ */
+
+static dns_qp_t *
+transaction_open(dns_qpmulti_t *multi, dns_qp_t **qptp) {
+ dns_qp_t *qp, *old;
+
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qptp != NULL && *qptp == NULL);
+
+ LOCK(&multi->mutex);
+
+ old = multi->read;
+ qp = write_phase(multi);
+
+ INSIST(VALID_QP(old));
+ INSIST(!VALID_QP(qp));
+
+ /*
+ * prepare for copy-on-write
+ */
+ *qp = *old;
+ qp->shared_arrays = true;
+ qp->hold_count = qp->free_count;
+
+ /*
+ * Start a new generation, and ensure it isn't zero because we
+ * want to avoid confusion with unset qp->usage structures.
+ */
+ if (++qp->generation == 0) {
+ ++qp->generation;
+ }
+
+ *qptp = qp;
+ return (qp);
+}
+
+/*
+ * a write is light
+ *
+ * We need to ensure we allocate from a fresh chunk if the last transaction
+ * shrunk the bump chunk; but usually in a sequence of write transactions
+ * we just mark the point where we started this generation.
+ *
+ * (Instead of keeping the previous transaction's mode, I considered
+ * forcing allocation into the slow path by fiddling with the bump
+ * chunk's usage counters. But that is troublesome because
+ * `chunk_free()` needs to know how much of the chunk to scan.)
+ */
+void
+dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp) {
+ dns_qp_t *qp = transaction_open(multi, qptp);
+ QP_TRACE("");
+
+ if (qp->transaction_mode == QP_UPDATE) {
+ alloc_reset(qp);
+ } else {
+ qp->fender = qp->usage[qp->bump].used;
+ }
+
+ qp->transaction_mode = QP_WRITE;
+ write_protect_all(qp);
+}
+
+/*
+ * an update is heavy
+ *
+ * Make sure we have copies of all usage counters so that we can roll back.
+ * Do this before allocating a bump chunk so that all chunks allocated in
+ * this transaction are in the fresh chunk arrays. (If the existing chunk
+ * arrays happen to be full we might immediately clone them a second time.
+ * Probably not worth worrying about?)
+ */
+void
+dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp) {
+ dns_qp_t *qp = transaction_open(multi, qptp);
+ QP_TRACE("");
+
+ clone_chunk_arrays(qp, qp->chunk_max);
+ alloc_reset(qp);
+
+ qp->transaction_mode = QP_UPDATE;
+ write_protect_all(qp);
+}
+
+void
+dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp) {
+ dns_qp_t *qp, *old;
+
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qptp != NULL);
+ REQUIRE(*qptp == write_phase(multi));
+
+ old = multi->read;
+ qp = write_phase(multi);
+
+ QP_TRACE("");
+
+ if (qp->transaction_mode == QP_UPDATE) {
+ qp_chunk_t c;
+ size_t bytes;
+
+ compact(qp);
+ c = qp->bump;
+ bytes = qp->usage[c].used * sizeof(qp_node_t);
+ if (bytes == 0) {
+ chunk_free(qp, c);
+ } else {
+ qp->base[c] = chunk_shrink_raw(qp, qp->base[c], bytes);
+ }
+ }
+
+#if HAVE_LIBURCU
+ rcu_assign_pointer(multi->read, qp);
+ /*
+ * XXXFANF: At this point we need to wait for a grace period (to be
+ * sure readers have finished) before recovering memory. This is not
+ * very fast, hurting write throughput. To fix it we need read
+ * transactions to be able to survive multiple write transactions, so
+ * that it matters less if we are slow to detect when readers have
+ * exited their critical sections. Instead of the current read / snap
+ * distinction, we need to allocate a read snapshot when a
+ * transaction commits, and clean it up (along with the unused
+ * chunks) in an rcu callback.
+ */
+ synchronize_rcu();
+#else
+ RWLOCK(&multi->rwlock, isc_rwlocktype_write);
+ multi->read = qp;
+ RWUNLOCK(&multi->rwlock, isc_rwlocktype_write);
+#endif
+
+ /*
+ * Were the chunk arrays reallocated at some point?
+ */
+ if (qp->shared_arrays) {
+ INSIST(old->base == qp->base);
+ INSIST(old->usage == qp->usage);
+ /* this becomes correct when `*old` is invalidated */
+ qp->shared_arrays = false;
+ } else {
+ INSIST(old->base != qp->base);
+ INSIST(old->usage != qp->usage);
+ free_chunk_arrays(old);
+ }
+
+ /*
+ * It is safe to recycle all empty chunks if they aren't being
+ * used by snapshots.
+ */
+ qp->hold_count = 0;
+ if (multi->snapshots == 0) {
+ recycle(qp);
+ }
+
+ *old = (dns_qp_t){};
+ *qptp = NULL;
+ UNLOCK(&multi->mutex);
+}
+
+/*
+ * Throw away everything that was allocated during this transaction.
+ */
+void
+dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp) {
+ dns_qp_t *qp;
+ isc_time_t t0, t1;
+ uint64_t time;
+ unsigned int free = 0;
+
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qptp != NULL);
+ REQUIRE(*qptp == write_phase(multi));
+
+ qp = *qptp;
+
+ REQUIRE(qp->transaction_mode == QP_UPDATE);
+ QP_TRACE("");
+
+ isc_time_now_hires(&t0);
+
+ /*
+ * recycle any chunks allocated in this transaction,
+ * including the bump chunk, and detach value objects
+ */
+ for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (qp->base[chunk] != NULL && chunk_mutable(qp, chunk)) {
+ chunk_free(qp, chunk);
+ free++;
+ }
+ }
+
+ /* free the cloned arrays */
+ INSIST(!qp->shared_arrays);
+ free_chunk_arrays(qp);
+
+ isc_time_now_hires(&t1);
+ time = isc_time_microdiff(&t1, &t0);
+ atomic_fetch_add_relaxed(&rollback_time, time);
+
+ QP_LOG_STATS("qp rollback" PRItime "free %u chunks", time, free);
+
+ *qp = (dns_qp_t){};
+ *qptp = NULL;
+ UNLOCK(&multi->mutex);
+}
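+
+/*
+ * Usage sketch for the transaction functions above (not part of the
+ * implementation; `multi`, `pval`, and `ival` are hypothetical caller
+ * variables, and the error handling is only illustrative):
+ *
+ *   dns_qp_t *qp = NULL;
+ *   dns_qpmulti_update(multi, &qp);
+ *   isc_result_t result = dns_qp_insert(qp, pval, ival);
+ *   if (result == ISC_R_SUCCESS) {
+ *           dns_qpmulti_commit(multi, &qp);
+ *   } else {
+ *           dns_qpmulti_rollback(multi, &qp);
+ *   }
+ *
+ * Both commit and rollback reset `qp` to NULL and release the mutex
+ * taken by transaction_open(). Rollback requires QP_UPDATE mode, so a
+ * transaction opened with dns_qpmulti_write() can only be committed.
+ */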
+
+/***********************************************************************
+ *
+ * read-only transactions
+ */
+
+/*
+ * a query is light
+ */
+
+void
+dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp) {
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qprp != NULL && *qprp == NULL);
+
+#if HAVE_LIBURCU
+ rcu_read_lock();
+ *qprp = (dns_qpread_t *)rcu_dereference(multi->read);
+#else
+ RWLOCK(&multi->rwlock, isc_rwlocktype_read);
+ *qprp = (dns_qpread_t *)multi->read;
+#endif
+}
+
+void
+dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp) {
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qprp != NULL && *qprp != NULL);
+
+ /*
+ * when we are using RCU, multi->read can change during
+ * our critical section, so it may differ from *qprp
+ */
+ dns_qp_t *qp = (dns_qp_t *)*qprp;
+ *qprp = NULL;
+ REQUIRE(qp == &multi->phase[0] || qp == &multi->phase[1]);
+
+#if HAVE_LIBURCU
+ rcu_read_unlock();
+#else
+ RWUNLOCK(&multi->rwlock, isc_rwlocktype_read);
+#endif
+}
+
+/*
+ * a snapshot is heavy
+ */
+
+void
+dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp) {
+ dns_qp_t *old;
+ dns_qpsnap_t *qp;
+ size_t array_size, alloc_size;
+
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qpsp != NULL && *qpsp == NULL);
+
+ /*
+ * we need a consistent view of the chunk base array and chunk_max so
+ * we can't use the rwlock here (nor can we use dns_qpmulti_query)
+ */
+ LOCK(&multi->mutex);
+ old = multi->read;
+
+ array_size = sizeof(qp_node_t *) * old->chunk_max;
+ alloc_size = sizeof(dns_qpsnap_t) + array_size;
+ qp = isc_mem_allocate(old->mctx, alloc_size);
+ *qp = (dns_qpsnap_t){
+ .magic = QP_MAGIC,
+ .root = old->root,
+ .methods = old->methods,
+ .ctx = old->ctx,
+ .generation = old->generation,
+ .base = qp->base_array,
+ .whence = multi,
+ };
+ /* sometimes we take a snapshot of an empty trie */
+ if (array_size > 0) {
+ memmove(qp->base, old->base, array_size);
+ }
+
+ multi->snapshots++;
+ *qpsp = qp;
+
+ QP_TRACE("multi %p snaps %u", multi, multi->snapshots);
+ UNLOCK(&multi->mutex);
+}
+
+void
+dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp) {
+ dns_qpsnap_t *qp;
+
+ REQUIRE(VALID_QPMULTI(multi));
+ REQUIRE(qpsp != NULL && *qpsp != NULL);
+
+ qp = *qpsp;
+ *qpsp = NULL;
+
+ /*
+ * `multi` and `whence` are redundant, but it helps
+ * to make sure the API is being used correctly
+ */
+ REQUIRE(multi == qp->whence);
+
+ LOCK(&multi->mutex);
+ QP_TRACE("multi %p snaps %u gen %u", multi, multi->snapshots,
+ multi->read->generation);
+
+ isc_mem_free(multi->read->mctx, qp);
+ multi->snapshots--;
+ if (multi->snapshots == 0) {
+ /*
+ * Clean up if there were updates while we were working,
+ * and we are the last snapshot keeping the memory alive
+ */
+ recycle(multi->read);
+ }
+ UNLOCK(&multi->mutex);
+}
+
+/***********************************************************************
+ *
+ * constructors, destructors
+ */
+
+static void
+initialize_guts(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+ dns_qp_t *qp) {
+ REQUIRE(methods != NULL);
+ REQUIRE(methods->attach != NULL);
+ REQUIRE(methods->detach != NULL);
+ REQUIRE(methods->makekey != NULL);
+ REQUIRE(methods->triename != NULL);
+
+ *qp = (dns_qp_t){
+ .magic = QP_MAGIC,
+ .methods = methods,
+ .ctx = ctx,
+ };
+ isc_mem_attach(mctx, &qp->mctx);
+}
+
+void
+dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+ dns_qp_t **qptp) {
+ dns_qp_t *qp;
+
+ REQUIRE(qptp != NULL && *qptp == NULL);
+
+ qp = isc_mem_get(mctx, sizeof(*qp));
+ initialize_guts(mctx, methods, ctx, qp);
+ alloc_reset(qp);
+ QP_TRACE("");
+ *qptp = qp;
+}
+
+void
+dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+ dns_qpmulti_t **qpmp) {
+ dns_qpmulti_t *multi;
+ dns_qp_t *qp;
+
+ REQUIRE(qpmp != NULL && *qpmp == NULL);
+
+ multi = isc_mem_get(mctx, sizeof(*multi));
+ *multi = (dns_qpmulti_t){
+ .magic = QPMULTI_MAGIC,
+ .read = &multi->phase[0],
+ };
+ isc_rwlock_init(&multi->rwlock);
+ isc_mutex_init(&multi->mutex);
+
+ /*
+ * Do not waste effort allocating a bump chunk that will be thrown
+ * away when a transaction is opened. dns_qpmulti_update() always
+ * allocates; to ensure dns_qpmulti_write() does too, pretend the
+ * previous transaction was an update.
+ */
+ qp = multi->read;
+ initialize_guts(mctx, methods, ctx, qp);
+ qp->transaction_mode = QP_UPDATE;
+ QP_TRACE("");
+ *qpmp = multi;
+}
+
+static void
+destroy_guts(dns_qp_t *qp) {
+ if (qp->leaf_count == 1) {
+ detach_leaf(qp, &qp->root);
+ }
+ if (qp->chunk_max == 0) {
+ return;
+ }
+ for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) {
+ if (qp->base[chunk] != NULL) {
+ chunk_free(qp, chunk);
+ }
+ }
+ ENSURE(qp->used_count == 0);
+ ENSURE(qp->free_count == 0);
+ ENSURE(qp->hold_count == 0);
+ free_chunk_arrays(qp);
+}
+
+void
+dns_qp_destroy(dns_qp_t **qptp) {
+ dns_qp_t *qp;
+
+ REQUIRE(qptp != NULL);
+ REQUIRE(VALID_QP(*qptp));
+
+ qp = *qptp;
+ *qptp = NULL;
+
+ /* do not try to destroy part of a dns_qpmulti_t */
+ REQUIRE(qp->transaction_mode == QP_NONE);
+
+ QP_TRACE("");
+ destroy_guts(qp);
+ isc_mem_putanddetach(&qp->mctx, qp, sizeof(*qp));
+}
+
+void
+dns_qpmulti_destroy(dns_qpmulti_t **qpmp) {
+ dns_qp_t *qp = NULL;
+ dns_qpmulti_t *multi = NULL;
+
+ REQUIRE(qpmp != NULL);
+ REQUIRE(VALID_QPMULTI(*qpmp));
+
+ multi = *qpmp;
+ qp = multi->read;
+ *qpmp = NULL;
+
+ REQUIRE(VALID_QP(qp));
+ REQUIRE(!VALID_QP(write_phase(multi)));
+ REQUIRE(multi->snapshots == 0);
+
+ QP_TRACE("");
+ destroy_guts(qp);
+ isc_mutex_destroy(&multi->mutex);
+ isc_rwlock_destroy(&multi->rwlock);
+ isc_mem_putanddetach(&qp->mctx, multi, sizeof(*multi));
+}
+
+/***********************************************************************
+ *
+ * modification
+ */
+
+isc_result_t
+dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival) {
+ qp_ref_t new_ref, old_ref;
+ qp_node_t new_leaf, old_node;
+ qp_node_t *new_twigs, *old_twigs;
+ qp_shift_t new_bit, old_bit;
+ dns_qpkey_t new_key, old_key;
+ size_t new_keylen, old_keylen;
+ size_t offset;
+ uint64_t index;
+ qp_shift_t bit;
+ qp_weight_t pos, size;
+ qp_node_t *n;
+
+ REQUIRE(VALID_QP(qp));
+
+ new_leaf = make_leaf(pval, ival);
+ new_keylen = leaf_qpkey(qp, &new_leaf, new_key);
+
+ /* first leaf in an empty trie? */
+ if (qp->leaf_count == 0) {
+ qp->root = new_leaf;
+ qp->leaf_count++;
+ attach_leaf(qp, &new_leaf);
+ return (ISC_R_SUCCESS);
+ }
+
+ /*
+ * We need to keep searching down to a leaf even if our key is
+ * missing from this branch. It doesn't matter which twig we
+ * choose since the keys are all the same up to this node's
+ * offset. Note that if we simply use branch_twig_pos(n, bit)
+ * we may get an out-of-bounds access if our bit is greater
+ * than all the set bits in the node.
+ */
+ n = &qp->root;
+ while (is_branch(n)) {
+ prefetch_twigs(qp, n);
+ bit = branch_keybit(n, new_key, new_keylen);
+ pos = branch_has_twig(n, bit) ? branch_twig_pos(n, bit) : 0;
+ n = branch_twigs_vector(qp, n) + pos;
+ }
+
+ /* do the keys differ, and if so, where? */
+ old_keylen = leaf_qpkey(qp, n, old_key);
+ offset = qpkey_compare(new_key, new_keylen, old_key, old_keylen);
+ if (offset == QPKEY_EQUAL) {
+ return (ISC_R_EXISTS);
+ }
+ new_bit = qpkey_bit(new_key, new_keylen, offset);
+ old_bit = qpkey_bit(old_key, old_keylen, offset);
+
+ qp->leaf_count++;
+ attach_leaf(qp, &new_leaf);
+
+ /* find where to insert a branch or grow an existing branch. */
+ n = &qp->root;
+ while (is_branch(n)) {
+ prefetch_twigs(qp, n);
+ if (offset < branch_key_offset(n)) {
+ goto newbranch;
+ }
+ if (offset == branch_key_offset(n)) {
+ goto growbranch;
+ }
+ make_twigs_mutable(qp, n);
+ bit = branch_keybit(n, new_key, new_keylen);
+ INSIST(branch_has_twig(n, bit));
+ n = branch_twig_ptr(qp, n, bit);
+ }
+
+newbranch:
+ new_ref = alloc_twigs(qp, 2);
+ new_twigs = ref_ptr(qp, new_ref);
+
+ /* save before overwriting. */
+ old_node = *n;
+
+ /* new branch node takes old node's place */
+ index = BRANCH_TAG | (1ULL << new_bit) | (1ULL << old_bit) |
+ ((uint64_t)offset << SHIFT_OFFSET);
+ *n = make_node(index, new_ref);
+
+ /* populate twigs */
+ new_twigs[old_bit > new_bit] = old_node;
+ new_twigs[new_bit > old_bit] = new_leaf;
+
+ return (ISC_R_SUCCESS);
+
+growbranch:
+ INSIST(!branch_has_twig(n, new_bit));
+
+ /* locate twigs vectors */
+ size = branch_twigs_size(n);
+ old_ref = branch_twigs_ref(n);
+ new_ref = alloc_twigs(qp, size + 1);
+ old_twigs = ref_ptr(qp, old_ref);
+ new_twigs = ref_ptr(qp, new_ref);
+
+ /* embiggen branch node */
+ index = branch_index(n) | (1ULL << new_bit);
+ *n = make_node(index, new_ref);
+
+ /* embiggen twigs vector */
+ pos = branch_twig_pos(n, new_bit);
+ move_twigs(new_twigs, old_twigs, pos);
+ new_twigs[pos] = new_leaf;
+ move_twigs(new_twigs + pos + 1, old_twigs + pos, size - pos);
+
+ /* clean up */
+ squash_twigs(qp, old_ref, size);
+
+ return (ISC_R_SUCCESS);
+}
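+
+/*
+ * Worked example of the `newbranch` twig ordering above (the bit
+ * numbers are made up for illustration): with new_bit == 5 and
+ * old_bit == 9,
+ *
+ *   new_twigs[old_bit > new_bit]    is new_twigs[1], set to old_node
+ *   new_twigs[new_bit > old_bit]    is new_twigs[0], set to new_leaf
+ *
+ * so the twig with the smaller bit comes first, in the same order as
+ * the set bits in the branch's bitmap. The two bits cannot be equal
+ * because `offset` is where the keys differ.
+ */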
+
+isc_result_t
+dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t search_key,
+ size_t search_keylen) {
+ dns_qpkey_t found_key;
+ size_t found_keylen;
+ qp_shift_t bit = 0; /* suppress warning */
+ qp_weight_t pos, size;
+ qp_ref_t ref;
+ qp_node_t *twigs;
+ qp_node_t *parent;
+ qp_node_t *n;
+
+ REQUIRE(VALID_QP(qp));
+
+ parent = NULL;
+ n = &qp->root;
+ while (is_branch(n)) {
+ prefetch_twigs(qp, n);
+ bit = branch_keybit(n, search_key, search_keylen);
+ if (!branch_has_twig(n, bit)) {
+ return (ISC_R_NOTFOUND);
+ }
+ make_twigs_mutable(qp, n);
+ parent = n;
+ n = branch_twig_ptr(qp, n, bit);
+ }
+
+ /* empty trie? */
+ if (leaf_pval(n) == NULL) {
+ return (ISC_R_NOTFOUND);
+ }
+
+ found_keylen = leaf_qpkey(qp, n, found_key);
+ if (qpkey_compare(search_key, search_keylen, found_key, found_keylen) !=
+ QPKEY_EQUAL)
+ {
+ return (ISC_R_NOTFOUND);
+ }
+
+ qp->leaf_count--;
+ detach_leaf(qp, n);
+
+ /* trie becomes empty */
+ if (qp->leaf_count == 0) {
+ INSIST(n == &qp->root && parent == NULL);
+ zero_twigs(n, 1);
+ return (ISC_R_SUCCESS);
+ }
+
+ /* step back to parent node */
+ n = parent;
+ parent = NULL;
+
+ INSIST(bit != 0);
+ size = branch_twigs_size(n);
+ pos = branch_twig_pos(n, bit);
+ ref = branch_twigs_ref(n);
+ twigs = ref_ptr(qp, ref);
+
+ if (size == 2) {
+ /*
+ * move the other twig to the parent branch.
+ */
+ *n = twigs[!pos];
+ squash_twigs(qp, ref, 2);
+ } else {
+ /*
+ * shrink the twigs in place, to avoid using the bump
+ * chunk too fast - the gc will clean up after us
+ */
+ *n = make_node(branch_index(n) & ~(1ULL << bit), ref);
+ move_twigs(twigs + pos, twigs + pos + 1, size - pos - 1);
+ squash_twigs(qp, ref + size - 1, 1);
+ }
+
+ return (ISC_R_SUCCESS);
+}
+
+isc_result_t
+dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name) {
+ dns_qpkey_t key;
+ size_t keylen = dns_qpkey_fromname(key, name);
+ return (dns_qp_deletekey(qp, key, keylen));
+}
+
+/***********************************************************************
+ *
+ * search
+ */
+
+isc_result_t
+dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t search_key,
+ size_t search_keylen, void **pval_r, uint32_t *ival_r) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ dns_qpkey_t found_key;
+ size_t found_keylen;
+ qp_shift_t bit;
+ qp_node_t *n;
+
+ REQUIRE(VALID_QP(qp));
+ REQUIRE(pval_r != NULL);
+ REQUIRE(ival_r != NULL);
+
+ n = &qp->root;
+ while (is_branch(n)) {
+ prefetch_twigs(qp, n);
+ bit = branch_keybit(n, search_key, search_keylen);
+ if (!branch_has_twig(n, bit)) {
+ return (ISC_R_NOTFOUND);
+ }
+ n = branch_twig_ptr(qp, n, bit);
+ }
+
+ /* empty trie? */
+ if (leaf_pval(n) == NULL) {
+ return (ISC_R_NOTFOUND);
+ }
+
+ found_keylen = leaf_qpkey(qp, n, found_key);
+ if (qpkey_compare(search_key, search_keylen, found_key, found_keylen) !=
+ QPKEY_EQUAL)
+ {
+ return (ISC_R_NOTFOUND);
+ }
+
+ *pval_r = leaf_pval(n);
+ *ival_r = leaf_ival(n);
+ return (ISC_R_SUCCESS);
+}
+
+isc_result_t
+dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r,
+ uint32_t *ival_r) {
+ dns_qpkey_t key;
+ size_t keylen = dns_qpkey_fromname(key, name);
+ return (dns_qp_getkey(qpr, key, keylen, pval_r, ival_r));
+}
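+
+/*
+ * Usage sketch for a lookup inside a light read-only transaction (not
+ * part of the implementation; `multi` and `name` are hypothetical
+ * caller variables, and passing a `dns_qpread_t *` where a
+ * `dns_qpreadable_t` is expected is an assumption about a type that
+ * is declared elsewhere):
+ *
+ *   dns_qpread_t *qpr = NULL;
+ *   void *pval = NULL;
+ *   uint32_t ival = 0;
+ *   dns_qpmulti_query(multi, &qpr);
+ *   if (dns_qp_getname(qpr, name, &pval, &ival) == ISC_R_SUCCESS) {
+ *           (use pval and ival here, inside the transaction)
+ *   }
+ *   dns_qpread_destroy(multi, &qpr);
+ */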
+
+/**********************************************************************/
--- /dev/null
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+/*
+ * For an overview, see doc/design/qp-trie.md
+ */
+
+#pragma once
+
+/***********************************************************************
+ *
+ * interior node basics
+ */
+
+/*
+ * A qp-trie node can be a leaf or a branch. It consists of three 32-bit
+ * words into which the components are packed. They are used as a 64-bit
+ * word and a 32-bit word, but they are not declared like that to avoid
+ * unwanted padding, keeping the size down to 12 bytes. They are in native
+ * endian order so getting the 64-bit part should compile down to an
+ * unaligned load.
+ *
+ * In a branch the 64-bit word is described by the enum below. The 32-bit
+ * word is a reference to the packed sparse vector of "twigs", i.e. child
+ * nodes. A branch node has at least 2 and fewer than SHIFT_OFFSET twigs
+ * (see the enum below). The qp-trie update functions ensure that branches
+ * actually branch, i.e. branches cannot have only 1 child.
+ *
+ * The contents of each leaf are set by the trie's user. The 64-bit word
+ * contains a pointer value (which must be word-aligned), and the 32-bit
+ * word is an arbitrary integer value.
+ */
+typedef struct qp_node {
+#if WORDS_BIGENDIAN
+ uint32_t bighi, biglo, small;
+#else
+ uint32_t biglo, bighi, small;
+#endif
+} qp_node_t;
+
+/*
+ * A branch node contains a 64-bit word comprising the branch/leaf tag,
+ * the bitmap, and an offset into the key. It is called an "index word"
+ * because it describes how to access the twigs vector (think "database
+ * index"). The following enum sets up the bit positions of these parts.
+ *
+ * In a leaf, the same 64-bit word contains a pointer. The pointer
+ * must be word-aligned so that the branch/leaf tag bit is zero.
+ * This requirement is checked by the make_leaf() constructor.
+ *
+ * The bitmap is just above the tag bit. The `bits_for_byte[]` table is
+ * used to fill in a key so that bit tests can work directly against the
+ * index word without superfluous masking or shifting; we don't need to
+ * mask out the bitmap before testing a bit, but we do need to mask the
+ * bitmap before calling popcount.
+ *
+ * The byte offset into the key is at the top of the word, so that it
+ * can be extracted with just a shift, with no masking needed.
+ *
+ * The names are SHIFT_thing because they are qp_shift_t values. (See
+ * below for the various `qp_*` type declarations.)
+ *
+ * These values are relatively fixed in practice; the symbolic names
+ * avoid mystery numbers in the code.
+ */
+enum {
+ SHIFT_BRANCH = 0, /* branch / leaf tag */
+ SHIFT_NOBYTE, /* label separator has no byte value */
+ SHIFT_BITMAP, /* many bits here */
+ SHIFT_OFFSET = 48, /* offset of byte in key */
+};
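+
+/*
+ * With the enum above, SHIFT_BRANCH == 0, SHIFT_NOBYTE == 1,
+ * SHIFT_BITMAP == 2, and SHIFT_OFFSET == 48. As an illustration
+ * (`index` is a hypothetical branch index word and `bit` a hypothetical
+ * qp_shift_t; these expressions are a sketch, not the accessor
+ * functions defined elsewhere):
+ *
+ *   index & (1ULL << SHIFT_BRANCH)     nonzero for a branch
+ *   index & (1ULL << bit)              tests for a twig, no masking
+ *   (size_t)(index >> SHIFT_OFFSET)    the key offset, no masking
+ */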
+
+/*
+ * Value of the node type tag bit.
+ *
+ * It is defined this way to be explicit about where the value comes
+ * from, even though we know it is always the bottom bit.
+ */
+#define BRANCH_TAG (1ULL << SHIFT_BRANCH)
+
+/***********************************************************************
+ *
+ * garbage collector tuning parameters
+ */
+
+/*
+ * A "cell" is a location that can contain a `qp_node_t`, and a "chunk"
+ * is a moderately large array of cells. A big trie can occupy
+ * multiple chunks. (Unlike other nodes, a trie's root node lives in
+ * its `struct dns_qp` instead of being allocated in a cell.)
+ *
+ * The qp-trie allocator hands out space for twigs vectors. Allocations are
+ * made sequentially from one of the chunks; this kind of "sequential
+ * allocator" is also known as a "bump allocator", so in `struct dns_qp`
+ * (see below) the allocation chunk is called `bump`.
+ */
+
+/*
+ * Number of cells in a chunk is a power of 2, which must have space for
+ * a full twigs vector (48 wide). When testing, use a much smaller chunk
+ * size to make the allocator work harder.
+ */
+#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
+#define QP_CHUNK_LOG 7
+#else
+#define QP_CHUNK_LOG 10
+#endif
+
+STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20,
+ "qp-trie chunk size is unreasonable");
+
+#define QP_CHUNK_SIZE (1U << QP_CHUNK_LOG)
+#define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t))
+
+/*
+ * A chunk needs to be compacted if it has fragmented this much.
+ * (1/8 of a chunk, i.e. 12.5% overhead, seems reasonable)
+ */
+#define QP_MAX_FREE (QP_CHUNK_SIZE / 8)
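+
+/*
+ * Worked example (illustrative, assuming the usual layout in which
+ * sizeof(qp_node_t) is 12, i.e. three uint32_t fields with no padding):
+ * with QP_CHUNK_LOG 10, QP_CHUNK_SIZE is 1024 cells, QP_CHUNK_BYTES is
+ * 1024 * 12 = 12288 bytes (12 KiB), and QP_MAX_FREE is 1024 / 8 = 128
+ * cells. In a fuzzing build with QP_CHUNK_LOG 7, the same figures are
+ * 128 cells, 1536 bytes, and 16 cells.
+ */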
+
+/*
+ * Compact automatically when we pass this threshold: when there is a lot
+ * of free space in absolute terms, and when we have freed more than half
+ * of the space we allocated.
+ *
+ * The current compaction algorithm scans the whole trie, so it is important
+ * to scale the threshold based on the size of the trie to avoid quadratic
+ * behaviour. XXXFANF find an algorithm that scans less of the trie!
+ *
+ * During a modification transaction, when we copy-on-write some twigs we
+ * count the old copies as "free", because they will be when the transaction
+ * commits. But they cannot be recovered immediately, so they are also
+ * counted as on hold, and discounted when we decide whether to compact.
+ */
+#define QP_MAX_GARBAGE(qp) \
+ (((qp)->free_count - (qp)->hold_count) > QP_CHUNK_SIZE * 4 && \
+ ((qp)->free_count - (qp)->hold_count) > (qp)->used_count / 2)
+
+/*
+ * The chunk base and usage arrays are resized geometrically and start off
+ * with two entries.
+ */
+#define GROWTH_FACTOR(size) ((size) + (size) / 2 + 2)
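+
+/*
+ * Worked example (assuming the first resize starts from size 0):
+ * repeated resizing gives sizes of 2, 5, 9, 15, 24, ... because
+ * GROWTH_FACTOR(0) is 2, GROWTH_FACTOR(2) is 2 + 1 + 2 = 5,
+ * GROWTH_FACTOR(5) is 5 + 2 + 2 = 9, and so on (the integer division
+ * rounds down).
+ */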
+
+/***********************************************************************
+ *
+ * helper types
+ */
+
+/*
+ * C is not strict enough with its integer types for these typedefs to
+ * improve type safety, but it helps to have annotations saying what
+ * particular kind of number we are dealing with.
+ */
+
+/*
+ * The number or position of a bit inside a word. (0..63)
+ *
+ * Note: A dns_qpkey_t is logically an array of qp_shift_t values, but it
+ * isn't declared that way because dns_qpkey_t is a public type whereas
+ * qp_shift_t is private.
+ */
+typedef uint8_t qp_shift_t;
+
+/*
+ * The number of bits set in a word (as in Hamming weight or popcount)
+ * which is used for the position of a node in the packed sparse
+ * vector of twigs. (0..47) because our bitmap does not fill the word.
+ */
+typedef uint8_t qp_weight_t;
+
+/*
+ * A chunk number, i.e. an index into the chunk arrays.
+ */
+typedef uint32_t qp_chunk_t;
+
+/*
+ * Cell offset within a chunk, or a count of cells. Each cell in a
+ * chunk can contain a node.
+ */
+typedef uint32_t qp_cell_t;
+
+/*
+ * A twig reference is used to refer to a twigs vector, which occupies a
+ * contiguous group of cells.
+ */
+typedef uint32_t qp_ref_t;
+
+/*
+ * Constructors and accessors for qp_ref_t values, defined here to show
+ * how the qp_ref_t, qp_chunk_t, qp_cell_t types relate to each other
+ */
+
+static inline qp_ref_t
+make_ref(qp_chunk_t chunk, qp_cell_t cell) {
+ return (QP_CHUNK_SIZE * chunk + cell);
+}
+
+static inline qp_chunk_t
+ref_chunk(qp_ref_t ref) {
+ return (ref / QP_CHUNK_SIZE);
+}
+
+static inline qp_cell_t
+ref_cell(qp_ref_t ref) {
+ return (ref % QP_CHUNK_SIZE);
+}
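+
+/*
+ * Worked example (illustrative, assuming QP_CHUNK_SIZE is 1024): the
+ * twigs at cell 5 of chunk 3 have the reference make_ref(3, 5), which is
+ * 1024 * 3 + 5 = 3077. Going the other way, ref_chunk(3077) is
+ * 3077 / 1024 = 3 and ref_cell(3077) is 3077 % 1024 = 5.
+ */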
+
+/***********************************************************************
+ *
+ * main qp-trie structures
+ */
+
+#define QP_MAGIC ISC_MAGIC('t', 'r', 'i', 'e')
+#define VALID_QP(qp) ISC_MAGIC_VALID(qp, QP_MAGIC)
+
+/*
+ * This is annoying: C doesn't allow us to use a predeclared structure as
+ * an anonymous struct member, so we have to fart around. The feature we
+ * want is available in GCC and Clang with -fms-extensions, but a
+ * non-standard extension won't make these declarations neater if we must
+ * also have a standard alternative.
+ */
+
+/*
+ * Lightweight read-only access to a qp-trie.
+ *
+ * Just the fields needed for the hot path. The `base` field points
+ * to an array containing pointers to the base of each chunk like
+ * `qp->base[chunk]` - see `ref_ptr()` below.
+ *
+ * A `dns_qpread_t` has a lifetime that does not extend across multiple
+ * write transactions, so it can share a chunk `base` array belonging to
+ * the `dns_qpmulti_t` it came from.
+ *
+ * We're lucky with the layout on 64 bit systems: this is only 40 bytes,
+ * with no padding.
+ */
+#define DNS_QPREAD_COMMON \
+ uint32_t magic; \
+ qp_node_t root; \
+ qp_node_t **base; \
+ void *ctx; \
+ const dns_qpmethods_t *methods
+
+struct dns_qpread {
+ DNS_QPREAD_COMMON;
+};
+
+/*
+ * Heavyweight read-only snapshots of a qp-trie.
+ *
+ * Unlike a lightweight `dns_qpread_t`, a snapshot can survive across
+ * multiple write transactions, any of which may need to expand the
+ * chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the
+ * array, which will always be equal to some prefix of the expanded
+ * arrays in the `dns_qpmulti_t` that it came from.
+ *
+ * The `dns_qpmulti_t` keeps a refcount of its snapshots, and while
+ * the refcount is non-zero, chunks are not freed or reused. When a
+ * `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero,
+ * it can do any deferred cleanup.
+ *
+ * The generation number is used for tracing.
+ */
+struct dns_qpsnap {
+ DNS_QPREAD_COMMON;
+ uint32_t generation;
+ dns_qpmulti_t *whence;
+ qp_node_t *base_array[];
+};
+
+/*
+ * Read-write access to a qp-trie requires extra fields to support the
+ * allocator and garbage collector.
+ *
+ * The chunk `base` and `usage` arrays are separate because the `usage`
+ * array is only needed for allocation, so it is kept separate from the
+ * data needed by the read-only hot path. The arrays have empty slots where
+ * new chunks can be placed, so `chunk_max` is the maximum number of chunks
+ * (until the arrays are resized).
+ *
+ * Bare instances of a `struct dns_qp` are used for stand-alone
+ * single-threaded tries. For multithreaded access, transactions alternate
+ * between the `phase` pair of dns_qp objects inside a dns_qpmulti.
+ *
+ * For multithreaded access, the `generation` counter allows us to know
+ * which chunks are writable or not: writable chunks were allocated in the
+ * current generation. For single-threaded access, the generation counter
+ * is always zero, so all chunks are considered to be writable.
+ *
+ * Allocations are made sequentially in the `bump` chunk. Lightweight write
+ * transactions can re-use the `bump` chunk, so its prefix before `fender`
+ * is immutable, and the rest is mutable even though its generation number
+ * does not match the current generation.
+ *
+ * To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines
+ * the values of `used_count`, `free_count`, and `hold_count`. The
+ * `hold_count` tracks nodes that need to be retained while readers are
+ * using them; they are free but cannot be reclaimed until the transaction
+ * has committed, so the `hold_count` is discounted from QP_MAX_GARBAGE()
+ * during a transaction.
+ *
+ * There are some flags that alter the behaviour of write transactions.
+ *
+ * - The `transaction_mode` indicates whether the current transaction is a
+ * light write or a heavy update, or (between transactions) the previous
+ * transaction's mode, because the setup for the next transaction
+ * depends on how the previous one committed. The mode is set at the
+ * start of each transaction. It is QP_NONE in a single-threaded qp-trie
+ * to detect if part of a `dns_qpmulti_t` is passed to dns_qp_destroy().
+ *
+ * - The `compact_all` flag is used when every node in the trie should be
+ * copied. (Usually compaction aims to avoid moving nodes out of
+ * unfragmented chunks.) It is used when compaction is explicitly
+ * requested via `dns_qp_compact()`, and as an emergency mechanism if
+ * normal compaction failed to clear the QP_MAX_GARBAGE() condition.
+ * (This emergency is a bug even though we have a rescue mechanism.)
+ *
+ * - The `shared_arrays` flag indicates that the chunk `base` and `usage`
+ * arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`.
+ * This allows us to delay allocating copies of the arrays during a
+ * write transaction, until we definitely need to resize them.
+ *
+ * - When built with fuzzing support, we can use mprotect() and munmap()
+ * to ensure that incorrect memory accesses cause fatal errors. The
+ * `write_protect` flag must be set straight after the `dns_qpmulti_t`
+ * is created, then left unchanged.
+ *
+ * Some of the dns_qp_t fields are only used for multithreaded transactions
+ * (marked [MT] below) but the same code paths are also used for single-
+ * threaded writes. To reduce the size of a dns_qp_t, these fields could
+ * perhaps be moved into the dns_qpmulti_t, but that would require some kind
+ * of conditional runtime downcast from dns_qp_t to dns_qpmulti_t, which is
+ * likely to be ugly. It is probably best to keep things simple if most tries
+ * need multithreaded access (XXXFANF do they? e.g. when there are many auth
+ * zones).
+ */
+struct dns_qp {
+ DNS_QPREAD_COMMON;
+ isc_mem_t *mctx;
+ /*% array of per-chunk allocation counters */
+ struct {
+ /*% the allocation point, increases monotonically */
+ qp_cell_t used;
+ /*% count of nodes no longer needed, also monotonic */
+ qp_cell_t free;
+ /*% when was this chunk allocated? */
+ uint32_t generation;
+ } *usage;
+ /*% transaction counter [MT] */
+ uint32_t generation;
+ /*% number of slots in `base` and `usage` arrays */
+ qp_chunk_t chunk_max;
+ /*% which chunk is used for allocations */
+ qp_chunk_t bump;
+ /*% twigs in the `bump` chunk below `fender` are read only [MT] */
+ qp_cell_t fender;
+ /*% number of leaf nodes */
+ qp_cell_t leaf_count;
+ /*% total of all usage[] counters */
+ qp_cell_t used_count, free_count;
+ /*% cells that cannot be recovered right now */
+ qp_cell_t hold_count;
+ /*% what kind of transaction was most recently started [MT] */
+ enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2;
+ /*% compact the entire trie [MT] */
+ bool compact_all : 1;
+ /*% chunk arrays are shared with a readonly qp-trie [MT] */
+ bool shared_arrays : 1;
+ /*% optionally when compiled with fuzzing support [MT] */
+ bool write_protect : 1;
+};
+
+/*
+ * Concurrent access to a qp-trie.
+ *
+ * The `read` pointer is used for read queries. It points to one of the
+ * `phase` elements. During a transaction, the other `phase` (see
+ * `write_phase()` below) is modified incrementally in copy-on-write
+ * style. On commit the `read` pointer is swapped to the altered phase.
+ */
+struct dns_qpmulti {
+ uint32_t magic;
+ /*% controls access to the `read` pointer and its target phase */
+ isc_rwlock_t rwlock;
+ /*% points to phase[r] and swaps on commit */
+ dns_qp_t *read;
+ /*% protects the snapshot counter and `write_phase()` */
+ isc_mutex_t mutex;
+ /*% so we know when old chunks are still shared */
+ unsigned int snapshots;
+ /*% one is read-only, one is mutable */
+ dns_qp_t phase[2];
+};
+
+/*
+ * Get a pointer to the phase that isn't read-only.
+ */
+static inline dns_qp_t *
+write_phase(dns_qpmulti_t *multi) {
+ bool read0 = multi->read == &multi->phase[0];
+ return (read0 ? &multi->phase[1] : &multi->phase[0]);
+}
+
+#define QPMULTI_MAGIC ISC_MAGIC('q', 'p', 'm', 'v')
+#define VALID_QPMULTI(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)
+
+/***********************************************************************
+ *
+ * interior node constructors and accessors
+ */
+
+/*
+ * See the comments under "interior node basics" above, which explain the
+ * layout of nodes as implemented by the following functions.
+ */
+
+/*
+ * Get the 64-bit word of a node.
+ */
+static inline uint64_t
+node64(qp_node_t *n) {
+ uint64_t lo = n->biglo;
+ uint64_t hi = n->bighi;
+ return (lo | (hi << 32));
+}
+
+/*
+ * Get the 32-bit word of a node.
+ */
+static inline uint32_t
+node32(qp_node_t *n) {
+ return (n->small);
+}
+
+/*
+ * Create a node from its parts
+ */
+static inline qp_node_t
+make_node(uint64_t big, uint32_t small) {
+ return ((qp_node_t){
+ .biglo = (uint32_t)(big),
+ .bighi = (uint32_t)(big >> 32),
+ .small = small,
+ });
+}
+
+/*
+ * Test a node's tag bit.
+ */
+static inline bool
+is_branch(qp_node_t *n) {
+ return (n->biglo & BRANCH_TAG);
+}
+
+/* leaf nodes *********************************************************/
+
+/*
+ * Get a leaf's pointer value. The double cast is to avoid a warning
+ * about mismatched pointer/integer sizes on 32 bit systems.
+ */
+static inline void *
+leaf_pval(qp_node_t *n) {
+ return ((void *)(uintptr_t)node64(n));
+}
+
+/*
+ * Get a leaf's integer value
+ */
+static inline uint32_t
+leaf_ival(qp_node_t *n) {
+ return (node32(n));
+}
+
+/*
+ * Create a leaf node from its parts
+ */
+static inline qp_node_t
+make_leaf(const void *pval, uint32_t ival) {
+ qp_node_t leaf = make_node((uintptr_t)pval, ival);
+ REQUIRE(!is_branch(&leaf) && pval != NULL);
+ return (leaf);
+}
+
+/* branch nodes *******************************************************/
+
+/*
+ * The following function names use plural `twigs` when they work on a
+ * branch's twigs vector as a whole, and singular `twig` when they work on
+ * a particular twig.
+ */
+
+/*
+ * Get a branch node's index word
+ */
+static inline uint64_t
+branch_index(qp_node_t *n) {
+ return (node64(n));
+}
+
+/*
+ * Get a reference to a branch node's child twigs.
+ */
+static inline qp_ref_t
+branch_twigs_ref(qp_node_t *n) {
+ return (node32(n));
+}
+
+/*
+ * Bit positions in the bitmap come directly from the key. DNS names are
+ * converted to keys using the tables declared at the end of this file.
+ */
+static inline qp_shift_t
+qpkey_bit(const dns_qpkey_t key, size_t len, size_t offset) {
+ if (offset < len) {
+ return (key[offset]);
+ } else {
+ return (SHIFT_NOBYTE);
+ }
+}
+
+/*
+ * Extract a branch node's offset field, used to index the key.
+ */
+static inline size_t
+branch_key_offset(qp_node_t *n) {
+ return ((size_t)(branch_index(n) >> SHIFT_OFFSET));
+}
+
+/*
+ * Which bit identifies the twig of this node for this key?
+ */
+static inline qp_shift_t
+branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) {
+ return (qpkey_bit(key, len, branch_key_offset(n)));
+}
+
+/*
+ * Convert a twig reference into a pointer.
+ */
+static inline qp_node_t *
+ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ return (qp->base[ref_chunk(ref)] + ref_cell(ref));
+}
+
+/*
+ * Get a pointer to a branch node's twigs vector.
+ */
+static inline qp_node_t *
+branch_twigs_vector(dns_qpreadable_t qpr, qp_node_t *n) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ return (ref_ptr(qp, branch_twigs_ref(n)));
+}
+
+/*
+ * Warm up the cache while calculating which twig we want.
+ */
+static inline void
+prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
+ __builtin_prefetch(branch_twigs_vector(qpr, n));
+}
+
+/***********************************************************************
+ *
+ * bitmap popcount shenanigans
+ */
+
+/*
+ * How many twigs appear in the vector before the one corresponding to the
+ * given bit? Calculated using popcount of part of the branch's bitmap.
+ *
+ * To calculate a mask that covers the lesser bits in the bitmap, we
+ * subtract 1 to set the bits, and subtract the branch tag because it
+ * is not part of the bitmap.
+ */
+static inline qp_weight_t
+branch_twigs_before(qp_node_t *n, qp_shift_t bit) {
+ uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG;
+ uint64_t bmp = branch_index(n) & mask;
+ return ((qp_weight_t)__builtin_popcountll(bmp));
+}
+
+/*
+ * How many twigs does this node have?
+ *
+ * The offset field is directly above the bitmap, so the bits below
+ * SHIFT_OFFSET cover the whole bitmap, and the bitmap's weight is the
+ * number of twigs.
+ */
+static inline qp_weight_t
+branch_twigs_size(qp_node_t *n) {
+ return (branch_twigs_before(n, SHIFT_OFFSET));
+}
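+
+/*
+ * Worked example (illustrative bit positions): suppose a branch's index
+ * word has the tag bit 0 set, plus bitmap bits 2, 4, and 7. Then
+ * branch_twigs_before(n, 7) uses the mask (1 << 7) - 1 - 1 = 0x7e, which
+ * covers bits 1..6; bits 2 and 4 are set under the mask, so the result
+ * is 2 and the twig for bit 7 is the third element of the vector.
+ * Likewise branch_twigs_size(n) uses a mask covering bits 1..47 and
+ * returns 3.
+ */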
+
+/*
+ * Position of a twig within the packed sparse vector.
+ */
+static inline qp_weight_t
+branch_twig_pos(qp_node_t *n, qp_shift_t bit) {
+ return (branch_twigs_before(n, bit));
+}
+
+/*
+ * Get a pointer to a particular twig.
+ */
+static inline qp_node_t *
+branch_twig_ptr(dns_qpreadable_t qpr, qp_node_t *n, qp_shift_t bit) {
+ return (branch_twigs_vector(qpr, n) + branch_twig_pos(n, bit));
+}
+
+/*
+ * Is the twig identified by this bit present?
+ */
+static inline bool
+branch_has_twig(qp_node_t *n, qp_shift_t bit) {
+ return (branch_index(n) & (1ULL << bit));
+}
+
+/* twig logistics *****************************************************/
+
+static inline void
+move_twigs(qp_node_t *to, qp_node_t *from, qp_weight_t size) {
+ memmove(to, from, size * sizeof(qp_node_t));
+}
+
+static inline void
+zero_twigs(qp_node_t *twigs, qp_weight_t size) {
+ memset(twigs, 0, size * sizeof(qp_node_t));
+}
+
+/***********************************************************************
+ *
+ * method invocation helpers
+ */
+
+static inline void
+attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ qp->methods->attach(qp->ctx, leaf_pval(n), leaf_ival(n));
+}
+
+static inline void
+detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ qp->methods->detach(qp->ctx, leaf_pval(n), leaf_ival(n));
+}
+
+static inline size_t
+leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ return (qp->methods->makekey(key, qp->ctx, leaf_pval(n), leaf_ival(n)));
+}
+
+static inline char *
+triename(dns_qpreadable_t qpr, char *buf, size_t size) {
+ dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+ qp->methods->triename(qp->ctx, buf, size);
+ return (buf);
+}
+
+#define TRIENAME(qp) \
+ triename(qp, (char[DNS_QP_TRIENAME_MAX]){}, DNS_QP_TRIENAME_MAX)
+
+/***********************************************************************
+ *
+ * converting DNS names to trie keys
+ */
+
+/*
+ * This is a deliberate simplification of the hostname characters,
+ * because it doesn't matter much if we treat a few extra characters
+ * favourably: there is plenty of space in the index word for a
+ * slightly larger bitmap.
+ */
+static inline bool
+qp_common_character(uint8_t byte) {
+ return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
+}
+
+/*
+ * Lookup table mapping bytes in DNS names to bit positions, used
+ * by dns_qpkey_fromname() to convert DNS names to qp-trie keys.
+ */
+extern uint16_t dns_qp_bits_for_byte[];
+
+/*
+ * And the reverse, mapping bit positions to characters, so the tests
+ * can print diagnostics involving qp-trie keys.
+ */
+extern uint8_t dns_qp_byte_for_bit[];
+
+/**********************************************************************/