From: Vicky Risk
Date: Wed, 22 Aug 2018 23:34:02 +0000 (-0400)
Subject: Delete 01-scaling-across-cores
X-Git-Tag: gitlab20_base~13
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=ca90c1899a66a30e061f67621b24750ffb17738d;p=thirdparty%2Fkea.git

Delete 01-scaling-across-cores
---

diff --git a/doc/design/resolver/01-scaling-across-cores b/doc/design/resolver/01-scaling-across-cores
deleted file mode 100644
index 56b8dcf605..0000000000
--- a/doc/design/resolver/01-scaling-across-cores
+++ /dev/null
@@ -1,347 +0,0 @@

Scaling across (many) cores
===========================

Problem statement
-----------------

The general issue is how to ensure that the resolver scales.

Resolvers are currently CPU bound, and it seems likely that neither
instructions-per-cycle nor CPU frequency will increase radically, so scaling
will need to happen across multiple cores.

How can we best scale a recursive resolver across multiple cores?

What resolution looks like
--------------------------

   Receive the query. @#   <----------------------------------------------\
          |                                                               |
          |                                                               |
          v                                                               |
   Parse it, etc. $                                                       |
          |                                                               |
          |                                                               |
          v                                                               |
   Look into the cache. $#                                                |
   Cry <---- No <--------- Is it there? -----------> Yes -----\           |
    |                           ^                             |           |
   Prepare upstream query $     |                             |           |
    |                           |                             |           |
    v                           |                             |           |
   Send an upstream query (#)   |                             |           |
    |                           |                             |           |
    |                           |                             |           |
    v                           |                             |           |
   Wait for answer @(#)         |                             |           |
    |                           |                             |           |
    v                           |                             |           |
   Parse $                      |                             |           |
    |                           |                             |           |
    v                           |                             |           |
   Is it enough? $ ----> No ----/                             |           |
    |                                                         |           |
   Yes                                                        |           |
    |                                                         |           |
    \---------------------> Build answer $ <------------------/           |
                                   |                                      |
                                   |                                      |
                                   v                                      |
                             Send answer # -------------------------------/

This is a simplified version, however. There may be other tasks (validation,
for example) which are not drawn, mostly for simplicity, as they don't add
new kinds of problems. Validation would be done as part of some computational
task and it could do more lookups in the cache or more upstream queries.

Also, multiple queries may generate the same upstream query, so such queries
should be aggregated together somehow.

Legend
~~~~~~
 * $ - CPU intensive
 * @ - Waiting for an external event
 * # - Possible interaction with other tasks

Goals
-----
 * Run the CPU-intensive tasks in multiple threads to allow concurrency.
 * Minimize waiting for locks.
 * Don't require too much memory.
 * Minimize the number of upstream queries (both because they are slow and
   expensive and because we don't want to eat too much bandwidth or spam the
   authoritative servers).
 * A design simple enough that it can be implemented.

Naïve version
-------------

Let's look at possible approaches and list their pros and cons. Many of the
simple versions would not really work, but let's have a look at them anyway,
because thinking about them might suggest solutions for the real versions.

We take one query and handle it fully, with blocking waits for the answers.
Only after this is done do we take another. The cache is private to each
process (a sketch of this model follows the lists below).

Advantages:

 * Very simple.
 * No locks.

Disadvantages:

 * To scale across cores, we need to run *a lot* of processes, since they'd
   be waiting for something most of the time. That means a lot of memory
   eaten, because each one has its own cache. Also, running so many processes
   may be problematic; processes are not very cheap.
 * Many things would be asked multiple times, because the caches are not
   shared.
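The naive model can be sketched in a few lines of C++. All names here
(`PrivateCache`, `resolve_blocking`, `ask_upstream`) are hypothetical and only
illustrate the one-query-at-a-time structure, not any existing code:

[source,c++]
----
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins; the real resolver works with wire-format messages.
struct Query { std::string name; };
struct Answer { std::string data; };

// Each process owns its cache privately, so no locking is needed,
// but nothing is shared between processes either.
class PrivateCache {
public:
    std::optional<Answer> lookup(const std::string& name) const {
        auto it = entries_.find(name);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }
    void store(const std::string& name, const Answer& answer) {
        entries_[name] = answer;
    }
private:
    std::map<std::string, Answer> entries_;
};

// One query at a time, blocking on the upstream answer.  The process does
// nothing useful while it waits, which is why so many processes would be
// needed to keep all the cores busy.
Answer resolve_blocking(PrivateCache& cache, const Query& query,
                        Answer (*ask_upstream)(const Query&)) {
    if (auto cached = cache.lookup(query.name)) {
        return *cached;                   // cache hit: answer directly
    }
    Answer answer = ask_upstream(query);  // blocks until upstream replies
    cache.store(query.name, answer);      // remember it for next time
    return answer;
}
----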
Threads
~~~~~~~

Some of the problems could be solved by using threads, but they would not
improve things much, since threads are not really cheap either (starting
several hundred threads is unlikely to be a good idea).

Also, threads bring other problems. Even while we still assume separate
caches (for caches, see below), we need to ensure safe access to logging,
configuration, the network, etc. These could become bottlenecks (e.g. if we
lock every time we read a packet from the network and there are many threads,
they will just fight over the lock).

Supercache
~~~~~~~~~~

The problem with the cache could be solved by placing a ``supercache''
between the resolvers and the Internet. That one would do almost no
processing: it would just take the query, look it up in the cache and either
answer from the cache or forward the query to the external world. It would
store the answer and pass it back.

The cache, if single-threaded, could be a bottleneck. To solve that, there
are several possible approaches:

Layered cache::
  Each process has its own small cache, which catches many queries. Then a
  group of processes shares another level of bigger cache, which catches most
  of the queries that get past the private caches. We group them further, and
  each level handles fewer queries from each process, so the levels can keep
  up. However, with each level we add the overhead of another lookup.
Segmented cache::
  We have several caches of the same level, in parallel. When we ask a cache,
  we hash the query and decide which cache to ask by the hash. Only that
  cache would hold the answer, if any, and each could run in a separate
  process. The only problem is, could there be a pattern of queries skewed so
  that only one cache is used while the rest sit idle?
Shared cache access::
  A cache would be accessed by multiple processes/threads. See below for
  details, but there's a risk of lock contention on the cache (it depends on
  the data structure).

Upstream queries
~~~~~~~~~~~~~~~~

Before sending an upstream query, we look into the cache to make sure we
don't already have the information. When we get the answer, we want to update
the cache.

This suggests the upstream queries are tightly coupled with the cache. Now,
when we have several cache processes/threads, each can have its own set of
open sockets, not shared with the other caches, to do the lookups. This way
we can avoid locking the upstream network communication.

Also, we can have three conceptual states for data in the cache, and act
differently when it is requested:

Present::
  If it is available, in either positive or negative form, we just provide
  the answer right away.
Not present::
  The continuation of processing is queued somehow (blocked/callback is
  stored/whatever). An upstream query is sent and we move to the next state.
Waiting for answer::
  If another query for the same thing arrives, we queue it the same way and
  keep waiting. When the answer comes, all the queued tasks are resumed. If
  the TTL > 0, we store the answer and set the state to ``present''.

We want to aggregate upstream queries anyway, and using the cache for it
saves some more processing and possibly some locking.
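A minimal sketch of such a three-state cache entry, with the waiter list doing
the aggregation of identical upstream queries. The names and types are
hypothetical, and all locking is deliberately left out, since where the
synchronization lives is exactly what the rest of this document discusses:

[source,c++]
----
#include <functional>
#include <string>
#include <vector>

// Hypothetical continuation type: whatever resumes a blocked query
// (a callback, a coroutine handle, ...).
typedef std::function<void(const std::string& answer)> Waiter;

struct CacheEntry {
    enum State { NOT_PRESENT, WAITING_FOR_ANSWER, PRESENT } state;
    std::string answer;           // valid only when state == PRESENT
    std::vector<Waiter> waiters;  // queries queued while WAITING_FOR_ANSWER

    CacheEntry() : state(NOT_PRESENT) {}
};

// Called when a query needs this record.  Returns true if the caller must
// actually send the upstream query; otherwise it was either answered from
// the cache or queued behind an already pending upstream query.
bool request(CacheEntry& entry, const Waiter& resume) {
    switch (entry.state) {
    case CacheEntry::PRESENT:             // positive or negative answer known
        resume(entry.answer);
        return false;
    case CacheEntry::WAITING_FOR_ANSWER:  // someone already asked upstream
        entry.waiters.push_back(resume);  // aggregate with the pending query
        return false;
    case CacheEntry::NOT_PRESENT:
    default:
        entry.state = CacheEntry::WAITING_FOR_ANSWER;
        entry.waiters.push_back(resume);
        return true;                      // caller sends the upstream query
    }
}

// Called when the upstream answer arrives: wake everyone who queued up.
// "cacheable" stands in for the TTL > 0 condition above.
void answer_arrived(CacheEntry& entry, const std::string& answer,
                    bool cacheable) {
    entry.answer = answer;
    entry.state = cacheable ? CacheEntry::PRESENT : CacheEntry::NOT_PRESENT;
    std::vector<Waiter> waiting;
    waiting.swap(entry.waiters);
    for (size_t i = 0; i < waiting.size(); ++i) {
        waiting[i](answer);
    }
}
----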
Multiple parallel queries
-------------------------

It seems obvious we can't afford to have a thread or process for each
outstanding query. We need to handle multiple queries in each one at any
given time.

Coroutines
~~~~~~~~~~

The OS-level threads might be too expensive, but coroutines might be cheap
enough. That way, we could still write code that is easy to read, but limit
the number of OS threads to a reasonable number.

In this model, when a query comes, a new coroutine/user-level thread is
created for it. We use special reads and writes whenever there's an operation
that could block. These reads and writes would internally schedule the
operation and switch to another coroutine (if there's any ready to be
executed).

Each thread/process maintains its own set of coroutines and they do not
migrate. This way, the queue of coroutines is kept lock-less, as are any
private caches. Only the shared caches are protected by a lock.

[NOTE]
The `coro` unit we have in the current code is *not* considered a coroutine
library here. We would need a coroutine library where each coroutine has a
real stack and we switch stacks on a coroutine switch. That is possible with
a reasonable amount of dark magic (see `ucontext.h`, for example, but there
are surely higher-level libraries for that).

There is some trouble with multiple coroutines waiting on the same event,
such as the same upstream query (possibly even coroutines from different
threads), but it should be possible to solve.

Event-based
~~~~~~~~~~~

We use events (`asio` and the like) to write it. Each outstanding query is an
object with some callbacks on it. When we would do a possibly blocking
operation, we schedule a callback to happen once the operation finishes.

This is more lightweight than the coroutines (the query objects will be
smaller than the stacks for coroutines), but it is harder to write and read.

[NOTE]
Do not consider cross-breeding the models. That leads to space-time
distortions and brain damage. Implementing one on top of the other is OK, but
mixing them in the same bit of code is the way to the madhouse.

Landlords and peasants
~~~~~~~~~~~~~~~~~~~~~~

In both the coroutine and event-based models, the cache and other shared
things are easiest to imagine as objects the working threads fight over to
hold for a short while. In this model, it is easier to imagine each such
shared object as something owned by a landlord who doesn't let anyone else
touch it, but to whom you can send requests.

A query is an object once again, with some kind of state machine.

Then there are two kinds of threads. The peasants are there just to do the
heavy work. There's a global work queue for peasants. Once a peasant is idle,
it comes to the queue and picks up a handful of queries from it. It does as
much on each query as possible without requiring any shared resource.

The other kind, the landlords, each have a resource to watch over. So we
would have a cache (or several parts of a cache), the sockets for accepting
queries and answering them, possibly more. Each of these would have a
separate landlord thread and a queue of tasks to perform on the resource
(look up something, send an answer, ...).

Similarly, a landlord would take a handful of tasks from its queue and start
handling them. It would possibly produce some more tasks for the peasants.

The point here is that all the synchronization is done on the queues, not on
the shared resources themselves. And we would append to the queues only once
a whole batch was completed. By tweaking the size of the batch, we could
balance the lock contention, throughput and RTT. The append/remove would be a
quick operation, and the cost of the locks would be amortized over the larger
number of queries handled per lock operation (see the sketch below).
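A minimal sketch of such a batch-oriented queue, assuming C++11 primitives
(`std::mutex`, `std::function` in place of `boost::function`); the names are
hypothetical and a real queue would also need to wake idle workers and handle
shutdown:

[source,c++]
----
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

// One queue per landlord (or the global peasant queue).  All
// synchronization happens here, never on the guarded resource itself.
class TaskQueue {
public:
    typedef std::function<void()> Task;

    // Append a whole batch in a single locked operation.
    void push_batch(std::vector<Task>& batch) {
        std::lock_guard<std::mutex> guard(lock_);
        for (size_t i = 0; i < batch.size(); ++i) {
            tasks_.push_back(batch[i]);
        }
        batch.clear();
    }

    // Take up to max_count tasks in a single locked operation.
    std::vector<Task> pop_batch(size_t max_count) {
        std::vector<Task> batch;
        std::lock_guard<std::mutex> guard(lock_);
        while (!tasks_.empty() && batch.size() < max_count) {
            batch.push_back(tasks_.front());
            tasks_.pop_front();
        }
        return batch;
    }

private:
    std::mutex lock_;
    std::deque<Task> tasks_;
};

// A worker (peasant or landlord) amortizes the two lock operations over the
// whole batch; tasks generated for other queues would be collected locally
// and pushed only after the batch is done.
void run_batch(TaskQueue& queue, size_t batch_size) {
    std::vector<TaskQueue::Task> batch = queue.pop_batch(batch_size);
    for (size_t i = 0; i < batch.size(); ++i) {
        batch[i]();
    }
}
----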
The possible downside is that a query needs to travel across several threads
during its lifetime. It might turn out that it is faster to move the query
between cores than to access the cache from several threads, since the query
is smaller, but it might be slower as well.

It would be critical to have some kind of queue that is fast to append to and
fast to take the first n items out of. Also, the tasks in the queues can be
just abstract `boost::function` functors, and each worker would simply
iterate through its queue, calling each functor. Each functor would take a
parameter allowing it to easily generate more tasks for other queues (they
would be stored privately first, and appended to the remote queues at the end
of the batch).

Also, if we wanted to generate multiple parallel upstream queries from a
single query, we would need to be careful. A query object would not have a
lock on itself and the upstream queries could end up in different
caches/threads. To protect the original query, we would add another landlord
that would aggregate the answers together and let the query continue
processing once it got enough of them. That way, all the answers would be
pushed to the same thread and nobody would fight over the query.

[NOTE]
This model would work only with threads, not processes.

Shared caches
-------------

While it seems good to have some sort of L1 cache with pre-rendered answers
(according to measurements in the #2777 ticket), we probably need some kind
of larger shared cache as well.

If we had just a single shared cache protected by a lock, there would be a
lot of contention on that lock.

Partitioning the cache
~~~~~~~~~~~~~~~~~~~~~~

We split the cache into parts, either by layers or by parallel chunks we
switch between by a hash. If we take it to the extreme, a lock on each hash
bucket would be this kind of partitioning, though that might waste resources
(how expensive is it to create a lock?).

Landlords
~~~~~~~~~

The landlords do the synchronization themselves. Still, the cache would need
to be partitioned.

RCU
~~~

RCU is a lock-less synchronization mechanism. An item is accessed through a
pointer. An updater creates a copy of the structure (in our case, it would be
the content of a single hash bucket) and then atomically replaces the
pointer. Readers from before have the old version, new ones get the new
version. When all the old readers die out, the old copy is reclaimed. Also,
the reclamation can AFAIK be postponed to a later, more idle, time or handed
to a different thread.

We could use it for the cache ‒ on the fast path, we would just read the
cache. On the slow one, we would have to wait in a queue to do the update, in
a single updater thread (because we don't really want to be updating the same
cell twice at the same time).

Proposals
---------

In either case, we would have some kind of L1 cache with pre-rendered
answers. For these proposals (except the third), we wouldn't care whether we
split the cache into parallel chunks or layers.

Hybrid RCU/Landlord
~~~~~~~~~~~~~~~~~~~

The landlord approach, except that read-only accesses to the cache are done
directly by the peasants. Only if they don't find what they want do they
append the task to the queue of the landlord. The landlord would be doing the
RCU updates. It could happen that by the time the landlord gets to the task
the answer is already there, but that would not matter much.

Network access would be done by the landlords.
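As an illustration of the copy-and-swap update the landlord would perform, the
following sketch fakes RCU with C++11 `shared_ptr` atomics (the reference
count stands in for waiting until the old readers die out); all names are
hypothetical and a real RCU implementation would reclaim memory differently:

[source,c++]
----
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One hash bucket of the shared cache.  Readers only ever follow the
// pointer; they never modify the copy they got.
struct Bucket {
    std::vector<std::pair<std::string, std::string> > records;
};

class RcuBucket {
public:
    RcuBucket() : current_(std::make_shared<Bucket>()) {}

    // Fast path: lock-less read by any peasant.  Holding the shared_ptr
    // keeps the old copy alive for as long as this reader needs it.
    std::shared_ptr<const Bucket> read() const {
        return std::atomic_load(&current_);
    }

    // Slow path: only the single updater thread (the landlord) calls this,
    // so updates never race with each other, only with readers.
    void insert(const std::string& name, const std::string& data) {
        std::shared_ptr<Bucket> old = std::atomic_load(&current_);
        std::shared_ptr<Bucket> fresh = std::make_shared<Bucket>(*old); // copy
        fresh->records.push_back(std::make_pair(name, data));           // update
        std::atomic_store(&current_, fresh);                            // publish
    }

private:
    std::shared_ptr<Bucket> current_;
};
----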
Coroutines+RCU
~~~~~~~~~~~~~~

We would use the coroutines, and reads from the shared cache would go without
locking. When doing a write, we would have to lock.

To avoid locking, each worker thread would have its own set of upstream
sockets, and we would dup() the sockets from the users so we don't have to
lock those either.

Multiple processes with coroutines and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would need the layered cache. The upper caches would be mapped into
local memory for read-only access. Each cache would be a separate process.
That process would do the updates ‒ if the answer was not there, the process
would be asked, by some kind of IPC, to pull it from the upstream cache or
the network.
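A minimal sketch of the read-only mapping in a worker process, assuming the
upper cache lives in a file whose layout both sides agree on; the function
name is hypothetical and the IPC path for reporting misses to the owning
cache process is not shown:

[source,c++]
----
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an upper-level cache segment read-only into this worker process.
// The cache-owning process is the only writer; readers here can only look,
// so no lock is needed on this side.
const void* map_upper_cache(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }
    void* mapping = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    if (mapping == MAP_FAILED) {
        return NULL;
    }
    *size_out = st.st_size;
    return mapping;
}
----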