From c63afb036b3cc190b4acb03722443c52b76ab384 Mon Sep 17 00:00:00 2001
From: aBainbridge11 <113794078+aBainbridge11@users.noreply.github.com>
Date: Tue, 23 Jul 2024 11:57:35 -0400
Subject: [PATCH] Create Scheduler

---
 .../modules/developers/pages/scheduler.adoc | 152 ++++++++++++++++++
 1 file changed, 152 insertions(+)
 create mode 100644 doc/antora/modules/developers/pages/scheduler.adoc

diff --git a/doc/antora/modules/developers/pages/scheduler.adoc b/doc/antora/modules/developers/pages/scheduler.adoc
new file mode 100644
index 0000000000..41e71e8c4f
--- /dev/null
+++ b/doc/antora/modules/developers/pages/scheduler.adoc
@@ -0,0 +1,152 @@
# Scheduler

Each worker thread runs a scheduler, which does the following:

* check the resumable queue for "too old" packets
** once per second, check all packets older than 1s
** ideally printing out where the request is blocked (module, instance, name)
** see Yield / Resume, below
* service the event list
** FD events first
** followed by timer events
** so that we read packets from sockets before they time out
* service the incoming queue from the network threads
** add packets to the "decode" list, which is run at a higher priority than other requests
** the idea here is to quickly clean up the messages between network threads and worker threads
** push decoded packets onto the "runnable" priority heap, which is ordered by time
* grab a request from the "runnable" heap
* check `fr_status_continue()`
** if the socket is closed, drop the request
** if a conflicting packet has come in, drop the request
** otherwise, process the packet until it's done, or until it yields
* keep looping

## Signaling

See the channel page and the signaling page for more details.

The main problem we have with network / worker threads is signaling.
If the packets are widely spaced, the network thread can signal the
worker thread for every packet, and vice versa for every reply.
However, if the packets arrive quickly, each end should switch to
busy-polling.

How to do this is non-trivial.

## Scheduling

Each network thread runs its own scheduler over the worker threads.
Note that a network thread may be aware of only a subset of the worker
threads.  Splitting the scheduler like this means there is minimal
contention, and ideal scaling.

The worker threads are weighted by the total CPU time spent processing
requests, and are put into a priority heap ordered by that CPU time.
The CPU time is not only the historical time spent processing
requests, but also the predicted time for the current "live" requests.
This information is passed from the worker thread to the network
thread on every reply.

When a packet comes in, the network thread pops the first worker off
of the heap, updates that worker's CPU time (adding the predicted time
spent processing the new packet), and inserts the worker back into the
priority heap.

As the heap will generally be small, the overhead of this work will be
small.
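
Below is a minimal, standalone sketch of that pop / update / push
cycle, just to make the bookkeeping concrete.  The names (`worker_t`,
`worker_heap_t`, `worker_assign()`), the fixed-size heap, and the
nanosecond costs are all illustrative assumptions for this page, not
the real FreeRADIUS types or APIs.

[source,c]
----
/*
 *	Illustrative sketch only: a tiny min-heap of workers keyed on
 *	(historical + predicted) CPU time.  The real server uses its
 *	own heap and channel code.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	int		id;
	uint64_t	cpu_time_ns;	/* historical + predicted CPU time */
} worker_t;

typedef struct {
	worker_t	**w;		/* binary min-heap, ordered by cpu_time_ns */
	size_t		num;
} worker_heap_t;

/* Restore heap order after the key of entry 'i' has increased. */
static void heap_sift_down(worker_heap_t *h, size_t i)
{
	for (;;) {
		size_t l = (2 * i) + 1, r = l + 1, min = i;

		if ((l < h->num) && (h->w[l]->cpu_time_ns < h->w[min]->cpu_time_ns)) min = l;
		if ((r < h->num) && (h->w[r]->cpu_time_ns < h->w[min]->cpu_time_ns)) min = r;
		if (min == i) return;

		worker_t *tmp = h->w[i];
		h->w[i] = h->w[min];
		h->w[min] = tmp;
		i = min;
	}
}

/*
 *	The "pop, update, push" cycle, done in place: take the least
 *	loaded worker (the heap root), charge it the predicted cost of
 *	the new packet, and sift it down to its new position.
 */
static worker_t *worker_assign(worker_heap_t *h, uint64_t predicted_cost_ns)
{
	worker_t *w = h->w[0];

	w->cpu_time_ns += predicted_cost_ns;
	heap_sift_down(h, 0);

	return w;
}

int main(void)
{
	worker_t	pool[3] = { { 0, 0 }, { 1, 0 }, { 2, 0 } };
	worker_t	*ptrs[3] = { &pool[0], &pool[1], &pool[2] };
	worker_heap_t	heap = { ptrs, 3 };

	/* Assume a flat ~1ms predicted cost per packet, for the demo. */
	for (int i = 0; i < 6; i++) {
		worker_t *w = worker_assign(&heap, 1000000);

		printf("packet %d -> worker %d\n", i, w->id);
	}
	return 0;
}
----

When a reply comes back, the same update would presumably run again:
the predicted cost is replaced with the CPU time reported by the
worker, and that worker is re-ordered in the heap.
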

## Statistics

We need delay statistics for slow modules (hi, SQL!).  The best
solution is the following.

Times are recorded on `unlang_yield()` and `unlang_resume()`.  The
delta is the time spent waiting on events for the module to do
something.  In addition, we need to track the module name, the
instance name, and a name for this yield point (e.g. user lookup
versus group lookup).

We do this via dictionary attributes.  While there is no *requirement*
to do so, doing it this way enables policies to be based on these
times, and allows for simple tracking / stats.

We create a new list `stats` for each request.  This is a list of
statistics attributes that is (ideally) read-only.  Only the server
core can write to it.

The `modules.c` code then creates a series of TLVs, all under one
parent TLV.  8-bit sub-TLVs should be good enough here.

----
Module-Statistics
  Module-Stats-SQL
    Module-Stats-SQL-SQL1
      Module-Stats-SQL-SQL1-operation1
        Module-Stats-SQL-SQL1-operation1-Delay
        Module-Stats-SQL-SQL1-operation1-Total-Requests
        Module-Stats-SQL-SQL1-operation1-Delay-1us
        Module-Stats-SQL-SQL1-operation1-Delay-10us
        Module-Stats-SQL-SQL1-operation1-Delay-100us
        Module-Stats-SQL-SQL1-operation1-Delay-1ms
        Module-Stats-SQL-SQL1-operation1-Delay-10ms
        Module-Stats-SQL-SQL1-operation1-Delay-100ms
        Module-Stats-SQL-SQL1-operation1-Delay-1s
        Module-Stats-SQL-SQL1-operation1-Delay-10s
        Module-Stats-SQL-SQL1-operation1-Delay-100s
----

All are TLVs, except for the lowest layer, which are of type
`integer64`.

The `modules.c` code creates the first few layers of each TLV.

The `modules` header file defines a few macros:

    DEFINE_YIELD_POINT(name)
    USE_YIELD_POINT(inst, name)
    CACHE_DA_YIELD_POINT(name)
    CREATE_DA_YIELD_POINT(inst, name)

which (in turn) do the following:

* `DEFINE_YIELD_POINT()` defines a named yield point, and updates
  various other internal macros to be used by `CACHE_DA` and
  `CREATE_DA`.

* `USE_YIELD_POINT()` uses the pre-defined `da` inside of an
  `unlang_yield()` call.  The `da` should be cached inside of the
  module instance struct, so that we don't have to do dictionary
  lookups at run time.

* `CACHE_DA_YIELD_POINT()` creates a
  `fr_dict_attr_t const *da_yield_point_NAME;` inside of the module
  instance structure.

* `CREATE_DA_YIELD_POINT()`, in the module `bootstrap()` function,
  creates all of the relevant `da`s and caches them in the module
  instance.

We should then also have a `module_stats` module, which looks at the
per-request stats, and auto-creates the bins for each yield point.
It should probably also aggregate the stats up the TLV tree.

Then, once per 1K requests, or once per second (whichever is shorter),
that module grabs a mutex lock, and aggregates the binned data into
global counters.

That way the stats API can just query that module, and the module
returns the total stats for all delays in the server.

## Yield / Resume

The thread needs to know which requests have yielded, so that it can
issue warnings about blocked modules.  The best approach here is for
the `REQUEST` to have a pointer to thread-specific data (the yielded
linked list), and to add itself to the tail of that list when it
yields.  When the request resumes, it removes itself from the list.

In the current code, when a request is waiting for a timeout or a
socket event, the request is "lost", and buried inside of the event
loop.  The thread has no way of knowing where the request is.

With this solution, the thread has a linked list (oldest to newest) of
yielded requests.

This design goes along with the philosophy of the rest of the server:
"track as little as possible, and do work asynchronously where
possible".

!Diagram of yield / resume timing
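
As a rough illustration of that bookkeeping, here is a standalone
sketch of a per-thread yielded-request list: a request appends itself
on yield, unlinks itself on resume, and the once-per-second check
walks the list from the oldest end.  The names (`yield_entry_t`,
`thread_ctx_t`, and so on) are invented for this example, and are not
the real `REQUEST` or unlang structures.

[source,c]
----
/*
 *	Illustrative sketch only: per-thread doubly-linked list of
 *	yielded requests, oldest at the head, newest at the tail.
 */
#include <stdio.h>
#include <time.h>

typedef struct yield_entry_s {
	struct yield_entry_s	*prev, *next;
	time_t			yielded_at;
	char const		*module;	/* e.g. "sql" */
	char const		*instance;	/* e.g. "sql1" */
	char const		*yield_point;	/* e.g. "user lookup" */
} yield_entry_t;

typedef struct {
	yield_entry_t	*head;		/* oldest yielded request */
	yield_entry_t	*tail;		/* newest yielded request */
} thread_ctx_t;

/* Called on yield: the request adds itself to the tail of the list. */
static void yield_track(thread_ctx_t *t, yield_entry_t *e)
{
	e->yielded_at = time(NULL);
	e->next = NULL;
	e->prev = t->tail;

	if (t->tail) t->tail->next = e; else t->head = e;
	t->tail = e;
}

/* Called on resume: the request removes itself from the list. */
static void yield_untrack(thread_ctx_t *t, yield_entry_t *e)
{
	if (e->prev) e->prev->next = e->next; else t->head = e->next;
	if (e->next) e->next->prev = e->prev; else t->tail = e->prev;

	e->prev = e->next = NULL;
}

/*
 *	Run roughly once per second.  The list is ordered oldest to
 *	newest, so we can stop at the first entry which isn't "too old".
 */
static void yield_check(thread_ctx_t *t, double max_age)
{
	time_t now = time(NULL);

	for (yield_entry_t *e = t->head; e; e = e->next) {
		if (difftime(now, e->yielded_at) < max_age) break;

		printf("WARNING: request blocked in %s (%s) at \"%s\" for %.0f second(s)\n",
		       e->module, e->instance, e->yield_point,
		       difftime(now, e->yielded_at));
	}
}

int main(void)
{
	thread_ctx_t	t = { NULL, NULL };
	yield_entry_t	e = { .module = "sql", .instance = "sql1", .yield_point = "user lookup" };

	yield_track(&t, &e);
	yield_check(&t, 0);	/* a max_age of 0 flags everything, just for the demo */
	yield_untrack(&t, &e);

	return 0;
}
----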