From f7221372c758bc96f55c6a038544226dc1ac79a4 Mon Sep 17 00:00:00 2001
From: aBainbridge11 <113794078+aBainbridge11@users.noreply.github.com>
Date: Tue, 23 Jul 2024 11:04:29 -0400
Subject: [PATCH] Create Message API

---
 .../modules/developers/pages/message.adoc | 312 ++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 doc/antora/modules/developers/pages/message.adoc

diff --git a/doc/antora/modules/developers/pages/message.adoc b/doc/antora/modules/developers/pages/message.adoc
new file mode 100644
index 0000000000..fb60322d8c
--- /dev/null
+++ b/doc/antora/modules/developers/pages/message.adoc
@@ -0,0 +1,312 @@

# Message API

The message API is designed to exchange messages between network and
worker threads.

The v3 design is just a `REQUEST` structure and a FIFO, along with
mutexes for thread safety.  This design works, but has issues.  First,
mutex contention is a problem.  Second, memory is often allocated in
one thread and freed in another.  The API is simple, but not
sophisticated.

The goals for the v4 message API are:

* allow it to handle non-RADIUS traffic
* minimize memory copying
* minimize memory allocation / freeing
* avoid mutex locks where possible
* be high performance

## Overview

The design of the message API in v4 relies on three components.

First, there is a thread-safe queue of messages.  The queue itself
does not contain the messages, but instead contains pointers to the
messages.  This choice means that the queue is small, and easily
manageable.  It also means that writes to the queue are atomic.

Second, there is a slab allocator for message headers.  The message
headers are fixed size, and do not need to be allocated / freed for
every message.

Third, there is a ring buffer for packet data.  This buffer is large
(64K to 1M+), and contains the raw packet data.

An important part of the message implementation is that there are
multiple instances of the message API (i.e. the three items noted
above).  For example, a network thread may have one instance for each
socket that it's reading from, and a worker thread may have one for
each network thread it's writing packets to.

The reason for multiple instances is not just thread safety, but
management.  One of the most common problems in asynchronous systems
is data ownership.  Owning data is cheap; exchanging ownership of data
is expensive.

In this model, the system that creates messages is the system that
owns the messages.  The message target can (temporarily) access a
message, but it does not own the message.

The messages are also ephemeral, and short-lived.  If the data is
needed for any length of time, it must be copied out of the message
subsystem to somewhere else.

### Queue

The queue of messages is a simple FIFO set of pointers.  Where
possible, this is done via a lockless queue.  Otherwise, mutexes are
used.  We may also want a ring buffer for the queue, as it holds just
pointers; updates would then be done via compare-and-swap (CAS).

Queues are owned by the recipient of the messages, i.e. multiple
originators send messages to one queue.  Each recipient (worker
thread, network socket, etc.) has its own queue.

Messages are placed onto the queue by the originator, and the
destination is signalled that a message is ready.

If the queue already has at least one message in it, no signalling is
done.  If the queue transitions from zero messages to N messages, the
originator signals the recipient, via a queue-specific `kqueue` signal
(`EVFILT_USER`).
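As a rough illustration, here is a minimal sketch of the "signal only
on the empty-to-non-empty transition" rule.  All of the names here
(`fr_queue_t`, `fr_queue_push()`, etc.) are hypothetical, and the
sketch uses the simpler mutex-protected variant; a lockless version
would replace the mutex with atomic head / tail counters and CAS:

[source,c]
----
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/event.h>

typedef struct {
	pthread_mutex_t	mutex;
	void		**entries;	/* ring of message pointers */
	size_t		head;		/* write counter */
	size_t		tail;		/* read counter */
	size_t		size;		/* capacity, a power of two */
	int		kq;		/* the recipient's kqueue fd */
	uintptr_t	ident;		/* EVFILT_USER identifier, registered
					 * once by the recipient with
					 * EV_ADD | EV_CLEAR */
} fr_queue_t;

/* Originator side: push a message pointer, returns false if full. */
static bool fr_queue_push(fr_queue_t *q, void *msg)
{
	bool was_empty;

	pthread_mutex_lock(&q->mutex);
	if ((q->head - q->tail) >= q->size) {	/* queue is full */
		pthread_mutex_unlock(&q->mutex);
		return false;
	}
	was_empty = (q->head == q->tail);
	q->entries[q->head++ & (q->size - 1)] = msg;
	pthread_mutex_unlock(&q->mutex);

	/*
	 *	Signal only on the zero -> N transition.  If messages
	 *	were already pending, the recipient has (or will get)
	 *	a signal, so we save a system call.
	 */
	if (was_empty) {
		struct kevent kev;

		EV_SET(&kev, q->ident, EVFILT_USER, 0, NOTE_TRIGGER, 0, NULL);
		(void) kevent(q->kq, &kev, 1, NULL, 0, NULL);
	}
	return true;
}
----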
When the recipient receives the signal, it will either drain the
queue, or it will have some other way (e.g. a "socket write ready"
event) of reminding itself that it should write more messages from the
queue into the socket.

Similar arguments apply for worker threads.  Since they're not writing
to sockets, they can completely drain the queue every time they get a
signal.

The downside of this approach is that there are six system calls for
every packet: one to read the packet, one to signal the queue, one to
receive the signal from the queue, and then the same three (in
reverse) to write the reply.  **Lowering this overhead is a high
priority.**

### Message Headers

The message headers are allocated via a slab allocator, possibly a
ring buffer.  The message headers are fixed size.

We can't put the message headers into the packet data ring buffer
because of issues with TCP.  We want to be able to read lots of TCP
data into a buffer, and that data may encompass multiple RADIUS
packets.

The message API allows for the allocation of a message, possibly with
bulk data.  This allocation just grabs the next free header.  The bulk
data allocation is discussed below.

Once a message has been created and filled out, it is written to the
queue (i.e. a pointer to the message header).  When the destination is
finished with the message, it asynchronously notifies the originator
by marking the message as handled.  The originator is then responsible
for updating the slab / ring allocator with the information that the
message is freed.

This method ensures that there is only a single owner of a message at
a time, and that the originator is the system which does all of the
work of managing memory, lists, etc.

The messages usually consist of:

* message status (free / allocated / handled)
* message type (send packet, other signal)
* type-specific information
* packet type
** packet ID (globally unique number)
** socket information (likely a pointer to the socket information, with src/dst IP/port, file descriptor, etc.)
** ring buffer pointer
** pointer / size to raw packet data (if required)
* signal
** packet ID
** requested status (done, dup, socket went away, etc.)
* other ...

When the recipient is done with the message, it marks the message as
"handled", and lets the originator do asynchronous cleanups (a C
sketch of this hand-off appears in the Implementation Trade-offs
section below).

If the recipient has to sit on the data for a long time
(e.g. `cleanup_delay`), it has to copy the message and any associated
packet data into a separate buffer.  This copy ensures that the
message headers and packet data ring buffer hold information for as
short a period of time as possible.

### Packet Data Ring Buffer

The packet data ring buffer is a way to avoid memory copies, and a way
to deal with variable-sized packets.

The ring buffer starts off at some size (e.g. 64K), and the reader
tries to read a maximum-sized packet into it.  It generally receives a
smaller packet (e.g. 200 bytes), in which case the "start" pointer can
be incremented, and the next packet read into the ring.

Since the "create message" API is used by only one caller
(e.g. a network thread or worker thread), there is no issue with
overlapping / simultaneous calls to allocate messages.

## Implementation Trade-offs

Ring buffers / message APIs are tied to a socket (for network
threads), or to a source+socket (for worker threads).  That is, a
worker has one outbound message set per network destination, while a
network destination has only one outbound message set across all
workers.
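Before weighing the trade-offs further, here is a minimal, hypothetical
sketch of the message header described above, and of the asynchronous
"handled" hand-off.  The recipient writes only the status field; the
originator reclaims handled messages lazily.  The names
(`fr_message_t`, `fr_message_done()`, `fr_message_gc()`) are
illustrative, not the actual API:

[source,c]
----
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
	FR_MESSAGE_FREE = 0,	/* available for allocation */
	FR_MESSAGE_ALLOCATED,	/* owned by the originator */
	FR_MESSAGE_HANDLED,	/* recipient is done; originator may reclaim */
} fr_message_status_t;

typedef struct {
	_Atomic fr_message_status_t status; /* the ONLY field the recipient writes */
	int		type;		/* send packet, other signal, ... */
	uint64_t	packet_id;	/* globally unique number */
	void		*socket_info;	/* src/dst IP/port, file descriptor, etc. */
	uint8_t		*data;		/* pointer into the packet data ring buffer */
	size_t		data_len;	/* size of the raw packet data */
} fr_message_t;

/* Recipient side: done with the message.  No locks, and no free(). */
static inline void fr_message_done(fr_message_t *m)
{
	atomic_store_explicit(&m->status, FR_MESSAGE_HANDLED,
			      memory_order_release);
}

/*
 *	Originator side: lazily sweep a slab of headers, reclaiming
 *	handled ones.  To avoid cache line thrashing, this is called
 *	only when too many messages are outstanding, or the slab /
 *	ring buffer is full.
 */
static void fr_message_gc(fr_message_t *slab, size_t num)
{
	for (size_t i = 0; i < num; i++) {
		if (atomic_load_explicit(&slab[i].status,
					 memory_order_acquire) != FR_MESSAGE_HANDLED) continue;

		/* Also advance the ring buffer "start" past slab[i].data here. */
		atomic_store_explicit(&slab[i].status, FR_MESSAGE_FREE,
				      memory_order_relaxed);
	}
}
----

The key property is that the recipient never frees anything and never
touches the allocator's state; ownership stays with the originator, as
described above.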
There is no perfect solution to this trade-off.  The messages are
intended to be short-lived, *but* may be long-lived for things like
`cleanup_delay`.  Messages to worker threads are short-lived, hence
only one outgoing message set.

If the message set is full, it is doubled in size (up to a limit),
which gives room for more messages.

The memory and structures are owned by the originator, and only
cleaned up by the originator.  The recipient of the messages accesses
the data, but doesn't write to it; it only writes to the message
header, as an async signal saying "done with this message".

In order to prevent cache line thrashing, the originator only checks
for "done" messages when (1) too many messages are outstanding, or (2)
the ring buffer / message array is full.

When those limits are hit, the message API tries to clean up old
messages.  If it's successful, allocation proceeds with the current
ring buffer / message array.  Otherwise, a new message array and/or
ring buffer is allocated, at double the size of the old one.  (A
sketch of this allocation path appears later in this section.)

Message arrays (MA) and ring buffers (RB) are tracked in a fixed-size
array (not a linked list).  An array size of 20 should be enough for
20 doublings... at which point the system is likely out of memory.

For consolidation, if we have more than two (2) MA/RB array entries
allocated, and the first N are free, we free the smallest ones, and
coalesce the array entries so that we have the smallest number of
MA/RB entries, each of which is as large as possible.

If the network thread needs to perform de-dup, `cleanup_delay`, or
other "wait on socket" operations, it just leaves the reply packets in
the messages.  The worker thread will then allocate larger buffers if
necessary, *or* just start tossing replies.

If the network thread can't write to a TCP socket, it also removes the
socket from the "read" portion of the event loop.  This change ensures
that the server isn't reading packets faster than the other end can
receive replies.  We then rely on TCP for flow control back to the
sender.  When the socket becomes ready for writing, it is added back
to the "read" portion of the event loop.

TCP sockets will need to track ongoing packets, so that the associated
requests can be sent a "stop" message when the TCP socket goes away.
UDP sockets need this tracking for authentication packets, and also
for accounting packets, to handle conflicting packets, i.e. "you can
keep processing the old packet, but don't bother sending the reply, as
the client will not accept it".

When the UDP sockets do accounting tracking, they just track the
minimum information necessary to detect duplicates.  When a reply
comes from a conflicting packet, the network thread can quench the
reply by noticing it's for an old packet, and not for the new one
(???).  Or, the network thread has to track the old packet (??) and
send an async signal to the worker that the request is dead, so that
the worker stops processing it.

When UDP sockets do dedup / `cleanup_delay` detection, they track the
minimum information necessary to detect duplicates, along with a
pointer to the message for the reply.  If a new packet comes in, the
old message is marked "done".  If a dup comes in, the same reply is
sent.
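Stepping back to the allocation limits described earlier in this
section, here is a rough sketch of the allocation path: clean up
first, and double only when cleanup frees nothing.  This reuses the
hypothetical `fr_message_t` and `fr_message_gc()` from the earlier
sketch; `fr_message_set_t`, `slab_next_free()`, and the rest are
likewise illustrative:

[source,c]
----
#include <stdlib.h>

#define MAX_DOUBLINGS (20)	/* the fixed-size tracking array, as above */

typedef struct {
	fr_message_t	*ma[MAX_DOUBLINGS];	/* arrays, each double the previous size */
	size_t		ma_num[MAX_DOUBLINGS];	/* number of headers in each array */
	size_t		ma_count;		/* arrays in use; always >= 1 */
} fr_message_set_t;

/* Linear scan for a free header; real code would keep a free index. */
static fr_message_t *slab_next_free(fr_message_set_t *ms)
{
	for (size_t i = 0; i < ms->ma_count; i++) {
		for (size_t j = 0; j < ms->ma_num[i]; j++) {
			if (ms->ma[i][j].status == FR_MESSAGE_FREE) return &ms->ma[i][j];
		}
	}
	return NULL;
}

fr_message_t *fr_message_alloc(fr_message_set_t *ms)
{
	fr_message_t *m;

	m = slab_next_free(ms);

	if (!m) {
		/* Limits hit: try to reclaim "handled" messages first. */
		for (size_t i = 0; i < ms->ma_count; i++) {
			fr_message_gc(ms->ma[i], ms->ma_num[i]);
		}
		m = slab_next_free(ms);
	}

	if (!m) {
		fr_message_t *ma;
		size_t num;

		/* Cleanup freed nothing: double, up to the limit. */
		if (ms->ma_count == MAX_DOUBLINGS) return NULL;	/* likely out of memory */

		num = ms->ma_num[ms->ma_count - 1] * 2;
		ma = calloc(num, sizeof(fr_message_t));	/* zeroed, so FR_MESSAGE_FREE */
		if (!ma) return NULL;

		ms->ma[ms->ma_count] = ma;
		ms->ma_num[ms->ma_count] = num;
		ms->ma_count++;
		m = &ma[0];
	}

	m->status = FR_MESSAGE_ALLOCATED;
	return m;
}
----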
For outgoing packets, if the server is lightly loaded, caching packets
for ~5s isn't an issue, and it uses less memory than v3, where we
cache all of the incoming packet, the `REQUEST`, and the outgoing
packet.

If the server is heavily loaded, then in the general case, new packets
coming in will clear the outgoing packets.  When outgoing packets
aren't cleared, we can just take the rare leftover one (e.g. 1 in
1000), copy it to a local buffer, and then clear the original message.

This tracking could be done by delays (i.e. if packets sit in the
outgoing buffer for "too long"), though it's hard to tell what "too
long" means.  Instead, it should be self-clocked, i.e. if 99% of
outgoing packets have been cleaned up, we should probably take the
remaining 1% and "localize" them.

## API

The APIs here are documented in reverse order.

### Ring Buffer API

The ring buffers are only used by the message layer, and aren't
directly accessible by the message originator.

Each message API instance has one ring buffer associated with it, as
above.  The buffer has a fixed size.  The `reader` offset is where
data is read from.  The `writer` offset is where data is written to.

When a message is freed, the `reader` offset is incremented.  If
`reader == writer`, then the ring buffer is empty, and both are reset
to zero.

If the writer gets too close to the end (i.e. `writer + max >= size`),
then `writer` is reset to zero.

If `reader > writer && (writer + max >= reader)`, then there isn't
enough room to allocate a maximum-sized packet in the ring buffer.  A
new ring buffer is allocated, twice the size of the current one.  It
becomes the default ring buffer for new messages.

The old ring buffer is kept around until `reader == writer`, and then
it is discarded.

The message API keeps track of the current and old ring buffers.

Data is read into the buffer via an underlying IO implementation.
Then, packets are decoded from the ring buffer.

Packet allocations from the ring buffer are rounded up to the nearest
cache line size (64 bytes).  This prevents false sharing.

The ring buffer API is:

* create - create a ring buffer
* destroy - free a ring buffer (maybe talloc?)
* used - tell the ring buffer that N bytes of data have been used from it
* unused - tell the ring buffer that N bytes of data are unused at the start

Note that the read / write is done on raw sizes (e.g. 11 bytes).  If
the caller needs cache line alignment, it does so itself.  This is
because the ring buffer is used for both UDP (where each packet is
individual) and for TCP (where packets are streamed together).

We don't use `recvmmsg()`, because it's broken: the timeout is checked
only after a packet is received.  So if you ask for N packets, and it
receives M < N packets, the call can block indefinitely waiting for
the remaining packets, instead of returning when the timeout expires.
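Finally, here is a minimal, hypothetical sketch of the `reader` /
`writer` rules above.  It handles only the simple cases described in
this section; a real implementation also has to record where valid
data ends when the writer wraps, so that the reader knows when to wrap
as well.  The names (`fr_ring_buffer_t`, etc.) are illustrative:

[source,c]
----
#include <stddef.h>
#include <stdint.h>

typedef struct {
	uint8_t	*data;		/* the buffer itself */
	size_t	size;		/* total size, e.g. 64K */
	size_t	reader;		/* oldest in-use byte */
	size_t	writer;		/* next byte to be written */
} fr_ring_buffer_t;

/* Callers that want cache line alignment can round sizes with this. */
static size_t rb_align(size_t len)
{
	return (len + 63) & ~(size_t) 63;
}

/*
 *	Reserve room for a maximum-sized packet.  NULL means the caller
 *	must allocate a new ring buffer at double the size, and make it
 *	the default for new messages.
 */
uint8_t *fr_ring_buffer_reserve(fr_ring_buffer_t *rb, size_t max)
{
	/* Writer too close to the end: wrap around to the start. */
	if ((rb->reader <= rb->writer) && ((rb->writer + max) >= rb->size)) {
		if (max >= rb->reader) return NULL;	/* the start is still in use */
		rb->writer = 0;
	}

	/* Writer is below the reader: is there room before the in-use data? */
	if ((rb->reader > rb->writer) && ((rb->writer + max) >= rb->reader)) {
		return NULL;
	}

	return rb->data + rb->writer;
}

/* "used" - N bytes (a raw size) were written at `writer`. */
void fr_ring_buffer_used(fr_ring_buffer_t *rb, size_t len)
{
	rb->writer += len;
}

/* "unused" - N bytes at the start of the in-use region were freed. */
void fr_ring_buffer_unused(fr_ring_buffer_t *rb, size_t len)
{
	rb->reader += len;
	if (rb->reader == rb->writer) rb->reader = rb->writer = 0;	/* empty: reset both */
}
----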