RTR Server: thread-pool.server.max now refers to RTR requests
Apparently, there was a huge misunderstanding when the thread pool was
implemented.
The intended model was
> When the RTR server receives a request, it borrows a thread from the
> thread pool, and tasks it with the request.
Which is logical and a typical thread pool use case. However, what was
actually implemented was
> When the RTR server opens a connection, it borrows a thread from the
> thread pool, and tasks it with the whole connection.
So `thread-pool.server.max` used to be a hard limit on simultaneous RTR
clients (routers), but now it's just a limit on simultaneous RTR
requests. (Surplus requests will queue.) This is much less taxing on the
CPU when there are hundreds of clients.
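As a toy model of the new behavior (names and sizes are illustrative,
not Fort's actual code): two pool threads drain a queue of five
requests, so surplus requests simply wait their turn instead of each
connection monopolizing a thread.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy model, not Fort's actual code: POOL_MAX plays the role of
 * thread-pool.server.max. 2 pool threads drain a queue of 5 requests;
 * the 3 surplus requests wait in the queue. */
#define POOL_MAX 2
#define REQUESTS 5

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int queued;           /* requests waiting for a thread */
static int handled;          /* requests already served */
static bool done_enqueueing;

static void *pool_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (queued > 0 || !done_enqueueing) {
		if (queued == 0) {
			pthread_cond_wait(&cond, &lock);
			continue;
		}
		queued--; /* borrow the thread for ONE request */
		pthread_mutex_unlock(&lock);
		/* ... parse and answer a single RTR request here ... */
		pthread_mutex_lock(&lock);
		handled++;
	}
	pthread_cond_broadcast(&cond); /* wake siblings so they exit too */
	pthread_mutex_unlock(&lock);
	return NULL;
}

static int run_pool_demo(void)
{
	pthread_t pool[POOL_MAX];
	int i;

	for (i = 0; i < POOL_MAX; i++)
		pthread_create(&pool[i], NULL, pool_worker, NULL);

	pthread_mutex_lock(&lock);
	queued = REQUESTS;
	done_enqueueing = true;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);

	for (i = 0; i < POOL_MAX; i++)
		pthread_join(pool[i], NULL);

	return handled;
}
```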
Thanks to Mark Tinka for basically spelling this out to me.
-----------------------
Actually, this commit is an almost complete rewrite of the RTR server
core. Here's a (possibly incomplete) list of other problems I had to fix
in the process:
== Problem 1 ==
sockaddr2str() was returning a pointer to invalid memory on success.
This was due to a naive bugfix attempt in commit
1ff403a0c7f61d443cbc4e2e512b8d0324547856.
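For reference, this is the general bug class (a minimal sketch of the
pattern, not Fort's actual sockaddr2str()):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Minimal sketch of the bug class; not Fort's actual code. */

/* Buggy: inet_ntop() writes into (and returns a pointer to) buf,
 * which lives on this function's stack. The pointer dangles as soon
 * as the function returns. */
static char const *addr2str_buggy(struct sockaddr_in const *addr)
{
	char buf[INET_ADDRSTRLEN];
	return inet_ntop(AF_INET, &addr->sin_addr, buf, sizeof(buf));
}

/* Fixed: the caller owns the buffer, so the result outlives the call. */
static char const *addr2str_fixed(struct sockaddr_in const *addr,
    char *buf, socklen_t size)
{
	return inet_ntop(AF_INET, &addr->sin_addr, buf, size);
}
```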
== Problem 2 ==
Changed the delta expiration conditional.
The old rule was "keep track of the clients, expire deltas once all
clients outgrow them." I see two problems with that:
1. It'll lead to bad performance if a client misbehaves by not
maintaining the connection. (i.e., the server will have to fall back to
too many cache resets.)
2. It might keep the deltas forever if a client bugs out without killing
the connection.
The new rule is "keep deltas for server.deltas.lifetime iterations."
"server.deltas.lifetime" is a new configuration option.
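The new rule could look something like this (a sketch under assumed
names; server.deltas.lifetime maps to the `lifetime` argument, and
`struct delta_sketch` is not Fort's actual structure):

```c
#include <stdbool.h>

/* Sketch of the new expiration rule; names are illustrative. Each
 * delta remembers the validation iteration that created it, and dies
 * of old age after `lifetime` iterations, no matter what clients do. */
struct delta_sketch {
	unsigned int from_serial; /* serial this delta upgrades from */
	unsigned int iteration;   /* validation cycle that built it */
};

static bool delta_expired(struct delta_sketch const *d,
    unsigned int now, unsigned int lifetime)
{
	return now - d->iteration > lifetime;
}
```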
== Problem 3 ==
Serials weren't being compared according to RFC 1982 serial arithmetic.
This was going to cause mayhem when the integer wrapped.
(Though Fort always starts at 1, and serials are 32-bit unsigned
integers, so this wasn't going to be a problem for a very long time.)
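For 32-bit serials, the RFC 1982 comparison boils down to a single
wrap-safe check (a sketch; Fort's actual helper may differ):

```c
#include <stdbool.h>
#include <stdint.h>

/* RFC 1982 "s1 < s2" for SERIAL_BITS = 32: true when the wrapped
 * distance from s1 to s2 is nonzero and below 2^31. When the distance
 * is exactly 2^31, RFC 1982 leaves the comparison undefined; this
 * sketch answers false in that case. */
static bool serial_lt(uint32_t s1, uint32_t s2)
{
	return s1 != s2 && (uint32_t)(s2 - s1) < 0x80000000u;
}
```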
== Problem 4 ==
The thread pool had an awkward termination bug. Suspended threads were
meant to be stopped through a pthread condition signal, but running
threads were supposed to be terminated through pthread_cancel().
(Because, since each client was assigned a thread, threads would spend
most of their time sleeping.) These termination methods don't play well
with each other.
Apparently, threads waiting on a condition variable cannot safely be
canceled, because of this strange quirk from man 3 pthread_cond_wait:
> a side effect of acting upon a cancellation request while in a
> condition wait is that the mutex is (in effect) re-acquired before
> calling the first cancellation cleanup handler.
(So the first thread dies with the mutex locked, and no other threads
can be canceled because no one can ever lock the mutex again.)
And of course, you can't stop a running server thread through the
signal, because it isn't listening for it; it's sleeping while waiting
for a request.
I still don't really know how I would fix this, but luckily, the
problem no longer exists: working threads are now mapped to single
requests, and therefore no longer sleep. (For long periods of time,
anyway.)
So always using the signal works fine.
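The surviving termination path can be sketched like this (illustrative
names, not Fort's code): a stop flag guarded by the same mutex as the
request queue, plus a broadcast so every waiter wakes up, sees the
flag, and exits with the mutex properly released.

```c
#include <pthread.h>
#include <stdbool.h>

/* Sketch of signal-based termination, with illustrative names. The
 * stop flag shares the queue's mutex, so one broadcast reliably
 * reaches every waiter, and each worker releases the mutex before
 * dying (unlike the pthread_cancel() scenario described above). */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_cond = PTHREAD_COND_INITIALIZER;
static bool stopped;
static int pending; /* queued requests; stands in for a real queue */

static void *pool_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&pool_lock);
	for (;;) {
		while (!stopped && pending == 0)
			pthread_cond_wait(&pool_cond, &pool_lock);
		if (stopped)
			break;
		pending--;
		pthread_mutex_unlock(&pool_lock);
		/* ... handle one request; short, so no cancel needed ... */
		pthread_mutex_lock(&pool_lock);
	}
	pthread_mutex_unlock(&pool_lock); /* die with the mutex released */
	return NULL;
}

static int run_shutdown_demo(void)
{
	pthread_t th[3];
	int i;

	for (i = 0; i < 3; i++)
		if (pthread_create(&th[i], NULL, pool_thread, NULL) != 0)
			return -1;

	pthread_mutex_lock(&pool_lock);
	stopped = true;
	pthread_cond_broadcast(&pool_cond);
	pthread_mutex_unlock(&pool_lock);

	for (i = 0; i < 3; i++)
		pthread_join(th[i], NULL);
	return 0;
}
```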