BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server
An interesting bug was revealed by commit
5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()"). When shutting
down a server to redistribute its connections, no check is made on the
backend's queue. If we're turning off the last server and the backend
has pending connections, these ones will wait there till the queue
timeout. But worse, since the commit above, we can enter an endless loop
in the following situation:
- streams are present in the backend's queue
- streams are purged on the last server via srv_shutdown_streams()
- that one calls pendconn_redistribute(srv) which does not purge
the backend's pendconns
- a stream performs some load balancing and enters assign_server_and_queue()
- assign_server() is called in turn
- the LB algo is non-deterministic and there are entries in the
backend's queue. The function notices it and returns SRV_STATUS_FULL
- assign_server_and_queue() calls pendconn_add() to add the connection
to the backend's queue
- on return, pendconn_must_try_again() is called, it figures there's
no stream served anymore on the server nor the proxy, so it removes
the pendconn from the queue and returns 1
- assign_server_and_queue() loops back to the beginning to try again,
while the conditions have not changed, resulting in an endless loop.
Ideally a change count should be used in the queues so that it's possible
to detect that some dequeuing happened and/or that a last stream has left.
But that wouldn't completely solve the problem that is that we must never
ever add to a queue when there's no server streams to dequeue the new
entries.
The current solution consists in making pendconn_redistribute() take care
of the proxy after the server in case there's no more server available on
the proxy. It at least ensures that no pending streams are left in the
backend's queue when shutting streams down or when the last server goes
down. The try_again loop remains necessary to deal with inevitable races
during pendconn additions. It could be limited to a few rounds, though,
but it should never trigger if the conditions are sufficient to permit
it to converge.
One way to reproduce the issue is to run a config with a single server
with maxconn 1 and plenty of threads, then run in loops series of:
"disable server px/s;shutdown sessions server px/s;
wait 100ms server-removable px/s; show servers conn px;
enable server px/s"
on the CLI at ~10/s while injecting with around 40 concurrent conns at
40-100k RPS. In this case in 10s - 1mn the crash can appear with a
backtrace like this one for at least 1 thread:
#0 pendconn_add (strm=strm@entry=0x17f2ce0) at src/queue.c:487
#1 0x000000000064797d in assign_server_and_queue (s=s@entry=0x17f2ce0) at src/backend.c:1064
#2 0x000000000064a928 in srv_redispatch_connect (s=s@entry=0x17f2ce0) at src/backend.c:1962
#3 0x000000000064ac54 in back_handle_st_req (s=s@entry=0x17f2ce0) at src/backend.c:2287
#4 0x00000000005ae1d5 in process_stream (t=t@entry=0x17f4ab0, context=0x17f2ce0, state=<optimized out>) at src/stream.c:2336
It's worth noting that other threads may often appear waiting after the
poller and one in server_atomic_sync() waiting for isolation, because
the event that is processed when shutting the server down is consumed
under isolation, and having less threads available to dequeue remaining
requests increases the probability to trigger the problem, though it is
not at all necessary (some less common traces never show them).
This should carefully be backported wherever the commit above was
backported.