There's still a lot of contention when accessing the backend's
totpend and queueslength for every request in may_dequeue_tasks(),
even when queues are not used. This only happens because they are stored
in the same cache line as ->beconn, which is being written by other
threads:
0.01 | call sess_change_server
0.02 | mov 0x188(%r15),%esi ## s->queueslength
| if (may_dequeue_tasks(srv, s->be))
0.00 | mov 0xa8(%r12),%rax
0.00 | mov -0x50(%rbp),%r11d
0.00 | mov -0x60(%rbp),%r10
0.00 | test %esi,%esi
| jne 3349
0.01 | mov 0xa00(%rax),%ecx ## p->queueslength
8.26 | test %ecx,%ecx
4.08 | je 288d
This patch moves queueslength and totpend to their own cache line,
thus adding 64 bytes to the struct proxy, but gaining 3.6% of RPS
on a 64-core EPYC thanks to the elimination of this false sharing.
process_stream() goes down from 3.88% to 3.26% in perf top, with
the next top users being inc/dec (s->served) and be->beconn.
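
For illustration only, the layout change can be sketched outside of the
HAProxy tree as below. This is a minimal sketch and not the project's
code: struct demo_proxy, DEMO_CACHE_LINE and demo_same_line() are
hypothetical names, a 64-byte cache line is assumed, and C11 alignas
stands in for THREAD_ALIGN(), whose definition is not shown here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof, size_t */

#define DEMO_CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical reduced struct, only the fields discussed above. */
struct demo_proxy {
	/* these ones change all the time */
	alignas(DEMO_CACHE_LINE) int served;
	unsigned int feconn, beconn;

	/* only changed when queues are involved, but checked all the time */
	alignas(DEMO_CACHE_LINE) unsigned int queueslength;
	int totpend;
};

/* Returns 1 when two member offsets fall into the same cache line,
 * assuming the struct itself starts on a cache line boundary.
 */
static inline int demo_same_line(size_t a, size_t b)
{
	return a / DEMO_CACHE_LINE == b / DEMO_CACHE_LINE;
}

/* Compile-time check: the read-mostly fields start a fresh line. */
static_assert(offsetof(struct demo_proxy, queueslength) % DEMO_CACHE_LINE == 0,
              "queueslength must start its own cache line");
```

With this layout, writes to beconn no longer invalidate the line that
holds queueslength and totpend, at the cost of one extra cache line per
struct. Note that alignas only constrains the type: a heap-allocated
instance still needs a suitably aligned allocation (for example with
aligned_alloc()) for the offsets to map onto real cache lines.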
EXTRA_COUNTERS(extra_counters_be);
THREAD_ALIGN();
- unsigned int queueslength; /* Sum of the length of each queue */
+ /* these ones change all the time */
int served; /* # of active sessions currently being served */
- int totpend; /* total number of pending connections on this instance (for stats) */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
+
+ THREAD_ALIGN();
+ /* these ones are only changed when queues are involved, but checked
+ * all the time.
+ */
+ unsigned int queueslength; /* Sum of the length of each queue */
+ int totpend; /* total number of pending connections on this instance (for stats) */
};
struct switching_rule {