From: Willy Tarreau
Date: Thu, 18 Sep 2025 13:08:12 +0000 (+0200)
Subject: OPTIM: ring: check the queue's owner using a CAS on x86
X-Git-Tag: v3.3-dev9~53
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=a727c6eaa54f95e72a45b98c2d5ff9d89ac54448;p=thirdparty%2Fhaproxy.git

OPTIM: ring: check the queue's owner using a CAS on x86

In the loop where the queue's leader tries to get the tail lock, we
also need to check whether another thread has taken ownership of the
queue the current thread is working for. This is currently done using
an atomic load. Tests show that on x86, using a CAS for this is much
more efficient, because it keeps the cache line in exclusive state for
a few more cycles, allowing the queue release call after the loop to
complete without having to wait again.

The measured gain is +5% for 128 threads on a 64-core AMD system
(11.08M msg/s vs 10.56M). However, ARM loses about 1% with this, and
we cannot afford that on machines without a fast CAS anyway, so the
load is performed using a CAS only on x86_64. It might not be as
efficient on low-end models, but we don't care since they are not the
ones dealing with high contention.
---

diff --git a/src/ring.c b/src/ring.c
index 8a97b37c0..79f023aa1 100644
--- a/src/ring.c
+++ b/src/ring.c
@@ -275,7 +275,18 @@ ssize_t ring_write(struct ring *ring, size_t maxlen, const struct ist pfx[], siz
 	 */
 	while (1) {
-		if ((curr_cell = HA_ATOMIC_LOAD(ring_queue_ptr)) != &cell)
+#if defined(__x86_64__)
+		/* read using a CAS on x86, as it will keep the cache line
+		 * in exclusive state for a few more cycles that will allow
+		 * us to release the queue without waiting after the loop.
+		 */
+		curr_cell = &cell;
+		HA_ATOMIC_CAS(ring_queue_ptr, &curr_cell, curr_cell);
+#else
+		curr_cell = HA_ATOMIC_LOAD(ring_queue_ptr);
+#endif
+		/* give up if another thread took the leadership of the queue */
+		if (curr_cell != &cell)
 			goto wait_for_flush;
 
 		/* OK the queue is locked, let's attempt to get the tail lock.
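The trick above relies on a property of compare-and-swap: when the "expected" and "desired" values are the same, a successful CAS rewrites the value already present (so nothing changes), and a failed CAS reports the actual current value through the "expected" out-parameter. Either way, the caller learns the current value, like a load, but the write attempt pulls the cache line into exclusive state. Below is a minimal standalone sketch of that pattern in portable C11 atomics; the `struct cell` type and `read_owner()` helper are hypothetical stand-ins for HAProxy's queue cell and `HA_ATOMIC_CAS()` macro, not the real ring code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for HAProxy's ring queue cell. */
struct cell {
	int dummy;
};

/* Read the queue's owner pointer via a CAS instead of a plain load.
 *
 * We pass the same value ("mine") as both the expected and desired
 * arguments:
 *  - if *owner == mine, the CAS succeeds and stores mine back, so the
 *    value is unchanged but the cache line was acquired in exclusive
 *    state;
 *  - if *owner != mine, the CAS fails and atomic_compare_exchange
 *    updates "curr" to the value actually observed.
 * In both cases the returned pointer is the current owner.
 */
static struct cell *read_owner(_Atomic(struct cell *) *owner,
                               struct cell *mine)
{
	struct cell *curr = mine;

	atomic_compare_exchange_strong(owner, &curr, curr);
	return curr;
}
```

A leader thread would call `read_owner(&queue_owner, &my_cell)` in its spin loop and give up as soon as the result differs from `&my_cell`, exactly as the patched loop does with `HA_ATOMIC_CAS()`. Note that this is only a win on architectures with a cheap native CAS (hence the `__x86_64__` guard in the patch); on LL/SC machines a failed CAS can be more expensive than a load.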