As reported by Damien Claisse and Cédric Paillet, the "random" LB
algorithm can become particularly unfair with large numbers of servers
having few connections. It's indeed fairly common to see many servers
with zero connections in a thousand-server farm, and in this case the
P2C algorithm, which consists in comparing the servers' loads, doesn't
help at all and is basically equivalent to random(1). We then only
rely on the distribution of server IDs in the random space to pick the
best server, and huge discrepancies can be observed.
An attempt to model the problem clearly shows that with 1600 servers
of weight 10, for 1 million requests, the least loaded ones will take
300 requests while the most loaded ones will get 780, with most of the
values between 520 and 700.
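As an illustration only, the degenerate case can be reproduced with a
small standalone simulation along these lines; it performs a single
lookup per request and uses random node placement instead of the real
ID-derived keys, so it will not reproduce the exact figures above, but
it shows the same kind of spread:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NSRV   1600        /* number of servers */
#define WEIGHT 10          /* per-server weight */
#define NREQ   1000000     /* number of requests */
#define NNODES (NSRV * WEIGHT)

struct node {
	uint32_t key;      /* position on the hash ring */
	int srv;           /* owning server */
};

static int cmp_node(const void *a, const void *b)
{
	uint32_t ka = ((const struct node *)a)->key;
	uint32_t kb = ((const struct node *)b)->key;
	return (ka > kb) - (ka < kb);
}

/* crude 32-bit random, good enough for the illustration */
static uint32_t rnd32(void)
{
	return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

int main(void)
{
	static struct node nodes[NNODES];
	static long hits[NSRV];
	long mini, maxi;
	int i;

	srand(0x12345);

	/* each server drops WEIGHT points at random positions on the ring */
	for (i = 0; i < NNODES; i++) {
		nodes[i].key = rnd32();
		nodes[i].srv = i / WEIGHT;
	}
	qsort(nodes, NNODES, sizeof(*nodes), cmp_node);

	/* each request draws a random key and goes to the first node at or
	 * after that key (wrapping around), as if all loads were equal
	 */
	for (i = 0; i < NREQ; i++) {
		uint32_t key = rnd32();
		int lo = 0, hi = NNODES;

		while (lo < hi) {
			int mid = (lo + hi) / 2;

			if (nodes[mid].key < key)
				lo = mid + 1;
			else
				hi = mid;
		}
		hits[nodes[lo == NNODES ? 0 : lo].srv]++;
	}

	mini = maxi = hits[0];
	for (i = 1; i < NSRV; i++) {
		if (hits[i] < mini) mini = hits[i];
		if (hits[i] > maxi) maxi = hits[i];
	}
	printf("min=%ld max=%ld avg=%ld\n", mini, maxi, (long)NREQ / NSRV);
	return 0;
}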
In addition, only the lowest 28 bits of server IDs are used for the
key calculation, which means that node keys are more deterministic
than they ought to be. Using purely random keys restricted to these
lowest 28 bits packs the values a bit better, with a min around 530
and a max around 710, and most values between 550 and 680.
This can only be compensated for by increasing weights and draws,
without being a perfect fix either. At 4 draws, the min is around 560
and the max around 670, with most values between 590 and 650.
This patch takes another approach to the problem: when servers are
tied on their loads, instead of arbitrarily taking the second one, we
now compare their current request rates, which are constantly updated
and smoothed over one second, and we pick the server with the lowest
request rate. Now with 2 draws, the curve is mostly flat, with the min
at 580 and the max at 628, and almost all values between 611 and 625.
And 4 draws only gives values from 614 to 624.
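In practice this means that, with equal weights, a server currently
receiving 30 req/s is preferred over one receiving 40 req/s; with
different weights, the rates are compared relative to the servers'
effective weights, which the code below does by cross-multiplying
rather than dividing.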
Other points will need to be addressed separately (bits of server IDs,
maybe refining the hash algorithm), but those would affect how caches
are selected and cannot be changed without an extra option. For random,
however, we can make this change without impacting anyone.
This should be backported, probably only to 3.3 since that's where the
"random" algorithm became the default.
will take away N-1 of the highest loaded servers at the
expense of performance. With very high values, the algorithm
will converge towards the leastconn's result but much slower.
+ In addition, for large server farms with very low loads (or
+ perfect balance), comparing loads will often lead to a tie,
+ so in case of equal loads between all measured servers, their
+ request rates over the last second are compared, which helps
+ better balance server usage over time in the same spirit as
+ roundrobin does, and smooths out consistent hash unfairness.
The default value is 2, which generally shows very good
- distribution and performance. This algorithm is also known as
+ distribution and performance. For large farms with low loads
+ (less than a few requests per second per server), it may help
+ to raise it to 3 or even 4. This algorithm is also known as
the Power of Two Random Choices and is described here :
http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
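For example (the backend and server names below are made up, using the
documented random(<draws>) syntax), a large low-load farm could raise
the number of draws in its backend:

  backend big_farm
      balance random(3)   # 3 draws per request instead of the default 2
      server s1 192.0.2.1:80 weight 10
      server s2 192.0.2.2:80 weight 10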
/* compare the new server to the previous best choice and pick
* the one with the least currently served requests.
*/
- if (prev && prev != curr &&
- curr->served * prev->cur_eweight > prev->served * curr->cur_eweight)
- curr = prev;
+ if (prev && prev != curr) {
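+ /* cross-multiplying by the other server's effective weight
+  * compares the served/eweight ratios without any division
+  */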
+ uint64_t wcurr = (uint64_t)curr->served * prev->cur_eweight;
+ uint64_t wprev = (uint64_t)prev->served * curr->cur_eweight;
+
+ if (wcurr > wprev)
+ curr = prev;
+ else if (wcurr == wprev && curr->counters.shared.tg && prev->counters.shared.tg) {
+ /* same load: pick the lowest weighted request rate */
+ wcurr = read_freq_ctr_period_estimate(&curr->counters._sess_per_sec, MS_TO_TICKS(1000));
+ wprev = read_freq_ctr_period_estimate(&prev->counters._sess_per_sec, MS_TO_TICKS(1000));
+ if (wprev * curr->cur_eweight < wcurr * prev->cur_eweight)
+ curr = prev;
+ }
+ }
} while (--draws > 0);
/* if the selected server is full, pretend we have none so that we reach