From: Willy Tarreau
Date: Thu, 16 Apr 2026 08:48:43 +0000 (+0200)
Subject: MEDIUM: threads: change the default max-threads-per-group value to 16
X-Git-Url: http://git.ipfire.org/index.cgi?a=commitdiff_plain;h=0af603f46f;p=thirdparty%2Fhaproxy.git

MEDIUM: threads: change the default max-threads-per-group value to 16

A lot of our subsystems are now shared by thread group (listeners,
queues, stick-tables, stats, idle connections, LB algos). This has made
it possible to recover the performance that used to be out of reach on
loosely shared platforms (typically AMD EPYC systems), but in parallel
other large unified systems (Xeon and large Arm in general) still
suffer from the remaining contention when placing too many threads in a
group.

A first test running on a 64-core Neoverse-N1 processor with a single
backend with one server and no LB algo specified shows 1.58 Mrps with
64 threads per group, and 1.71 Mrps with 16 threads per group. The
difference is essentially spent updating stats counters everywhere.

Another test is the connection:close mode, delivering 85 kcps with 64
threads per group, and 172 kcps (202%) with 16 threads per group. In
this case it's mostly the more numerous listeners which improve the
situation, as the change is mostly in the kernel:

max-threads-per-group 64:

  # perf top
  Samples: 244K of event 'cycles', 4000 Hz, Event count (approx.): 61065854708 lost
  Overhead  Shared Object  Symbol
    10.41%  [kernel]       [k] queued_spin_lock_slowpath
    10.36%  [kernel]       [k] _raw_spin_unlock_irqrestore
     2.54%  [kernel]       [k] _raw_spin_lock
     2.24%  [kernel]       [k] handle_softirqs
     1.49%  haproxy        [.] process_stream
     1.22%  [kernel]       [k] _raw_spin_lock_bh

  # h1load
  time  conns  tot_conn  tot_req  tot_bytes  err   cps   rps   bps    ttfb
     1   1024     84560    83536    4761666    0  84k5  83k5  38M0  11.91m
     2   1024    168736   167713    9559698    0  84k0  84k0  38M3  11.98m
     3   1024    253865   252841   14412165    0  85k0  85k0  38M7  11.84m
     4   1024    339143   338119   19272783    0  85k1  85k1  38M8  11.80m
     5   1024    424204   423180   24121374    0  84k9  84k9  38M7  11.86m

max-threads-per-group 16:

  # perf top
  Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 375998622679 lost
  Overhead  Shared Object  Symbol
    15.20%  [kernel]       [k] queued_spin_lock_slowpath
     4.31%  [kernel]       [k] _raw_spin_unlock_irqrestore
     3.33%  [kernel]       [k] handle_softirqs
     2.54%  [kernel]       [k] _raw_spin_lock
     1.46%  haproxy        [.] process_stream
     1.12%  [kernel]       [k] _raw_spin_lock_bh

  # h1load
  time  conns  tot_conn  tot_req  tot_bytes  err   cps   rps   bps   ttfb
     1   1020    172230   171211    9759255    0  172k  171k  78M0  5.817m
     2   1024    343482   342460   19520277    0  171k  171k  78M0  5.875m
     3   1021    515947   514926   29350953    0  172k  172k  78M5  5.841m
     4   1024    689972   688949   39270207    0  173k  173k  79M2  5.783m
     5   1024    863904   862881   49184274    0  173k  173k  79M2  5.795m

So let's change the default value to 16. It also happens to match what's
used by default on EPYC systems these days. This change was marked
MEDIUM as it will increase the number of listening sockets on some
systems, to match their counterparts from other vendors, which is
easier for capacity planning.
---

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 230747968..67f4337af 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -3037,11 +3037,17 @@ master-worker no-exit-on-failure
 max-threads-per-group
   Defines the maximum number of threads in a thread group. Unless the number
-  of thread groups is fixed with the thread-groups directive, haproxy will
-  create more thread groups if needed. The default and maximum value is 64.
-  Having a lower value means more groups will potentially be created, which
-  can help improve performances, as a number of data structures are per
-  thread group, and that will mean less contention
+  of thread groups is fixed with the "thread-groups" directive, haproxy will
+  create as many thread groups as needed to satisfy the requested number of
+  threads. The minimum value is 1, and the maximum value is 64 (on 64-bit
+  systems, or 32 on 32-bit systems). Lower values reduce contention caused by
+  atomic operations on shared states, but can increase the number of sockets
+  needed to create all listeners and to hold idle backend connections. Higher
+  values will reduce these costs, at the expense of higher CPU usage under
+  contended situations, and lower connection rates. The default value is 16,
+  which provides the best tradeoff that was experimentally found on various
+  tested systems, including x86_64 processors from multiple vendors, and large
+  Arm64 systems, both on bare metal and hypervisors.

 mworker-max-reloads
   In master-worker mode, this option limits the number of time a worker can

diff --git a/include/haproxy/defaults.h b/include/haproxy/defaults.h
index 2c9896047..54e733d12 100644
--- a/include/haproxy/defaults.h
+++ b/include/haproxy/defaults.h
@@ -49,6 +49,15 @@
 #define MAX_THREADS_PER_GROUP __WORDSIZE

+/* Default value for the maximum number of threads per group. Thread counts
+ * beyond this value will induce the creation of new thread groups and thus
+ * limit contention on highly accessed areas. The value may be changed between
+ * 1 and MAX_THREADS_PER_GROUP via the global "max-threads-per-group" setting.
+ */
+#ifndef DEF_MAX_THREADS_PER_GROUP
+#define DEF_MAX_THREADS_PER_GROUP 16
+#endif
+
 /* threads enabled, max_threads defaults to long bits for 1 tgroup or 4 times
  * long bits if more tgroups are enabled.
  */

diff --git a/src/haproxy.c b/src/haproxy.c
index 39d1a0ac6..292679dfe 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -209,7 +209,7 @@ struct global global = {
 #endif
 	/* by default allow clients which use a privileged port for TCP only */
 	.clt_privileged_ports = HA_PROTO_TCP,
-	.maxthrpertgroup = MAX_THREADS_PER_GROUP,
+	.maxthrpertgroup = DEF_MAX_THREADS_PER_GROUP,
 	/* others NULL OK */
 };
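
For illustration, a minimal configuration sketch of how the new default
applies (the thread count here is hypothetical, not taken from the
commit):

```
# With 48 threads and the new default of 16 threads per group,
# haproxy now creates 3 thread groups automatically.
global
    nbthread 48
    # max-threads-per-group 16   # implicit default after this commit
    # max-threads-per-group 64   # previous behaviour: one large group
```

Deployments that relied on the old single-group layout can pin the
previous value explicitly with "max-threads-per-group 64".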