DOC: design-thoughts: commit numa-auto.txt
author    Willy Tarreau <w@1wt.eu>
          Tue, 5 Sep 2023 07:20:15 +0000 (09:20 +0200)
committer Willy Tarreau <w@1wt.eu>
          Fri, 14 Mar 2025 17:30:30 +0000 (18:30 +0100)
Lots of collected data and observations aggregated into a single commit
so as not to lose them. Some parts below come from several commit
messages and are incremental.

Add captures and analysis of intel 14900 where it's not easy to draw
the line between the desired P and E cores.

The 14900 raises some questions (imagine a dual-die variant in multi-socket).
That's the start of an algorithmic distribution of performance cores into
thread groups.

cpu-map currently conflicts a lot with the choices made after auto-detection,
but it doesn't have to. The problem is the inability to configure the CPU
bindings for the whole process the way taskset does. By offering this ability
we can also start to designate groups of CPUs symbolically (package, die,
ccx, cores, smt).

It can also be useful to exploit the info from cpuinfo that is not
available in /sys, such as the model number. At least on arm, higher
numbers indicate bigger cores and can be useful to distinguish cores
inside a cluster. It will not indicate big vs medium ones of the same
type (e.g. a78 3.0 vs 2.4 GHz) but can still be effective at identifying
the efficient ones.

In short, information such as the cluster ID is not always reliable, and is
local to the package; die_id as well. The die number is not reported here
but should definitely be used, with a higher priority than L3.

We're still missing a discriminant between the L3 and cluster number
in order to address heterogeneous CPUs (e.g. intel 14900), though in
terms of locality that's currently done correctly.

CPU selection is also a full topic, and some thoughts were noted
regarding sorting by perf vs locality so as never to mix inter-
socket CPUs due to sorting.

The proposed cpu-selection cannot work as-is, because it acts both on
restriction and preference, and these two are not independent actions but a
sequence. First the restrictions must be enforced, and second the remaining
CPUs are sorted according to the preferred criterion, and a number of threads
is selected.

Currently we refine the OS-exposed cluster number but it's not correct
as we can end up with something poorly numbered. We need to respect the
LLC in any case so let's explain the approach.

doc/design-thoughts/numa-auto.txt [new file with mode: 0644]

diff --git a/doc/design-thoughts/numa-auto.txt b/doc/design-thoughts/numa-auto.txt
new file mode 100644 (file)
index 0000000..c58695b
--- /dev/null
@@ -0,0 +1,1458 @@
+2023-07-04 - automatic grouping for NUMA
+
+
+Xeon: (W2145)
+
+willy@debian:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+Wtap: i7-8650U
+
+willy@wtap:~ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+pcw: i7-6700k
+
+willy@pcw:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+nfs: N5105, v5.15
+
+willy@nfs:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+eeepc: Atom N2800, 5.4 : no L3, L2 not shared.
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+
+dev13: Ryzen 2700X
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+dev12: Ryzen 5800X
+
+haproxy@dev12:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+amd24: EPYC 74F3
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-2,24-26
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:6-8,30-32
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,24
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-47
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,24
+
+
+xeon24: Gold 6212U
+
+willy@mt01:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:0-47
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+SPR 8480+
+
+$ grep -a '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-55,112-167
+/sys/devices/system/node/node1/cpulist:56-111,168-223
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,112
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,112
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/cache/*/shared_cpu_list
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-55,112-167
+
+
+UP Board - Atom X5-8350 : no L3, exactly like Armada8040
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu{0,1,2,3}/cache/index2/*list
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_list:2-3
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+Atom D510 - kernel 2.6.33
+
+$ strings -fn1 sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index0/type: Data
+sys/devices/system/cpu/cpu0/cache/index1/type: Instruction
+sys/devices/system/cpu/cpu0/cache/index2/type: Unified
+
+$ strings -fn1 sys/devices/system/cpu/cpu?/topology/*list
+sys/devices/system/cpu/cpu0/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu0/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu1/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu1/topology/thread_siblings_list: 1,3
+sys/devices/system/cpu/cpu2/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu2/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu3/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu3/topology/thread_siblings_list: 1,3
+
+mcbin: Armada 8040 : no L3, no difference with L3 not reported
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/monolithic: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/Hemisphere: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@ampere:~$ grep '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-39
+/sys/devices/system/node/node1/cpulist:40-79
+
+
+LX2A: LX2160A => L3 not reported
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-15
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-15
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Rock5B: RK3588 (big-little A76+A55)
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu{0,4,6}/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+/sys/devices/system/cpu/cpu4/topology/core_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:4-5
+/sys/devices/system/cpu/cpu4/topology/die_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/package_cpus_list:4-5
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+/sys/devices/system/cpu/cpu6/topology/core_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/core_siblings_list:6-7
+/sys/devices/system/cpu/cpu6/topology/die_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/package_cpus_list:6-7
+/sys/devices/system/cpu/cpu6/topology/thread_siblings_list:6
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:414
+/sys/devices/system/cpu/cpu1/cpu_capacity:414
+/sys/devices/system/cpu/cpu2/cpu_capacity:414
+/sys/devices/system/cpu/cpu3/cpu_capacity:414
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+/sys/devices/system/cpu/cpu6/cpu_capacity:1024
+/sys/devices/system/cpu/cpu7/cpu_capacity:1024
+
+
+Firefly: RK3399 (2xA72 + 4xA53) kernel 6.1.28
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu0/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu0/cache/index?/type: No such file or directory
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+root@firefly:~# dmesg|grep cacheinfo
+[    0.006290] cacheinfo: Unable to detect cache hierarchy for CPU 0
+[    0.016339] cacheinfo: Unable to detect cache hierarchy for CPU 1
+[    0.017692] cacheinfo: Unable to detect cache hierarchy for CPU 2
+[    0.019050] cacheinfo: Unable to detect cache hierarchy for CPU 3
+[    0.020478] cacheinfo: Unable to detect cache hierarchy for CPU 4
+[    0.021660] cacheinfo: Unable to detect cache hierarchy for CPU 5
+[    1.990108] cacheinfo: Unable to detect cache hierarchy for CPU 0
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus:0f
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/cluster_id:0
+/sys/devices/system/cpu/cpu0/topology/core_cpus:01
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:3f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-5
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-5
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:381
+/sys/devices/system/cpu/cpu1/cpu_capacity:381
+/sys/devices/system/cpu/cpu2/cpu_capacity:381
+/sys/devices/system/cpu/cpu3/cpu_capacity:381
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+
+
+VIM3L: S905D3 (4*A55), kernel 5.14.10
+
+$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_cpus:1
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus:1
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/die_id:-1
+/sys/devices/system/cpu/cpu0/topology/package_cpus:f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:1
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:1024
+/sys/devices/system/cpu/cpu1/cpu_capacity:1024
+/sys/devices/system/cpu/cpu2/cpu_capacity:1024
+/sys/devices/system/cpu/cpu3/cpu_capacity:1024
+
+
+Odroid-N2: S922X (4*A73 + 2*A53), kernel 4.9.254
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+willy@n2:~$ sudo dmesg|grep -i 'cache hi'
+[    0.649924] Unable to detect cache hierarchy for CPU 0
+
+No capacity.
+
+Note that it reports 2 physical packages!
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:03
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-1
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu4/topology/*
+/sys/devices/system/cpu/cpu4/topology/core_id:2
+/sys/devices/system/cpu/cpu4/topology/core_siblings:3c
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:2-5
+/sys/devices/system/cpu/cpu4/topology/physical_package_id:1
+/sys/devices/system/cpu/cpu4/topology/thread_siblings:10
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+
+StarFive VisionFive2 - JH7110, kernel 5.15
+
+willy@starfive:~/haproxy$ ./haproxy -c -f cps3.cfg
+thr   0 -> cpu   0  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=000 l1=000
+thr   1 -> cpu   1  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=001 l1=001
+thr   2 -> cpu   2  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=002 l1=002
+thr   3 -> cpu   3  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=003 l1=003
+Configuration file is valid
+
+Graviton2 / Graviton3 ?
+
+
+On PPC64 not everything is available:
+
+  https://www.ibm.com/docs/en/linux-on-systems?topic=cpus-cpu-topology
+
+  /sys/devices/system/cpu/cpu<N>/topology/thread_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/core_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/book_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/drawer_siblings
+
+  # lscpu -e
+  CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
+  0   1    0      0    0      0    0:0:0:0         yes    yes        horizontal   0
+  1   1    0      0    0      0    1:1:1:1         yes    yes        horizontal   1
+  2   1    0      0    0      1    2:2:2:2         yes    yes        horizontal   2
+  3   1    0      0    0      1    3:3:3:3         yes    yes        horizontal   3
+  4   1    0      0    0      2    4:4:4:4         yes    yes        horizontal   4
+  5   1    0      0    0      2    5:5:5:5         yes    yes        horizontal   5
+  6   1    0      0    0      3    6:6:6:6         yes    yes        horizontal   6
+  7   1    0      0    0      3    7:7:7:7         yes    yes        horizontal   7
+  8   0    1      1    1      4    8:8:8:8         yes    yes        horizontal   8
+  ...
+
+Intel E5-2600v2/v3 has two L3:
+   https://www.enterpriseai.news/2014/09/08/intel-ups-performance-ante-haswell-xeon-chips/
+
+More info on these, and s390's "books" (mostly L4 in fact):
+   https://groups.google.com/g/fa.linux.kernel/c/qgAxjYq8ohI
+
+########################################
+Analysis:
+  - some server ARM CPUs (Altra, LX2) do not return any L3 info though they
+    DO have some. They stop at L2.
+
+  - other CPUs like Atom N2800 and Armada 8040 do not have L3.
+
+  => there's no apparent way to detect that the server CPUs do have an L3.
+  => or maybe we should consider that it's more likely that there is one
+     than none ? Armada works much better with groups than without. It's
+     basically the same topology as N2800.
+
+  => Do we really care then ? No L3 = same L3 for everyone. The problem is
+     that those really without an L3 will make a difference on L2 while the
+     others will not. Maybe we should consider that it does not make sense
+     to cut groups on L2 (i.e. under no circumstances will we have one group
+     per core).
+
+  => This would mean:
+       - regardless of L3, consider LLC. If the LLC has more than one
+         core per instance, it's likely the last one (not true on LX2
+         but better use 8 groups of 2 than nothing).
+
+       - otherwise if there's a single core per instance, it's unlikely
+         to be the LLC so we can imagine the LLC is unified. Note that
+         some systems such as LX2/Armada8K (and Neoverse-N1 devices as
+         well) may have 2 cores per L2, yet this doesn't allow us to infer
+         anything regarding the absence of an L3. Core2-quad has 2 cores
+         per L2 with no L3, like Armada8K. LX2 has 2 cores per L2 yet does
+         have an L3 which is not necessarily reported.
+
+       - this needs to be done per {node,package} !
+         => core_siblings and thread_siblings seem to be the only portable
+            ones to figure packages and threads
+
+At the very least, when multiple nodes are possibly present, there is a
+symlink "node0", "node1" etc in the cpu entry. It requires a lookup for each
+cpu directory though while reading /sys/devices/system/node/node*/cpulist is
+much cheaper.
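+
+A minimal sketch (not haproxy code, just an illustration) of the cheaper
+approach: one short read per node instead of one directory lookup per CPU.
+The parse_cpu_list() helper is hypothetical and only handles the "a-b,c"
+syntax seen in the captures above.
+
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <string.h>
+
+  #define MAX_CPUS 1024
+
+  /* hypothetical helper: mark cpu_node[cpu] = node for each CPU in a "0-3,8" list */
+  static void parse_cpu_list(const char *list, int node, int *cpu_node)
+  {
+      char buf[4096], *tok, *dash;
+
+      strncpy(buf, list, sizeof(buf) - 1);
+      buf[sizeof(buf) - 1] = 0;
+      for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
+          int lo = atoi(tok), hi = lo;
+
+          if ((dash = strchr(tok, '-')))
+              hi = atoi(dash + 1);
+          while (lo <= hi && lo < MAX_CPUS)
+              cpu_node[lo++] = node;
+      }
+  }
+
+  int main(void)
+  {
+      int cpu_node[MAX_CPUS], node, cpu;
+      char path[128], line[4096];
+      FILE *f;
+
+      for (cpu = 0; cpu < MAX_CPUS; cpu++)
+          cpu_node[cpu] = -1;              /* -1 = not listed by any node */
+
+      /* one small file per node, much cheaper than a per-cpu lookup */
+      for (node = 0; ; node++) {
+          snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
+          if (!(f = fopen(path, "r")))
+              break;
+          if (fgets(line, sizeof(line), f))
+              parse_cpu_list(line, node, cpu_node);
+          fclose(f);
+      }
+
+      for (cpu = 0; cpu < MAX_CPUS; cpu++)
+          if (cpu_node[cpu] >= 0)
+              printf("cpu %d -> node %d\n", cpu, cpu_node[cpu]);
+      return 0;
+  }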
+
+There's some redundancy in this. Probably a better approach:
+
+1) if there is more than 1 CPU:
+  - if cache/index3 exists, use its cpulist to pre-group entries.
+  - else if topology or node exists, use (node,package,die,core_siblings) to
+    group entries
+  - else pre-create a single large group
+
+2) if there is more than 1 CPU and less than max#groups:
+  - for each group, if no cache/index3 exists and cache/index2 exists and some
+    index2 entries contain at least two CPUs of different cores or a single one
+    for a 2-core system, then use that to re-split the group.
+
+  - if in the end there are too many groups, remerge some of them (?) or stick
+    to the previous layout (?)
+
+  - if in the end there are too many CPUs in a group, cut as needed, if
+    possible with an integral result (/2, /3, ...)
+
+3) L1 cache / thread_siblings should be used to associate CPUs by cores in
+   the same groups.
+
+Maybe instead it should be done bottom->top by collecting info and merging
+groups while keeping CPU lists ordered to ease later splitting (a sketch in C
+follows the steps below).
+
+  1) create a group per bound CPU
+  2) based on thread_siblings, detect CPUs that are on the same core, merge
+     their groups. They may not always create similarly sized groups.
+     => eg: epyc keeps 24 groups such as {0,24}, ...
+            ryzen 2700x keeps 4 groups such as {0,1}, ...
+            rk3588 keeps 3 groups {0-3},{4-5},{6-7}
+  3) based on cache index0/1, detect CPUs that are on the same L1 cache,
+     merge their groups. They may not always create similarly sized groups.
+  4) based on cache index2, detect CPUs that are on the same L2 cache, merge
+     their groups. They may not always create similarly sized groups.
+     => eg: mcbin now keeps 2 groups {0-1},{2,3}
+  5) At this point there may possibly be too many groups (still one per CPU,
+     e.g. when no cache info was found or there are many cores with their own
+     L2 like on SPR) or too large one (when all cores are indeed on the same
+     L2).
+
+     5.1) if there are as many groups as bound CPUs, merge them all together in
+          a single one => lx2, altra, mcbin
+     5.2) if there are still more than max#groups, merge them all together in a
+          single one since the splitting criterion is not relevant
+     5.3) if there is a group with too many CPUs, split it in two if integral,
+          otherwise 3, etc, trying to add the least possible number of groups.
+          If too difficult (e.g. result less than half the authorized max),
+          let's just round around N/((N+63)/64).
+     5.4) if at the end there are too many groups, warn that we can't optimize
+          the setup and are limiting ourselves to the first node or 64 CPUs.
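+
+A rough sketch in C of the bottom-up merge above (steps 1-4), using a
+union-find. It assumes the per-CPU sibling/cache lists were already parsed:
+ts_first[], l1_first[] and l2_first[] hold the first CPU listed in each CPU's
+thread_siblings_list, cache/index0 and cache/index2 shared_cpu_list (-1 when
+the file is absent). Only the grouping logic is shown, not haproxy's real
+data structures.
+
+  #include <stdio.h>
+
+  #define MAX_CPUS 256
+
+  static int grp[MAX_CPUS];      /* union-find parent: one group per CPU at start */
+
+  static int find(int c)
+  {
+      while (grp[c] != c)
+          c = grp[c] = grp[grp[c]];   /* path halving */
+      return c;
+  }
+
+  static void merge(int a, int b)
+  {
+      if (a >= 0 && b >= 0)
+          grp[find(a)] = find(b);
+  }
+
+  void make_groups(int nb_cpus, const int *ts_first, const int *l1_first,
+                   const int *l2_first)
+  {
+      int c, nbgrp = 0;
+
+      for (c = 0; c < nb_cpus; c++)
+          grp[c] = c;                       /* step 1: one group per bound CPU */
+
+      for (c = 0; c < nb_cpus; c++) {
+          merge(c, ts_first[c]);            /* step 2: same core (SMT siblings) */
+          merge(c, l1_first[c]);            /* step 3: same L1 cache            */
+          merge(c, l2_first[c]);            /* step 4: same L2 cache            */
+      }
+
+      for (c = 0; c < nb_cpus; c++)
+          if (find(c) == c)
+              nbgrp++;
+      printf("%d groups before step 5 (re-merge or split)\n", nbgrp);
+  }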
+
+Observations:
+ - lx2 definitely works better with everything bound together than by creating
+   8 groups (~130k rps vs ~120k rps)
+   => does this mean we should assume a unified L3 if there's no L3 info, and
+      remerge everything ? Likely Altra would benefit from this as well. mcbin
+      doesn't notice any change (within noise in both directions)
+
+ - on x86 13th gen, 2 P-cores and 8 E-cores. The P-cores support HT, not the
+   E-cores. There's no cpu_capacity there, but the cluster_id is properly set.
+   => proposal: when a machine reports both single-threaded cores and SMT,
+      consider the SMT ones bigger and use them.
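+
+A sketch of that proposal, assuming nb_siblings[cpu] was parsed from
+thread_siblings_list: on a machine mixing SMT-capable and single-threaded
+cores, the SMT-capable ones are assumed to be the big ones and preferred.
+
+  /* sketch only: prefer SMT-capable cores when the machine mixes both kinds */
+  int cpu_is_preferred(int cpu, const int *nb_siblings, int nb_cpus)
+  {
+      int c, has_smt = 0, has_st = 0;
+
+      for (c = 0; c < nb_cpus; c++) {
+          if (nb_siblings[c] > 1)
+              has_smt = 1;
+          else
+              has_st = 1;
+      }
+      if (has_smt && has_st)               /* mixed machine (e.g. 13th gen P+E) */
+          return nb_siblings[cpu] > 1;     /* SMT-capable => assumed P-core     */
+      return 1;                            /* homogeneous: keep everything      */
+  }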
+
+Problems: how should auto-detection interfere with user settings ?
+
+- Case 1: program started with a reduced taskset
+  => current: this serves to set the thread count first, and to map default
+     threads to CPUs if they are not assigned by a cpu-map.
+
+  => we want to keep that behavior (i.e. use all these threads) but only
+     change how the thread-groups are arranged.
+
+  - example: start on the first 6c12t of an EPYC74F3, should automatically
+    create 2 groups for the two sockets.
+
+  => should we brute-force all thread-groups combinations to figure how the
+     threads will spread over cpu-map and which one is better ? Or should we
+     decide to ignore input mapping as soon as there's at least one cpu-map?
+     But then which one to use ? Or should we consider that cpu-map only works
+     with explicit thread-groups ?
+
+- Case 2: taskset not involved, but nbthread and cpu-map in the config. In
+  fact a pretty standard 2.4-2.8 config.
+  => maybe the presence of cpu-map and no thread-groups should be sufficient
+     to imply a single thread-group to stay compatible ? Or maybe start as
+     many thread-groups as are referenced in cpu-map ? Seems like cpu-map and
+     thread-groups work hand-in-hand regarding topology since cpu-map
+     designates hardware CPUs so the user knows better than haproxy. Thus
+     why should we try to do better ?
+
+- Case 3: taskset not involved, nbthread not involved, cpu-map not involved,
+  only thread-groups
+  => seems like an ideal approach. Take all online CPUs and try to cut them
+     into equitable thread groups ? Or rather, since nbthread is not forced,
+     better sort the clusters and bind to the first N clusters only ? If too
+     many groups for the clusters, then try to refine them ?
+
+- Case 4: nothing specified at all (default config, target)
+  => current: uses only one thread-group with all threads (max 64).
+  => desired: bind only to performance cores and cut them in a few groups
+     based on l3, package, cluster etc.
+
+- Case 5: nbthread only in the config
+  => might match a docker use case. No group nor cpu-map configured. Figure
+     the best group usage respecting the thread count.
+
+- Case 6: some constraints are enforced in the config (e.g. threads-hard-limit,
+  one-thread-per-core, etc).
+  => like 3, 4 or 5 but with selection adjustment.
+
+- Case 7: thread-groups and generic cpu-map 1/all, 2/all... in the config
+  => user just wants to use cpu-map as a taskset alternative
+  => need to figure number of threads first, then cut them in groups like
+     today, and only then the cpu-map are found. Can we do better ? Not sure.
+     Maybe just when cpu-map is too lax (e.g. all entries reference the same
+     CPUs). Better use a special "cpu-map all/all 0-19" for this, but not
+     implemented for now.
+
+Proposal:
+  - if there is any cpu-map, disable automatic CPU assignment
+  - if there is any cpu-map, disable automatic thread group detection
+  - if taskset was forced, disable automatic CPU assignment
+
+### 2023-07-17 ###
+
+=> step  1: mark CPUs enabled at boot    (cpu_detect_usable)
+// => step  2: mark CPUs referenced in cpu-map => no, no real meaning
+=> step  3: identify all CPUs topologies + NUMA (cpu_detect_topology)
+
+=> step  4: if taskset && !cpu-map, mark all non-bound CPUs as unusable (UNAVAIL ?)
+            => which is the same as saying if !cpu-map.
+=> step  5: if !cpu-map, sort usable CPUs and find the best set to use
+//=> step  6: if cpu-map, mark all non-covered CPUs are unusable => not necessarily possible if partial cpu-map
+
+=> step  7: if thread-groups && cpu-map, nothing else to do
+=> step  8: if cpu-map && !thread-groups, thread-groups=1
+=> step  9: if thread-groups && !cpu-map, use that value to cut the thread set
+=> step 10: if !cpu-map && !thread-groups, detect the optimal thread-group count
+
+=> step 11: if !cpu-map, cut the thread set into mostly fair groups and assign
+            the group numbers to CPUs; create implicit cpu-maps.
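+
+Steps 7 to 10 boil down to a small decision ladder; a sketch with invented
+names (has_cpu_map, cfg_tgroups, optimal_tgroups()), not the real haproxy.c
+variables:
+
+  /* hypothetical helper computing the optimal count from the detected topology */
+  extern int optimal_tgroups(void);
+
+  int decide_tgroups(int has_cpu_map, int cfg_tgroups)
+  {
+      if (has_cpu_map && cfg_tgroups)
+          return cfg_tgroups;         /* step 7: nothing else to do            */
+      if (has_cpu_map)
+          return 1;                   /* step 8: cpu-map without thread-groups */
+      if (cfg_tgroups)
+          return cfg_tgroups;         /* step 9: use it to cut the thread set  */
+      return optimal_tgroups();       /* step 10: detect the optimal count     */
+  }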
+
+Ideas:
+  - use minthr and maxthr.
+    If nbthread, minthr=maxthr=nbthread, else if taskset_forced, maxthr=taskset_thr,
+    minthr=1, else minthr=1, maxthr=cpus_enabled.
+
+  - use CPU_F_ALLOWED (or DISALLOWED?) and CPU_F_REFERENCED and CPU_F_EXCLUDED ?
+    Note: cpu-map doesn't exclude, it only includes. Taskset does exclude. Also,
+    cpu-map only includes the CPUs that will belong to the correct groups & threads.
+
+  - Usual startup: taskset presets the CPU sets and sets the thread count. Tgrp
+    defaults to 1, then threads indicated in cpu-map get their CPU assigned.
+    Other ones are not changed. If we say that cpu-map => tgrp==1 then it means
+    we can infer automatic grouping for group 1 only ?
+      => it could be said that the CPUs of all enabled groups mentioned in
+         cpu-map are considered usable, but we don't know how many of these
+         will really have threads started on them.
+
+    => maybe completely ignore cpu-map instead (i.e. fall back to thread-groups 1) ?
+    => automatic detection would mean:
+         - if !cpu-map && !nbthrgrp => must automatically detect thgrp
+         - if !cpu-map => must automatically detect binding
+         - otherwise nothing
+
+Examples of problems:
+
+        thread-groups 4
+        nbthreads 128
+        cpu-map 1/all 0-63
+        cpu-map 2/all 128-191
+
+        => 32 threads per group, hence grp 1 uses 0-63 and grp 2 128-191,
+           grp 3 and grp 4 unknown, in practice on boot CPUs.
+
+        => could we demand that if one cpu-map is specified, then all groups
+           are covered ? Do we really need this after all ? i.e. let's just not
+           bind other threads and that's all (and what is written).
+
+
+Calls from haproxy.c:
+
+    cpu_detect_usable()
+    cpu_detect_topology()
+
++   thread_detect_count()
+         => compute nbtgroups
+         => compute nbthreads
+
+    thread_assign_cpus() ?
+
+    check_config_validity()
+
+
+BUGS:
+  - cpu_map[0].proc still used for the whole process in daemon mode (though not
+    in foreground mode)
+    -> whole process bound to thread group 1
+    -> binding not working in foreground
+
+  - cpu_map[x].proc ANDed with the thread's map despite thread's map apparently
+    never set
+    -> group binding ignored ?
+
+2023-09-05
+----------
+Remember to distinguish between sorting (used for grouping) and preference.
+We should avoid selecting the first CPUs as it encourages the use of wrong
+grouping criteria. E.g. CPU capacity has no business being used for grouping,
+it's used for selecting. Support for HT, however, does, because it allows
+packing threads of the same core together.
+
+We should also have an option to enable/disable SMT (e.g. max threads per core)
+so that we can skip siblings of cores already assigned. This can be convenient
+with the network running on the other sibling.
+
+
+2024-12-26
+----------
+
+Some interesting cases about intel 14900. The CPU has 8 P-cores and 16 E-cores.
+Experiments in the lab show excellent performance by binding the network to E
+cores and haproxy to P cores. Here's how the clusters are made:
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/package_cpus | sort |uniq -c
+     32 ffffffff
+
+  => expected
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/die_cpus | sort |uniq -c
+     32 ffffffff
+
+  => all CPUs on the same die
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/cluster_cpus | sort |uniq -c
+      2 00000003
+      2 0000000c
+      2 00000030
+      2 000000c0
+      2 00000300
+      2 00000c00
+      2 00003000
+      2 0000c000
+      4 000f0000
+      4 00f00000
+      4 0f000000
+      4 f0000000
+
+  => 1 "cluster" per core on each P-core (2 threads, 8 clusters total)
+  => 1 "cluster" per 4 E-cores (4 clusters total)
+  => It can be difficult to split that into groups by just using this topology.
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort |uniq -c
+     32 0-31
+
+  => everyone shares a uniform L3 cache
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_map | sort |uniq -c
+      2 00000003
+      2 0000000c
+      2 00000030
+      2 000000c0
+      2 00000300
+      2 00000c00
+      2 00003000
+      2 0000c000
+      4 000f0000
+      4 00f00000
+      4 0f000000
+      4 f0000000
+
+  => L2 is split like the respective "clusters" above.
+
+Seems like one would like to split them into 12 groups :-/  Maybe it still
+remains relevant to consider L3 for grouping, and core performance for
+selection (e.g. evict/prefer E-cores depending on policy).
+
+Differences between P and E cores on 14900:
+
+- acpi_cppc/*perf : pretty useful but not always there (e.g. aloha)
+- cache index0: 48 vs 32k (bigger CPU has bigger cache)
+- cache index1: 32 vs 64k (smaller CPU has bigger cache)
+- cache index2: 2 vs 4M, but dedicated per core vs shared per cluster (4 cores)
+
+=> probably the presence of a larger "cluster" with less cache per core on
+   average is an indication of smaller (efficiency) cores. Warning however,
+   some CPUs (e.g. S922X) have a large (4) cluster of big cores and a small
+   (2) cluster of little cores.
+
+
+diff -urN cpu0/acpi_cppc/lowest_nonlinear_perf cpu16/acpi_cppc/lowest_nonlinear_perf
+--- cpu0/acpi_cppc/lowest_nonlinear_perf        2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/lowest_nonlinear_perf       2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-20
++15
+diff -urN cpu0/acpi_cppc/nominal_perf cpu16/acpi_cppc/nominal_perf
+--- cpu0/acpi_cppc/nominal_perf 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/nominal_perf        2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/acpi_cppc/reference_perf cpu16/acpi_cppc/reference_perf
+--- cpu0/acpi_cppc/reference_perf       2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/reference_perf      2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/cache/index0/size cpu16/cache/index0/size
+--- cpu0/cache/index0/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index0/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-48K
++32K
+diff -urN cpu0/cache/index1/shared_cpu_list cpu16/cache/index1/shared_cpu_list
+--- cpu0/cache/index1/shared_cpu_list   2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_list  2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16
+diff -urN cpu0/cache/index1/shared_cpu_map cpu16/cache/index1/shared_cpu_map
+--- cpu0/cache/index1/shared_cpu_map    2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_map   2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++00010000
+diff -urN cpu0/cache/index1/size cpu16/cache/index1/size
+--- cpu0/cache/index1/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-32K
++64K
+diff -urN cpu0/cache/index2/shared_cpu_list cpu16/cache/index2/shared_cpu_list
+--- cpu0/cache/index2/shared_cpu_list   2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/shared_cpu_list  2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+--- cpu0/cache/index2/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-2048K
++4096K
+diff -urN cpu0/topology/cluster_cpus cpu16/topology/cluster_cpus
+--- cpu0/topology/cluster_cpus  2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++000f0000
+diff -urN cpu0/topology/cluster_cpus_list cpu16/topology/cluster_cpus_list
+--- cpu0/topology/cluster_cpus_list     2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus_list    2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+
+For acpi_cppc, the values differ between machines; it looks like nominal_perf
+is always usable:
+
+14900k:
+$ grep '' cpu8/acpi_cppc/*
+cpu8/acpi_cppc/feedback_ctrs:ref:85172004640 del:143944480100
+cpu8/acpi_cppc/highest_perf:255
+cpu8/acpi_cppc/lowest_freq:0
+cpu8/acpi_cppc/lowest_nonlinear_perf:20
+cpu8/acpi_cppc/lowest_perf:1
+cpu8/acpi_cppc/nominal_freq:3200
+cpu8/acpi_cppc/nominal_perf:40
+cpu8/acpi_cppc/reference_perf:40
+cpu8/acpi_cppc/wraparound_time:18446744073709551615
+
+$ grep '' cpu16/acpi_cppc/*
+cpu16/acpi_cppc/feedback_ctrs:ref:84153776128 del:112977352354
+cpu16/acpi_cppc/highest_perf:255
+cpu16/acpi_cppc/lowest_freq:0
+cpu16/acpi_cppc/lowest_nonlinear_perf:15
+cpu16/acpi_cppc/lowest_perf:1
+cpu16/acpi_cppc/nominal_freq:3200
+cpu16/acpi_cppc/nominal_perf:24
+cpu16/acpi_cppc/reference_perf:24
+cpu16/acpi_cppc/wraparound_time:18446744073709551615
+
+altra:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:227098452801 del:590247062111
+highest_perf:260
+lowest_freq:1000
+lowest_nonlinear_perf:200
+lowest_perf:100
+nominal_freq:2600
+nominal_perf:260
+reference_perf:100
+
+w3-2345:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:4775674480779 del:5675950973600
+highest_perf:45
+lowest_freq:0
+lowest_nonlinear_perf:8
+lowest_perf:5
+nominal_freq:0
+nominal_perf:31
+reference_perf:31
+wraparound_time:18446744073709551615
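+
+A sketch of a per-CPU performance hint based on the captures above: use
+acpi_cppc/nominal_perf when present and fall back to cpufreq/scaling_max_freq
+otherwise. read_cpu_sysfs() is just for the example.
+
+  #include <stdio.h>
+
+  /* read a single integer from /sys/devices/system/cpu/cpu<N>/<file>, -1 if absent */
+  static long read_cpu_sysfs(int cpu, const char *file)
+  {
+      char path[160];
+      long val = -1;
+      FILE *f;
+
+      snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/%s", cpu, file);
+      if ((f = fopen(path, "r"))) {
+          if (fscanf(f, "%ld", &val) != 1)
+              val = -1;
+          fclose(f);
+      }
+      return val;
+  }
+
+  /* rough performance hint for one CPU; -1 when neither source exists */
+  long cpu_perf_hint(int cpu)
+  {
+      long perf = read_cpu_sysfs(cpu, "acpi_cppc/nominal_perf");
+
+      if (perf <= 0)
+          perf = read_cpu_sysfs(cpu, "cpufreq/scaling_max_freq");
+      return perf;
+  }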
+
+Other approaches may consist in checking the CPU's max frequency via
+cpufreq, e.g on the N2:
+
+  $ grep . /sys/devices/system/cpu/cpu?/cpufreq/scaling_max_freq
+  /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2016000
+  /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2016000
+  /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:2400000
+
+However on x86, the cores no longer all have the same frequency, like below on
+the W3-2345, so it cannot always be used to split them into groups; at best
+it may be used to sort them.
+
+  $ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
+  /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu9/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu10/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu11/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu12/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu13/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu14/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq:4300000
+
+On 14900, not cool either:
+
+  $ grep -h . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq|sort|uniq -c
+     16 4400000
+     12 5700000
+      4 6000000
+
+Considering that values that are within +/-10% of a cluster's min/max are still
+part of it would seem to work and would make a good rule of thumb.
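+
+One possible reading of that rule, as a sketch: sort the max frequencies in
+descending order and open a new class whenever a value drops more than 10%
+below the current class's highest frequency. On the samples above this gives
+2 classes on the 14900 (5.7-6.0 GHz vs 4.4 GHz) and a single class on the
+W3-2345.
+
+  #include <stdlib.h>
+
+  static int cmp_desc(const void *a, const void *b)
+  {
+      long fa = *(const long *)a, fb = *(const long *)b;
+      return (fa < fb) - (fa > fb);            /* highest frequency first */
+  }
+
+  /* count frequency classes using the +/-10% rule of thumb */
+  int count_freq_classes(long *freq, int nb)
+  {
+      int i, classes = 0;
+      long class_max = 0;
+
+      qsort(freq, nb, sizeof(*freq), cmp_desc);
+      for (i = 0; i < nb; i++) {
+          if (!classes || freq[i] < class_max - class_max / 10) {
+              classes++;                       /* >10% below: start a new class */
+              class_max = freq[i];
+          }
+      }
+      return classes;
+  }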
+
+On x86, the model number might help, here on w3-2345:
+
+  $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+     16 model           : 143
+
+But not always (here: 14900K with 8xP and 16xE):
+
+  $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+     32 model           : 183
+
+On ARM it's rather the part number:
+
+  # a9
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xc09
+  CPU part        : 0xc09
+
+  # a17
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+
+  # a72
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+
+  # a53+a72
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+
+  # a53+a73
+  $ grep 'part' /proc/cpuinfo
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+
+  # a55+a76
+  $ grep 'part' /proc/cpuinfo
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
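+
+A sketch of turning this into a heterogeneity hint: count the distinct values
+of a /proc/cpuinfo field ("model" on x86, "CPU part" on arm); more than one
+distinct value suggests mixed core types. Purely illustrative parsing, with
+no attempt at robustness.
+
+  #include <stdio.h>
+  #include <string.h>
+
+  /* count distinct values of a /proc/cpuinfo field ("model", "CPU part", ...) */
+  int count_cpuinfo_values(const char *field)
+  {
+      char line[256], seen[16][64];
+      int nb = 0, i;
+      size_t flen = strlen(field);
+      FILE *f = fopen("/proc/cpuinfo", "r");
+
+      if (!f)
+          return -1;
+      while (fgets(line, sizeof(line), f)) {
+          char *colon = strchr(line, ':'), *p = line + flen;
+
+          if (!colon || strncmp(line, field, flen) != 0)
+              continue;
+          while (p < colon && (*p == ' ' || *p == '\t'))
+              p++;
+          if (p != colon)
+              continue;                /* e.g. skips "model name" for "model" */
+          line[strcspn(line, "\n")] = 0;
+          for (i = 0; i < nb; i++)
+              if (strcmp(seen[i], colon + 1) == 0)
+                  break;
+          if (i == nb && nb < 16) {
+              strncpy(seen[nb], colon + 1, sizeof(seen[nb]) - 1);
+              seen[nb][sizeof(seen[nb]) - 1] = 0;
+              nb++;
+          }
+      }
+      fclose(f);
+      return nb;   /* e.g. 1 on the w3-2345 above, 2 on an a53+a73 big.LITTLE */
+  }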
+
+
+2024-12-27
+----------
+
+Such machines with P+E cores are becoming increasingly common. Some like the
+CIX-P1 can even provide 3 levels of performance: 4 big cores (A720-2.8G), 4
+medium cores (A720-2.4G), 4 little cores (A520-1.8G). Architectures like below
+will become the norm, and can be used under different policies:
+
+                      +-----------------------------+
+                      |             L3              |
+                      +---+----------+----------+---+
+                          |          |          |
+                      +---+---+  +---+---+  +---+---+
+                      | P | P |  | E | E |  | E | E |
+                      +---+---+  +---+---+  +---+---+
+   Policy:            | P | P |  | E | E |  | E | E |
+   -------            +---+---+  +---+---+  +---+---+
+   1 group, min:         N/A         0          0
+   1 group, max:          0         N/A        N/A
+   1 group, all:          0          0          0
+   2 groups, min:        N/A         0          1
+   2 groups, full:        0          1          1
+   3 groups:              0          1          2
+
+In dual-socket or multiple dies it can even become more complicated:
+
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+       |          |          |
+   +---+----------+----------+---+
+   |            L3.0             |
+   +-----------------------------+
+
+   +-----------------------------+
+   |            L3.1             |
+   +---+----------+----------+---+
+       |          |          |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+
+Setting only a thread count would yield interesting things above:
+  1-4T: P.0
+  5-8T: P.0, P.1 (2 grp)
+ 9-16T: P.0, E.0, P.1, E.1 (3-4 grp)
+17-24T: PEE.0, PEE.1 (5-6 grp)
+
+With forced tgrp = 1:
+  - only fill node 0 first (P then PE, then PEE)
+
+With forced tgrp = 2:
+
+  def:  P.0, P.1
+  2-4T: P.0 only ?
+  6-8T: P.0, P.1
+ 9-24T: PEE.0, PEE.1
+
+With dual-socket, dual-die, it becomes:
+
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+       |          |          |      '      |          |          |
+   +---+----------+----------+---+  '  +---+----------+----------+---+
+   |            L3.0.0           |  '  |            L3.1.0           |
+   +-----------------------------+  '  +-----------------------------+
+                                    '
+   +-----------------------------+  '  +-----------------------------+
+   |            L3.0.1           |  '  |            L3.1.1           |
+   +---+----------+----------+---+  '  +---+----------+----------+---+
+       |          |          |      '      |          |          |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+
+In such conditions, it could make sense to first enumerate all the available
+cores with all their characteristics, and distribute them between "buckets"
+representing the thread groups (a sketch follows the list):
+
+  1. create the min number of tgrp (tgrp.min)
+  2. it's possible to automatically create more until tgrp.max
+  -> cores are sorted by performance then by proximity. They're
+     distributed in order into existing buckets, and if too distant,
+     then new groups are created. It could allow for example to use
+     all P-cores in the DSDD model above, split into 4 tgrp.
+  -> the total number of threads is then discovered at the end.
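+
+A sketch of that distribution, assuming the cores were already sorted by
+performance then proximity, and that a too_distant() placeholder implements
+the locality test (different LLC/die/socket). All names are invented for the
+illustration.
+
+  struct bucket { int first_core; int nb_cores; };
+
+  /* placeholder for the locality test (different LLC/die/socket) */
+  extern int too_distant(int core_a, int core_b);
+
+  /* NB: balancing cores across the tgrp.min buckets is left out of this sketch */
+  int distribute(const int *cores, int nb_cores, int tgrp_min, int tgrp_max,
+                 struct bucket *bkt)
+  {
+      int nbgrp = 0, i, g;
+
+      for (g = 0; g < tgrp_min; g++)            /* 1. start with tgrp.min buckets */
+          bkt[nbgrp++] = (struct bucket){ .first_core = -1, .nb_cores = 0 };
+
+      for (i = 0; i < nb_cores; i++) {
+          for (g = 0; g < nbgrp; g++)
+              if (bkt[g].first_core < 0 || !too_distant(cores[i], bkt[g].first_core))
+                  break;
+          if (g == nbgrp) {                     /* too distant from every bucket */
+              if (nbgrp >= tgrp_max)
+                  continue;                     /* no room left: skip this core  */
+              bkt[nbgrp++] = (struct bucket){ .first_core = -1, .nb_cores = 0 };
+          }
+          if (bkt[g].first_core < 0)
+              bkt[g].first_core = cores[i];
+          bkt[g].nb_cores++;                    /* total thread count found at the end */
+      }
+      return nbgrp;                             /* 2. may have grown up to tgrp.max */
+  }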
+
+
+It seems in the end that such binding policies (P, E, single/multi dies,
+single/multi sockets etc) should be made more accessible to the user. What
+we're missing in "cpu-map" is in fact the ability to apply to the whole
+process, so that it can supersede taskset. Indeed, right now, cpu-map requires
+too many details and that's why it often remains easier to deal with taskset,
+particularly when dealing with thread groups.
+
+We can revisit the situation differently. First, let's keep in mind that
+cpu-map is a restriction. It means "use no more than these", it does not
+mean "use all of these". So it totally makes sense to use it to replace
+taskset at the process level without interfering with groups detection.
+We could then have:
+
+  - "cpu-map all|process|global|? ..."  to apply to the whole process
+  - then special keywords for the CPUs designation, among:
+     - package (socket) number
+     - die number (CCD)
+     - L3 number (CCX)
+     - cluster type (big/performant, medium, little/efficient)
+     - use of SMT or not, and which ones
+  - maybe optional numbers before these to indicate (any two of them),
+    e.g. "4P" to indicate "4 performance cores".
+
+Question: how would we designate "only P cores of socket 0" ? Or
+          "only thread 0 of all P cores" ?
+
+One benefit of such a declaration method is that it can make nbthread often
+useless and automatic while still portable across a whole fleet of servers. E.g.
+if "cpu-map all S0P*T0" would designate thread 0 of all P-cores of socket 0, it
+would mean the same on all machines.
+
+Another benefit is that we can make cpu-map and automatic detection more
+exclusive:
+  - cpu-map all => equivalent of taskset, leaves auto-detection on
+  - cpu-map thr => disables auto-detection
+
+So in the end:
+  - cpu-map all restricts the CPUs the process may use
+    -> auto-detection starts from here and sorts them
+  - thread-groups offers more "buckets" to arrange distant CPUs in the
+    same process
+  - nbthread limits the number of threads we'll use
+    -> pick the most suited ones (at least thr.min, at most thr.max)
+       and distribute them optimally among the number of thread groups.
+
+One question remains: is it always possible to automatically configure
+thread-groups ? Maybe it's possible after the detection to set an optimal
+one between grp.min and grp.max ? (e.g. socket count, core types, etc).
+
+It still seems that a policy such as "optimize-for resources|performance"
+would help quite a bit.
+
+-> what defines a match between a CPU core and a group:
+   - cluster identification:
+     - either cluster_cpus if present (which is only sometimes), or:
+     - pkg+die+ccd number
+     - same LLC instance (L3 if present, L2 if no L3 etc)
+     - CPU core model ("model" on x86, "CPU part" on arm)
+     - number of SMT per core
+   - speed if known:
+     - /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf if available
+     - or /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq +/- 10%
+
+Problem: on intel P+E, clusters of E cores share the same L2+L3, but each P
+core is alone on its L2 => poor grouping.
+
+Maybe one approach could be to characterize how L3/L2 are used. E.g. on the
+14900, we have:
+  - L3 0     => all cpus there
+  - L2 0..7  => 1C2T per L2
+  - L2 8..11 => 4C4T per L2
+  => it's obvious that CPUs connected to L2 #8..11 are not the same as those
+     on L2 #0..7. We could do something with that.
+  => it does not make sense to ditch the L2 distinction due to L3 being
+     present and the same, though it doesn't make sense to use L3 either.
+     Maybe elements with a cardinality of 1 should just be ignored. E.g.
+     cores per cache == 1 => ignore L2. Probably not true per die/pkg
+     though.
+  => replace absent or irrelevant info with "?"
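+
+A sketch of the cardinality rule for one cache level, assuming cores_per_inst[]
+and inst_id[] were computed from the shared_cpu_list files (remapped from CPUs
+to cores, as noted just below):
+
+  /* if this CPU's cache instance holds a single core, the level carries no
+   * grouping information for it and is replaced with "?" (-1 here).
+   */
+  int level_id_or_unknown(int cpu, const int *cores_per_inst, const int *inst_id)
+  {
+      if (cores_per_inst[cpu] > 1)
+          return inst_id[cpu];    /* e.g. the 4-core E clusters' L2 on the 14900 */
+      return -1;                  /* "?": e.g. the per-core L2 of the P cores    */
+  }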
+
+Note that for caches we have the list of CPUs, not the list of cores, so
+we need to remap that individually to cores.
+
+Warning: die_id, core_id etc are per socket, not per system. Worse, on Altra,
+core_id has gigantic values (multiples of +1 and +256). However core_cpus_list
+indicates other threads and could be a solution to create our own global core
+ID. Also, cluster_id=-1 found on all cores for A8040 on kernel 6.1.
+
+Note that LLC is always the first discriminator. But within a same LLC we can
+have the issues above (e.g. 14900).
+
+Would an intermediate approach like this work ?
+-----------------------------------------------
+  1) first split by LLC (also test with L3-less A8040, N2800, x5-8350)
+  2) within LLC, check if we have different cores (model, perf, freq?)
+     and resplit
+  3) divide again so that no group has more than 64 CPUs
+
+  => it looks like from the beginning that's what we're trying to do:
+     preserve locality first, then possibly trim down the number of cores
+     if some don't bring sufficient benefit. It possibly avoids the need
+     to identify dies etc. It still doesn't completely solve the 14900
+     though.
+
+Multi-die CPUs worth checking:
+  - Pentium-D (Presler, Dempsey: two 65nm dies)
+  - Core2Quad Q6600/Q6700 (Kentsfield, Clovertown: two 65nm dual-core dies)
+  - Core2Quad Q8xxx/Q9xxx (Yorkfield, Harpertown, Tigerton: two 45nm dual-core dies)
+  - atom 330 ("diamondville") is really a dual-die
+  - note that atom x3-z8350 ("cherry trail"), N2800 ("cedar trail") and D510
+    ("pine trail") are single-die (verified) but have two L2 caches and no L3.
+  Note that these are apparently not identified as multi-die (Q6600 has die=0).
+
+It *seems* that in order to form groups we'll first have to sort by topology,
+and only after that sort by performance so as to choose preferred CPUs.
+Otherwise we could end up forming inter-socket CPU groups when we're forced
+to merge adjacent CPUs of the sorted list because there are too many
+candidate groups.
+
+
+2025-01-07
+----------
+
+What is needed in fact is to act along two directions:
+
+  - binding restrictions: the user doesn't want the process to run on the
+    second node, on efficient cores, or on the second thread of each core,
+    so they're indicating where (not) to bind. This is a strict choice,
+    and it overrides taskset. That's the process-wide cpu-map.
+
+  - user preferences / execution profile: the user expresses their wishes
+    about how to allocate resources. This is only a binding order strategy
+    among a few existing ones that help easily decide which cores to select.
+    In this case CPUs are not enumerated. We can imagine choices as:
+
+      - full : use all permitted cores
+      - performance: use all permitted performance cores (all sockets)
+      - single-node (like today): use all cores of a single node
+      - balanced: use a reasonable amount of perf cores (e.g. all perf
+        cores of a single socket)
+      - resources: use a single cluster of efficient cores
+      - minimal: use a single efficient core
+
+By sorting CPUs first on the performance, then applying the filtering based on
+the profile to eliminate more CPUs, then applying the limit on the desired max
+number of threads, then sorting again on the topology, it should be possible to
+draw a list of usable CPUs that can then be split in groups along the L3s.
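+
+A rough sketch of that pipeline (function and variable names are assumptions,
+not the actual implementation):
+
+    /* 1) order the candidates by decreasing performance */
+    sort_cpus(cpus, nbcpu, cmp_by_perf);
+
+    /* 2) drop the CPUs excluded by the selected profile
+     *    (e.g. second socket, efficient cores, second SMT thread)
+     */
+    nbcpu = filter_by_profile(cpus, nbcpu, profile);
+
+    /* 3) keep no more than the desired number of threads */
+    if (nbcpu > max_threads)
+        nbcpu = max_threads;
+
+    /* 4) re-order the survivors by topology so that neighbours end up
+     *    adjacent, then cut along L3 boundaries to form the groups
+     */
+    sort_cpus(cpus, nbcpu, cmp_by_topology);
+    nbgroups = split_along_l3(cpus, nbcpu, groups);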
+
+It even sounds likely that the CPU profile or allocation strategy will affect
+the first sort method. E.g:
+  - full: no sort needed though we'll use the same as perf so as to enable
+    the maximum possible high-perf threads when #threads is limited
+  - performance: we should probably invert the topology sort so as to maximize
+    memory bandwidth across multiple sockets, i.e. visit node1.core0 just
+    after node0.core0 etc, and visit their threads later.
+  - bandwidth: that could be the same as "performance" one above in fact
+  - (low-)latency: better stay local first
+  - balanced: sort by perf then sockets (i.e. P0, P1, E0, E1)
+  - resources: sort on perf first.
+  - etc
+
+The strategy will also help determine the number of threads when it's not fixed
+in the configuration.
+
+Plan:
+  1) make the profile configurable and implement the sort:
+     - option name? cpu-tuning, cpu-strategy, cpu-policy, cpu-allocation,
+       cpu-selection, cpu-priority, cpu-optimize-for, cpu-prefer, cpu-favor,
+       cpu-profile
+       => cpu-selection
+
+  2) make the process-wide cpu-map configurable
+  3) extend cpu-map to make it possible to designate symbolic groups
+     (e.g. "ht0/ht1, node 0, 3*CCD, etc)
+
+Also, offering an option to the user to see how haproxy sees the CPUs and the
+bindings for various profiles would be a nice improvement helping them make
+educated decisions instead of trying blindly.
+
+2025-01-11
+----------
+
+Configuration profile: there are multiple dimensions:
+  - preferences between core types
+  - never use a given cpu type
+  - never use a given cpu location
+
+Better use something like:
+  - ignore-XXX   -> never use XXX
+  - avoid-XXX    -> prefer not to use XXX
+  - prefer-XXX   -> prefer to use XXX
+  - restrict-XXX -> only use XXX
+
+"XXX" could be "single-threaded", "dual-threaded", "first-thread",
+"second-thread", "first-socket", "second-socket", "slowest", "fastest",
+"node-XXX" etc.
+
+We could then have:
+  - cpu-selection restrict-first-socket,ignore-slowest,...
+
+Then some of the keywords could simply be shortcuts for these.
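+
+A parsing sketch for such a "cpu-selection" directive (apply_rule() and the
+ACT_* values are invented for illustration, <string.h> is assumed):
+
+    /* split "restrict-first-socket,ignore-slowest,..." into (action, subject)
+     * pairs; returns < 0 on an unknown action prefix.
+     */
+    static int parse_cpu_selection(char *arg)
+    {
+        char *tok;
+
+        for (tok = strtok(arg, ","); tok; tok = strtok(NULL, ",")) {
+            if (!strncmp(tok, "ignore-", 7))
+                apply_rule(ACT_IGNORE, tok + 7);
+            else if (!strncmp(tok, "avoid-", 6))
+                apply_rule(ACT_AVOID, tok + 6);
+            else if (!strncmp(tok, "prefer-", 7))
+                apply_rule(ACT_PREFER, tok + 7);
+            else if (!strncmp(tok, "restrict-", 9))
+                apply_rule(ACT_RESTRICT, tok + 9);
+            else
+                return -1;
+        }
+        return 0;
+    }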
+
+2025-01-30
+----------
+Problem: we need to set the restrictions first to eliminate undesired CPUs,
+         then sort according to the desired preferences so as to pick what
+         is considered the best CPUs. So the preference really looks like
+         a different setting.
+
+More precisely, the final strategy involves multiple criteria. For example,
+let's say that the number of threads is set to 4 and we've restricted ourselves
+to using the first thread of each CPU core. We're on an EPYC 74F3, there are 3
+cores per CCX. One algorithm (resource) would create one group with 3 threads
+on the first CCX and 1 group of 1 thread on the next one, then let each of
+these threads bind to all the enabled CPU cores of their respective groups.
+Another algo (performance) would avoid sharing and would want to place one
+thread per CCX, causing the creation of 4 groups of 1 thread each. A third
+algo (balanced) would probably say that 4 threads require 2 CCX hence 2
+groups, thus there should be 2 threads per group, and it would bind 2 threads
+on all cores of the first CCX and the 2 remaining ones on the second.
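+
+The "balanced" arithmetic on this example can be sketched like this (plain
+integer math, not actual code from the tree):
+
+    /* 4 threads, 3 cores per CCX on the EPYC 74F3 */
+    groups  = (threads + cores_per_ccx - 1) / cores_per_ccx;  /* = 2 */
+    per_grp = threads / groups;                               /* = 2 */
+    /* => 2 groups of 2 threads, each bound to all cores of its own CCX */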
+
+And if the thread count is not set, these strategies will also do their best
+to figure the optimal count. Resource would probably use 1 core max, moderate
+one CCX max, balanced one node max, performance all of them.
+
+This means that these CPU selection strategies should provide multiple
+functions:
+  - how to sort CPUs
+  - how to decide how many threads are best within the imposed rules
+
+The other actions seem to only be static. This also means that "avoid" or
+"prefer" should maybe not be used in the end, even in the sorting algo ?
+
+Or maybe these are just enums or bits in a strategy and all are considered
+at the same time everywhere. For example the thread counting could consider
+the presence of "avoid-XXX" during the operations. But how to codify XXX is
+complicated then.
+
+Maybe a scoring system could work:
+  - default: all CPUs score = 1000
+  - ignore-XXX: foreach(XXX) set score to 0
+  - restrict-XXX: foreach(YYY not XXX), set score to 0
+  - avoid-XXX: foreach(XXX) score *= 0.8
+  - prefer-XXX: foreach(XXX) score *= 1.25
+
+This supports a CPU being avoided for up to ~30 different reasons before its
+score reaches zero and it gets permanently disabled, which is sufficient.
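+
+A minimal sketch of this scoring pass (struct and helper names are
+assumptions; integer math is used so that *4/5 and *5/4 stand for the 0.8 and
+1.25 factors):
+
+    /* rule_matches() is an assumed helper telling whether a CPU belongs to
+     * the set designated by a rule's "XXX" subject.
+     */
+    static void score_cpus(struct cpu_info *cpu, int nbcpu,
+                           const struct cpu_rule *rule, int nbrules)
+    {
+        int i, r;
+
+        for (i = 0; i < nbcpu; i++) {
+            cpu[i].score = 1000;
+            for (r = 0; r < nbrules; r++) {
+                int match = rule_matches(&cpu[i], &rule[r]);
+
+                switch (rule[r].action) {
+                case ACT_IGNORE:
+                    if (match)
+                        cpu[i].score = 0;
+                    break;
+                case ACT_RESTRICT:
+                    if (!match)
+                        cpu[i].score = 0;
+                    break;
+                case ACT_AVOID:
+                    if (match)
+                        cpu[i].score = cpu[i].score * 4 / 5;
+                    break;
+                case ACT_PREFER:
+                    if (match)
+                        cpu[i].score = cpu[i].score * 5 / 4;
+                    break;
+                }
+            }
+        }
+    }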
+
+Then sort according to score, pick at least min_thr CPUs, and keep adding CPUs
+as long as max_thr is not reached and the score stays >= 1000 (stop at "avoid"
+CPUs once min_thr is satisfied). This gives the thread count. It
+does not permit anything inter-CPU though. E.g. large vs medium vs small cores,
+or sort by locality or frequency. But maybe these ones would use a different
+strategy then and would use the score as a second sorting key (after which
+one?). Or maybe there would be 2 passes, one which avoids <1000 and another
+one which completes up to #min_thr including those <1000, in which case we
+never sort per score.
+
+We can do a bit better to respect the tgrp min/max as well: we can count what
+it implies in terms of number of tgrps (#LLC or clusters) and decide to refrain
+from adding threads which would exceed max_tgrp, but we'd possibly continue to
+add score<1000 CPUs until at least enough threads to reach min_tgrp.
+
+######## new captures ###########
+CIX-P1 / radxa Orion O6 (no topology exported):
+$ ~/haproxy/haproxy  -dc -f /dev/null
+grp=[1..12] thr=[1..12]
+first node = 0
+Note: threads already set to 12
+going to start with nbthread=12 nbtgroups=1
+[keep] thr=  0 -> cpu=  0 pk=00 no=-1 di=00 cl=000 ts=000 capa=1024
+[keep] thr=  1 -> cpu=  1 pk=00 no=-1 di=00 cl=000 ts=001 capa=278
+[keep] thr=  2 -> cpu=  2 pk=00 no=-1 di=00 cl=000 ts=002 capa=278
+[keep] thr=  3 -> cpu=  3 pk=00 no=-1 di=00 cl=000 ts=003 capa=278
+[keep] thr=  4 -> cpu=  4 pk=00 no=-1 di=00 cl=000 ts=004 capa=278
+[keep] thr=  5 -> cpu=  5 pk=00 no=-1 di=00 cl=000 ts=005 capa=905
+[keep] thr=  6 -> cpu=  6 pk=00 no=-1 di=00 cl=000 ts=006 capa=905
+[keep] thr=  7 -> cpu=  7 pk=00 no=-1 di=00 cl=000 ts=007 capa=866
+[keep] thr=  8 -> cpu=  8 pk=00 no=-1 di=00 cl=000 ts=008 capa=866
+[keep] thr=  9 -> cpu=  9 pk=00 no=-1 di=00 cl=000 ts=009 capa=984
+[keep] thr= 10 -> cpu= 10 pk=00 no=-1 di=00 cl=000 ts=010 capa=984
+[keep] thr= 11 -> cpu= 11 pk=00 no=-1 di=00 cl=000 ts=011 capa=1024
+########
+
+2025-02-25 - clarification on the configuration
+-----------------------------------------------
+
+The "two dimensions" above can in fact be summarized like this:
+
+  - exposing the ability for the user to perform the same as "taskset",
+    i.e. restrict the usage to a static subset of the CPUs. We could then
+    have "cpu-set only-node0", "0-39", "ignore-smt1", "ignore-little", etc.
+    => the user defines precise sets to be kept/evicted.
+
+  - then letting the user express what they want to do with the remaining
+    cores. This is a strategy/policy that is used to:
+      - count the optimal number of threads (when not forced), also keeping
+        in mind that it cannot be more than 32/64 * maxtgroups if set.
+      - sort CPUs by order of preference (for when threads are forced or
+        a thread-hard-limit is set).
+
+    It can partially overlap with the first one. For example, the default
+    strategy could be to focus on a single node. If the user has limited their
+    usage to cores of both nodes, the policy could still further limit this.
+    But this time it should only be a matter of sorting and preference, i.e.
+    nbthread and cpuset are respected. If a policy prefers the node with more
+    cores first, it will sort them according to this, and its algorithm for
+    counting cores will only be used if nbthread is not set, otherwise it may
+    very well end up on two nodes to respect the user's choice.
+
+And once all of this is done, thread groups should be formed based on the
+remaining topology. Similarly, if the number of tgroups is not set, the
+algorithm must try to propose one based on the topology and the maxtgroups
+setting (i.e. find a divisor of the #LLC that's lower than or equal to
+maxtgroups), otherwise the configured number of tgroups is respected. Then
+the number of LLCs will be divided by this number of tgroups, and as many
+threads as enabled CPUs of each LLC will be assigned to these respective
+groups.
+
+In the end we should have groups bound to cpu sets, and threads belonging
+to groups mapped to all accessible cpus of these groups.
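+
+A small sketch of the tgroup computation described above (names assumed, not
+actual code):
+
+    /* pick the number of thread groups: the largest divisor of the LLC
+     * count that does not exceed maxtgroups, unless the user forced one.
+     */
+    static int pick_nbtgroups(int nb_llc, int forced, int maxtgroups)
+    {
+        int grps;
+
+        if (forced)
+            return forced;
+
+        for (grps = maxtgroups; grps > 1; grps--)
+            if (nb_llc % grps == 0)
+                return grps;
+        return 1;
+    }
+
+    /* each group then covers nb_llc / nbtgroups LLCs and receives as many
+     * threads as there are enabled CPUs in those LLCs.
+     */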
+
+Note: clusters may be finer than LLCs because they could report finer
+information. We could have a big and a medium cluster share the same L3
+for example. However, not all boards report their cluster number (see CIX-P1
+above), whereas the capacity info still makes it possible to figure that out
+and should probably be used for it. At this point it would seem logical to say
+that the cluster number is re-adjusted based on the claimed capacity, at
+least to avoid accidentally mixing workloads on heterogeneous cores. But
+sorting by cluster number might not necessarily work if numbers are allocated
+randomly.
+So we might need a distinct metric that doesn't require to override the
+system's numbering, like a "set", "group", "team", "bond", "bunch", "club",
+"band", ... that would be first sorted based on LLC (and no finer), and
+second based on capacity, then on L2 etc. This way we should be able to
+respect topology when forming groups.
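+
+A comparator sketch for such an ordering (field names invented; "capa" as
+reported in the captures above), usable with qsort():
+
+    /* sort key: LLC first (and no finer), then claimed capacity so that
+     * heterogeneous cores behind a shared L3 are not mixed, then L2.
+     */
+    static int cmp_cpu_set(const void *pa, const void *pb)
+    {
+        const struct cpu_info *a = pa, *b = pb;
+
+        if (a->llc_id != b->llc_id)
+            return a->llc_id - b->llc_id;
+        if (a->capa != b->capa)
+            return b->capa - a->capa;   /* bigger cores first */
+        return a->l2_id - b->l2_id;
+    }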
+
+Note: we must only consider as LLC a cache level which covers more than one
+      core! Otherwise the real LLC probably exists and is unique/shared, but
+      is simply not reported.
+      => maybe this should be done very early when counting CPUs ?
+      We need to store the LLC level somewhere in the topo.