DOC: design-thoughts: commit numa-auto.txt
author    Willy Tarreau <w@1wt.eu>
          Tue, 5 Sep 2023 07:20:15 +0000 (09:20 +0200)
committer Willy Tarreau <w@1wt.eu>
          Fri, 14 Mar 2025 17:30:30 +0000 (18:30 +0100)
Lots of collected data and observations aggregated into a single commit
so as not to lose them. Some parts below come from several commit
messages and are incremental.

Add captures and analysis of intel 14900 where it's not easy to draw
the line between the desired P and E cores.

The 14900 raises some questions (imagine a dual-die variant in multi-socket).
That's the start of an algorithmic distribution of performance cores into
thread groups.

cpu-map currently conflicts a lot with the choices made after auto-detection,
but it doesn't have to. The problem is the inability to configure the CPU
bindings for the whole process the way taskset does. By offering this ability
we can also start to designate groups of CPUs symbolically (package, die,
ccx, cores, smt).

It can also be useful to exploit the info from cpuinfo that is not
available in /sys, such as the model number. At least on arm, higher
numbers indicate bigger cores and can be useful to distinguish cores
inside a cluster. It will not indicate big vs medium ones of the same
type (e.g. a78 3.0 vs 2.4 GHz) but can still be effective at identifying
the efficient ones.

In short, information such as the cluster ID is not always reliable, and is
local to the package; die_id as well. The die number is not reported here
but should definitely be used, with a higher priority than L3.

We're still missing a discriminant between the L3 and cluster number
in order to address heterogeneous CPUs (e.g. intel 14900), though in
terms of locality that's currently done correctly.

CPU selection is also a full topic, and some thoughts were noted
regarding sorting by perf vs locality so as never to mix inter-
socket CPUs due to sorting.

The proposed cpu-selection cannot work as-is, because it acts both on
restriction and preference, and these two are not independent actions but a
sequence. First the restrictions must be enforced, and second the remaining
CPUs are sorted according to the preferred criterion, and a number of threads
is selected.

Currently we refine the OS-exposed cluster number but it's not correct
as we can end up with something poorly numbered. We need to respect the
LLC in any case so let's explain the approach.

doc/design-thoughts/numa-auto.txt [new file with mode: 0644]

diff --git a/doc/design-thoughts/numa-auto.txt b/doc/design-thoughts/numa-auto.txt
new file mode 100644 (file)
index 0000000..c58695b
--- /dev/null
@@ -0,0 +1,1458 @@
+2023-07-04 - automatic grouping for NUMA
+
+
+Xeon: (W2145)
+
+willy@debian:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+Wtap: i7-8650U
+
+willy@wtap:~ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+pcw: i7-6700k
+
+willy@pcw:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+nfs: N5105, v5.15
+
+willy@nfs:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+eeepc: Atom N2800, 5.4 : no L3, L2 not shared.
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+
+dev13: Ryzen 2700X
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+dev12: Ryzen 5800X
+
+haproxy@dev12:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+amd24: EPYC 74F3
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-2,24-26
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:6-8,30-32
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,24
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-47
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,24
+
+
+xeon24: Gold 6212U
+
+willy@mt01:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:0-47
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+SPR 8480+
+
+$ grep -a '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-55,112-167
+/sys/devices/system/node/node1/cpulist:56-111,168-223
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,112
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,112
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/cache/*/shared_cpu_list
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-55,112-167
+
+
+UP Board - Atom X5-8350 : no L3, exactly like Armada8040
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu{0,1,2,3}/cache/index2/*list
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_list:2-3
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+Atom D510 - kernel 2.6.33
+
+$ strings -fn1 sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index0/type: Data
+sys/devices/system/cpu/cpu0/cache/index1/type: Instruction
+sys/devices/system/cpu/cpu0/cache/index2/type: Unified
+
+$ strings -fn1 sys/devices/system/cpu/cpu?/topology/*list
+sys/devices/system/cpu/cpu0/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu0/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu1/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu1/topology/thread_siblings_list: 1,3
+sys/devices/system/cpu/cpu2/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu2/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu3/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu3/topology/thread_siblings_list: 1,3
+
+mcbin: Armada 8040 : no L3, no difference with L3 not reported
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/monolithic: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/Hemisphere: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@ampere:~$ grep '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-39
+/sys/devices/system/node/node1/cpulist:40-79
+
+
+LX2A: LX2160A => L3 not reported
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-15
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-15
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Rock5B: RK3588 (big-little A76+A55)
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu{0,4,6}/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+/sys/devices/system/cpu/cpu4/topology/core_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:4-5
+/sys/devices/system/cpu/cpu4/topology/die_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/package_cpus_list:4-5
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+/sys/devices/system/cpu/cpu6/topology/core_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/core_siblings_list:6-7
+/sys/devices/system/cpu/cpu6/topology/die_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/package_cpus_list:6-7
+/sys/devices/system/cpu/cpu6/topology/thread_siblings_list:6
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:414
+/sys/devices/system/cpu/cpu1/cpu_capacity:414
+/sys/devices/system/cpu/cpu2/cpu_capacity:414
+/sys/devices/system/cpu/cpu3/cpu_capacity:414
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+/sys/devices/system/cpu/cpu6/cpu_capacity:1024
+/sys/devices/system/cpu/cpu7/cpu_capacity:1024
+
+
+Firefly: RK3399 (2xA72 + 4xA53) kernel 6.1.28
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu0/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu0/cache/index?/type: No such file or directory
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+root@firefly:~# dmesg|grep cacheinfo
+[    0.006290] cacheinfo: Unable to detect cache hierarchy for CPU 0
+[    0.016339] cacheinfo: Unable to detect cache hierarchy for CPU 1
+[    0.017692] cacheinfo: Unable to detect cache hierarchy for CPU 2
+[    0.019050] cacheinfo: Unable to detect cache hierarchy for CPU 3
+[    0.020478] cacheinfo: Unable to detect cache hierarchy for CPU 4
+[    0.021660] cacheinfo: Unable to detect cache hierarchy for CPU 5
+[    1.990108] cacheinfo: Unable to detect cache hierarchy for CPU 0
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus:0f
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/cluster_id:0
+/sys/devices/system/cpu/cpu0/topology/core_cpus:01
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:3f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-5
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-5
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:381
+/sys/devices/system/cpu/cpu1/cpu_capacity:381
+/sys/devices/system/cpu/cpu2/cpu_capacity:381
+/sys/devices/system/cpu/cpu3/cpu_capacity:381
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+
+
+VIM3L: S905D3 (4*A55), kernel 5.14.10
+
+$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_cpus:1
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus:1
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/die_id:-1
+/sys/devices/system/cpu/cpu0/topology/package_cpus:f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:1
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:1024
+/sys/devices/system/cpu/cpu1/cpu_capacity:1024
+/sys/devices/system/cpu/cpu2/cpu_capacity:1024
+/sys/devices/system/cpu/cpu3/cpu_capacity:1024
+
+
+Odroid-N2: S922X (4*A73 + 2*A53), kernel 4.9.254
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+willy@n2:~$ sudo dmesg|grep -i 'cache hi'
+[    0.649924] Unable to detect cache hierarchy for CPU 0
+
+No capacity.
+
+Note that it reports 2 physical packages!
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:03
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-1
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu4/topology/*
+/sys/devices/system/cpu/cpu4/topology/core_id:2
+/sys/devices/system/cpu/cpu4/topology/core_siblings:3c
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:2-5
+/sys/devices/system/cpu/cpu4/topology/physical_package_id:1
+/sys/devices/system/cpu/cpu4/topology/thread_siblings:10
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+
+StarFive VisionFive2 - JH7110, kernel 5.15
+
+willy@starfive:~/haproxy$ ./haproxy -c -f cps3.cfg
+thr   0 -> cpu   0  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=000 l1=000
+thr   1 -> cpu   1  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=001 l1=001
+thr   2 -> cpu   2  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=002 l1=002
+thr   3 -> cpu   3  onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=003 l1=003
+Configuration file is valid
+
+Graviton2 / Graviton3 ?
+
+
+On PPC64 not everything is available:
+
+  https://www.ibm.com/docs/en/linux-on-systems?topic=cpus-cpu-topology
+
+  /sys/devices/system/cpu/cpu<N>/topology/thread_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/core_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/book_siblings
+  /sys/devices/system/cpu/cpu<N>/topology/drawer_siblings
+
+  # lscpu -e
+  CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
+  0   1    0      0    0      0    0:0:0:0         yes    yes        horizontal   0
+  1   1    0      0    0      0    1:1:1:1         yes    yes        horizontal   1
+  2   1    0      0    0      1    2:2:2:2         yes    yes        horizontal   2
+  3   1    0      0    0      1    3:3:3:3         yes    yes        horizontal   3
+  4   1    0      0    0      2    4:4:4:4         yes    yes        horizontal   4
+  5   1    0      0    0      2    5:5:5:5         yes    yes        horizontal   5
+  6   1    0      0    0      3    6:6:6:6         yes    yes        horizontal   6
+  7   1    0      0    0      3    7:7:7:7         yes    yes        horizontal   7
+  8   0    1      1    1      4    8:8:8:8         yes    yes        horizontal   8
+  ...
+
+Intel E5-2600v2/v3 has two L3:
+   https://www.enterpriseai.news/2014/09/08/intel-ups-performance-ante-haswell-xeon-chips/
+
+More info on these, and s390's "books" (mostly L4 in fact):
+   https://groups.google.com/g/fa.linux.kernel/c/qgAxjYq8ohI
+
+########################################
+Analysis:
+  - some server ARM CPUs (Altra, LX2) do not return any L3 info though they
+    DO have some. They stop at L2.
+
+  - other CPUs like Atom N2800 and Armada 8040 do not have L3.
+
+  => there's no apparent way to detect that the server CPUs do have an L3.
+  => or maybe we should consider that it's more likely that there is one
+     than none ? Armada works much better with groups than without. It's
+     basically the same topology as N2800.
+
+  => Do we really care then ? No L3 = same L3 for everyone. The problem is
+     that those really without an L3 will make a difference on L2 while the
+     others will not. Maybe we should consider that it does not make sense
+     to cut groups on L2 (i.e. under no circumstances will we have one group
+     per core).
+
+  => This would mean:
+       - regardless of L3, consider LLC. If the LLC has more than one
+         core per instance, it's likely the last one (not true on LX2
+         but better use 8 groups of 2 than nothing).
+
+       - otherwise if there's a single core per instance, it's unlikely
+         to be the LLC so we can imagine the LLC is unified. Note that
+         some systems such as LX2/Armada8K (and Neoverse-N1 devices as
+         well) may have 2 cores per L2, yet this doesn't allow us to infer
+         anything regarding the absence of an L3. Core2-quad has 2 cores
+         per L2 with no L3, like Armada8K. LX2 has 2 cores per L2 yet does
+         have an L3 which is not necessarily reported.
+
+       - this needs to be done per {node,package} !
+         => core_siblings and thread_siblings seem to be the only portable
+            ones to figure packages and threads
+
+At the very least, when multiple nodes are possibly present, there is a
+symlink "node0", "node1" etc in the cpu entry. It requires a lookup for each
+cpu directory though while reading /sys/devices/system/node/node*/cpulist is
+much cheaper.
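+
+A minimal sketch (not haproxy code, just an illustration) of the cheaper
+approach: one short read per node instead of one directory lookup per CPU.
+The parse_cpu_list() helper is hypothetical and only handles the "a-b,c"
+syntax seen in the captures above.
+
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <string.h>
+
+  #define MAX_CPUS 1024
+
+  /* hypothetical helper: mark cpu_node[cpu] = node for each CPU in a "0-3,8" list */
+  static void parse_cpu_list(const char *list, int node, int *cpu_node)
+  {
+      char buf[4096], *tok, *dash;
+
+      strncpy(buf, list, sizeof(buf) - 1);
+      buf[sizeof(buf) - 1] = 0;
+      for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
+          int lo = atoi(tok), hi = lo;
+
+          if ((dash = strchr(tok, '-')))
+              hi = atoi(dash + 1);
+          while (lo <= hi && lo < MAX_CPUS)
+              cpu_node[lo++] = node;
+      }
+  }
+
+  int main(void)
+  {
+      int cpu_node[MAX_CPUS], node, cpu;
+      char path[128], line[4096];
+      FILE *f;
+
+      for (cpu = 0; cpu < MAX_CPUS; cpu++)
+          cpu_node[cpu] = -1;              /* -1 = not listed by any node */
+
+      /* one small file per node, much cheaper than a per-cpu lookup */
+      for (node = 0; ; node++) {
+          snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
+          if (!(f = fopen(path, "r")))
+              break;
+          if (fgets(line, sizeof(line), f))
+              parse_cpu_list(line, node, cpu_node);
+          fclose(f);
+      }
+
+      for (cpu = 0; cpu < MAX_CPUS; cpu++)
+          if (cpu_node[cpu] >= 0)
+              printf("cpu %d -> node %d\n", cpu, cpu_node[cpu]);
+      return 0;
+  }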
+
+There's some redundancy in this. Probably a better approach:
+
+1) if there is more than 1 CPU:
+  - if cache/index3 exists, use its cpulist to pre-group entries.
+  - else if topology or node exists, use (node,package,die,core_siblings) to
+    group entries
+  - else pre-create a single large group
+
+2) if there is more than 1 CPU and less than max#groups:
+  - for each group, if no cache/index3 exists and cache/index2 exists and some
+    index2 entries contain at least two CPUs of different cores or a single one
+    for a 2-core system, then use that to re-split the group.
+
+  - if in the end there are too many groups, remerge some of them (?) or stick
+    to the previous layout (?)
+
+  - if in the end there are too many CPUs in a group, cut as needed, if
+    possible with an integral result (/2, /3, ...)
+
+3) L1 cache / thread_siblings should be used to associate CPUs by cores in
+   the same groups.
+
+Maybe instead it should be done bottom->top by collecting info and merging
+groups while keeping CPU lists ordered to ease later splitting (a sketch in C
+follows the steps below).
+
+  1) create a group per bound CPU
+  2) based on thread_siblings, detect CPUs that are on the same core, merge
+     their groups. They may not always create similarly sized groups.
+     => eg: epyc keeps 24 groups such as {0,24}, ...
+            ryzen 2700x keeps 4 groups such as {0,1}, ...
+            rk3588 keeps 3 groups {0-3},{4-5},{6-7}
+  3) based on cache index0/1, detect CPUs that are on the same L1 cache,
+     merge their groups. They may not always create similarly sized groups.
+  4) based on cache index2, detect CPUs that are on the same L2 cache, merge
+     their groups. They may not always create similarly sized groups.
+     => eg: mcbin now keeps 2 groups {0-1},{2,3}
+  5) At this point there may possibly be too many groups (still one per CPU,
+     e.g. when no cache info was found or there are many cores with their own
+     L2 like on SPR) or too large one (when all cores are indeed on the same
+     L2).
+
+     5.1) if there are as many groups as bound CPUs, merge them all together in
+          a single one => lx2, altra, mcbin
+     5.2) if there are still more than max#groups, merge them all together in a
+          single one since the splitting criterion is not relevant
+     5.3) if there is a group with too many CPUs, split it in two if integral,
+          otherwise 3, etc, trying to add the least possible number of groups.
+          If too difficult (e.g. result less than half the authorized max),
+          let's just round around N/((N+63)/64).
+     5.4) if at the end there are too many groups, warn that we can't optimize
+          the setup and are limiting ourselves to the first node or 64 CPUs.
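+
+A rough sketch in C of the bottom-up merge above (steps 1-4), using a
+union-find. It assumes the per-CPU sibling/cache lists were already parsed:
+ts_first[], l1_first[] and l2_first[] hold the first CPU listed in each CPU's
+thread_siblings_list, cache/index0 and cache/index2 shared_cpu_list (-1 when
+the file is absent). Only the grouping logic is shown, not haproxy's real
+data structures.
+
+  #include <stdio.h>
+
+  #define MAX_CPUS 256
+
+  static int grp[MAX_CPUS];      /* union-find parent: one group per CPU at start */
+
+  static int find(int c)
+  {
+      while (grp[c] != c)
+          c = grp[c] = grp[grp[c]];   /* path halving */
+      return c;
+  }
+
+  static void merge(int a, int b)
+  {
+      if (a >= 0 && b >= 0)
+          grp[find(a)] = find(b);
+  }
+
+  void make_groups(int nb_cpus, const int *ts_first, const int *l1_first,
+                   const int *l2_first)
+  {
+      int c, nbgrp = 0;
+
+      for (c = 0; c < nb_cpus; c++)
+          grp[c] = c;                       /* step 1: one group per bound CPU */
+
+      for (c = 0; c < nb_cpus; c++) {
+          merge(c, ts_first[c]);            /* step 2: same core (SMT siblings) */
+          merge(c, l1_first[c]);            /* step 3: same L1 cache            */
+          merge(c, l2_first[c]);            /* step 4: same L2 cache            */
+      }
+
+      for (c = 0; c < nb_cpus; c++)
+          if (find(c) == c)
+              nbgrp++;
+      printf("%d groups before step 5 (re-merge or split)\n", nbgrp);
+  }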
+
+Observations:
+ - lx2 definitely works better with everything bound together than by creating
+   8 groups (~130k rps vs ~120k rps)
+   => does this mean we should assume a unified L3 if there's no L3 info, and
+      remerge everything ? Likely Altra would benefit from this as well. mcbin
+      doesn't notice any change (within noise in both directions)
+
+ - on x86 13th gen, 2 P-cores and 8 E-cores. The P-cores support HT, not the
+   E-cores. There's no cpu_capacity there, but the cluster_id is properly set.
+   => proposal: when a machine reports both single-threaded cores and SMT,
+      consider the SMT ones bigger and use them.
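+
+A sketch of that proposal, assuming nb_siblings[cpu] was parsed from
+thread_siblings_list: on a machine mixing SMT-capable and single-threaded
+cores, the SMT-capable ones are assumed to be the big ones and preferred.
+
+  /* sketch only: prefer SMT-capable cores when the machine mixes both kinds */
+  int cpu_is_preferred(int cpu, const int *nb_siblings, int nb_cpus)
+  {
+      int c, has_smt = 0, has_st = 0;
+
+      for (c = 0; c < nb_cpus; c++) {
+          if (nb_siblings[c] > 1)
+              has_smt = 1;
+          else
+              has_st = 1;
+      }
+      if (has_smt && has_st)               /* mixed machine (e.g. 13th gen P+E) */
+          return nb_siblings[cpu] > 1;     /* SMT-capable => assumed P-core     */
+      return 1;                            /* homogeneous: keep everything      */
+  }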
+
+Problems: how should auto-detection interfere with user settings ?
+
+- Case 1: program started with a reduced taskset
+  => current: this serves to set the thread count first, and to map default
+     threads to CPUs if they are not assigned by a cpu-map.
+
+  => we want to keep that behavior (i.e. use all these threads) but only
+     change how the thread-groups are arranged.
+
+  - example: start on the first 6c12t of an EPYC74F3, should automatically
+    create 2 groups for the two sockets.
+
+  => should we brute-force all thread-groups combinations to figure how the
+     threads will spread over cpu-map and which one is better ? Or should we
+     decide to ignore input mapping as soon as there's at least one cpu-map?
+     But then which one to use ? Or should we consider that cpu-map only works
+     with explicit thread-groups ?
+
+- Case 2: taskset not involved, but nbthread and cpu-map in the config. In
+  fact a pretty standard 2.4-2.8 config.
+  => maybe the presence of cpu-map and no thread-groups should be sufficient
+     to imply a single thread-group to stay compatible ? Or maybe start as
+     many thread-groups as are referenced in cpu-map ? Seems like cpu-map and
+     thread-groups work hand-in-hand regarding topology since cpu-map
+     designates hardware CPUs so the user knows better than haproxy. Thus
+     why should we try to do better ?
+
+- Case 3: taskset not involved, nbthread not involved, cpu-map not involved,
+  only thread-groups
+  => seems like an ideal approach. Take all online CPUs and try to cut them
+     into equitable thread groups ? Or rather, since nbthread is not forced,
+     better sort the clusters and bind to the first N clusters only ? If too
+     many groups for the clusters, then try to refine them ?
+
+- Case 4: nothing specified at all (default config, target)
+  => current: uses only one thread-group with all threads (max 64).
+  => desired: bind only to performance cores and cut them in a few groups
+     based on l3, package, cluster etc.
+
+- Case 5: nbthread only in the config
+  => might match a docker use case. No group nor cpu-map configured. Figure
+     the best group usage respecting the thread count.
+
+- Case 6: some constraints are enforced in the config (e.g. threads-hard-limit,
+  one-thread-per-core, etc).
+  => like 3, 4 or 5 but with selection adjustment.
+
+- Case 7: thread-groups and generic cpu-map 1/all, 2/all... in the config
+  => user just wants to use cpu-map as a taskset alternative
+  => need to figure number of threads first, then cut them in groups like
+     today, and only then the cpu-map are found. Can we do better ? Not sure.
+     Maybe just when cpu-map is too lax (e.g. all entries reference the same
+     CPUs). Better use a special "cpu-map all/all 0-19" for this, but not
+     implemented for now.
+
+Proposal:
+  - if there is any cpu-map, disable automatic CPU assignment
+  - if there is any cpu-map, disable automatic thread group detection
+  - if taskset was forced, disable automatic CPU assignment
+
+### 2023-07-17 ###
+
+=> step  1: mark CPUs enabled at boot    (cpu_detect_usable)
+// => step  2: mark CPUs referenced in cpu-map => no, no real meaning
+=> step  3: identify all CPUs topologies + NUMA (cpu_detect_topology)
+
+=> step  4: if taskset && !cpu-map, mark all non-bound CPUs as unusable (UNAVAIL ?)
+            => which is the same as saying if !cpu-map.
+=> step  5: if !cpu-map, sort usable CPUs and find the best set to use
+//=> step  6: if cpu-map, mark all non-covered CPUs are unusable => not necessarily possible if partial cpu-map
+
+=> step  7: if thread-groups && cpu-map, nothing else to do
+=> step  8: if cpu-map && !thread-groups, thread-groups=1
+=> step  9: if thread-groups && !cpu-map, use that value to cut the thread set
+=> step 10: if !cpu-map && !thread-groups, detect the optimal thread-group count
+
+=> step 11: if !cpu-map, cut the thread set into mostly fair groups and assign
+            the group numbers to CPUs; create implicit cpu-maps.
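+
+Steps 7 to 10 boil down to a small decision ladder; a sketch with invented
+names (has_cpu_map, cfg_tgroups, optimal_tgroups()), not the real haproxy.c
+variables:
+
+  /* hypothetical helper computing the optimal count from the detected topology */
+  extern int optimal_tgroups(void);
+
+  int decide_tgroups(int has_cpu_map, int cfg_tgroups)
+  {
+      if (has_cpu_map && cfg_tgroups)
+          return cfg_tgroups;         /* step 7: nothing else to do            */
+      if (has_cpu_map)
+          return 1;                   /* step 8: cpu-map without thread-groups */
+      if (cfg_tgroups)
+          return cfg_tgroups;         /* step 9: use it to cut the thread set  */
+      return optimal_tgroups();       /* step 10: detect the optimal count     */
+  }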
+
+Ideas:
+  - use minthr and maxthr.
+    If nbthread, minthr=maxthr=nbthread, else if taskset_forced, maxthr=taskset_thr,
+    minthr=1, else minthr=1, maxthr=cpus_enabled.
+
+  - use CPU_F_ALLOWED (or DISALLOWED?) and CPU_F_REFERENCED and CPU_F_EXCLUDED ?
+    Note: cpu-map doesn't exclude, it only includes. Taskset does exclude. Also,
+    cpu-map only includes the CPUs that will belong to the correct groups & threads.
+
+  - Usual startup: taskset presets the CPU sets and sets the thread count. Tgrp
+    defaults to 1, then threads indicated in cpu-map get their CPU assigned.
+    Other ones are not changed. If we say that cpu-map => tgrp==1 then it means
+    we can infer automatic grouping for group 1 only ?
+      => it could be said that the CPUs of all enabled groups mentioned in
+         cpu-map are considered usable, but we don't know how many of these
+         will really have threads started on them.
+
+    => maybe completely ignore cpu-map instead (i.e. fall back to thread-groups 1) ?
+    => automatic detection would mean:
+         - if !cpu-map && !nbthrgrp => must automatically detect thgrp
+         - if !cpu-map => must automatically detect binding
+         - otherwise nothing
+
+Examples of problems:
+
+        thread-groups 4
+        nbthreads 128
+        cpu-map 1/all 0-63
+        cpu-map 2/all 128-191
+
+        => 32 threads per group, hence grp 1 uses 0-63 and grp 2 128-191,
+           grp 3 and grp 4 unknown, in practice on boot CPUs.
+
+        => could we demand that if one cpu-map is specified, then all groups
+           are covered ? Do we really need this after all ? i.e. let's just not
+           bind other threads and that's all (and what is written).
+
+
+Calls from haproxy.c:
+
+    cpu_detect_usable()
+    cpu_detect_topology()
+
++   thread_detect_count()
+         => compute nbtgroups
+         => compute nbthreads
+
+    thread_assign_cpus() ?
+
+    check_config_validity()
+
+
+BUGS:
+  - cpu_map[0].proc still used for the whole process in daemon mode (though not
+    in foreground mode)
+    -> whole process bound to thread group 1
+    -> binding not working in foreground
+
+  - cpu_map[x].proc ANDed with the thread's map despite thread's map apparently
+    never set
+    -> group binding ignored ?
+
+2023-09-05
+----------
+Remember to distinguish between sorting (used for grouping) and preference.
+We should avoid selecting the first CPUs as it encourages the use of wrong
+grouping criteria. E.g. CPU capacity has no business being used for grouping,
+it's used for selecting. Support for HT, however, does, because it allows
+packing threads of the same core together.
+
+We should also have an option to enable/disable SMT (e.g. max threads per core)
+so that we can skip siblings of cores already assigned. This can be convenient
+with the network running on the other sibling.
+
+
+2024-12-26
+----------
+
+Some interesting cases about intel 14900. The CPU has 8 P-cores and 16 E-cores.
+Experiments in the lab show excellent performance by binding the network to E
+cores and haproxy to P cores. Here's how the clusters are made:
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/package_cpus | sort |uniq -c
+     32 ffffffff
+
+  => expected
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/die_cpus | sort |uniq -c
+     32 ffffffff
+
+  => all CPUs on the same die
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/cluster_cpus | sort |uniq -c
+      2 00000003
+      2 0000000c
+      2 00000030
+      2 000000c0
+      2 00000300
+      2 00000c00
+      2 00003000
+      2 0000c000
+      4 000f0000
+      4 00f00000
+      4 0f000000
+      4 f0000000
+
+  => 1 "cluster" per core on each P-core (2 threads, 8 clusters total)
+  => 1 "cluster" per 4 E-cores (4 clusters total)
+  => It can be difficult to split that into groups by just using this topology.
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort |uniq -c
+     32 0-31
+
+  => everyone shares a uniform L3 cache
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_map | sort |uniq -c
+      2 00000003
+      2 0000000c
+      2 00000030
+      2 000000c0
+      2 00000300
+      2 00000c00
+      2 00003000
+      2 0000c000
+      4 000f0000
+      4 00f00000
+      4 0f000000
+      4 f0000000
+
+  => L2 is split like the respective "clusters" above.
+
+Seems like one would like to split them into 12 groups :-/  Maybe it still
+remains relevant to consider L3 for grouping, and core performance for
+selection (e.g. evict/prefer E-cores depending on policy).
+
+Differences between P and E cores on 14900:
+
+- acpi_cppc/*perf : pretty useful but not always there (e.g. aloha)
+- cache index0: 48 vs 32k (bigger CPU has bigger cache)
+- cache index1: 32 vs 64k (smaller CPU has bigger cache)
+- cache index2: 2 vs 4M, but dedicated per core vs shared per cluster (4 cores)
+
+=> probably the presence of a larger "cluster" with less cache per core on
+   average is an indication of smaller (efficiency) cores. Warning however,
+   some CPUs (e.g. S922X) have a large (4) cluster of big cores and a small
+   (2) cluster of little cores.
+
+
+diff -urN cpu0/acpi_cppc/lowest_nonlinear_perf cpu16/acpi_cppc/lowest_nonlinear_perf
+--- cpu0/acpi_cppc/lowest_nonlinear_perf        2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/lowest_nonlinear_perf       2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-20
++15
+diff -urN cpu0/acpi_cppc/nominal_perf cpu16/acpi_cppc/nominal_perf
+--- cpu0/acpi_cppc/nominal_perf 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/nominal_perf        2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/acpi_cppc/reference_perf cpu16/acpi_cppc/reference_perf
+--- cpu0/acpi_cppc/reference_perf       2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/reference_perf      2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/cache/index0/size cpu16/cache/index0/size
+--- cpu0/cache/index0/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index0/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-48K
++32K
+diff -urN cpu0/cache/index1/shared_cpu_list cpu16/cache/index1/shared_cpu_list
+--- cpu0/cache/index1/shared_cpu_list   2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_list  2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16
+diff -urN cpu0/cache/index1/shared_cpu_map cpu16/cache/index1/shared_cpu_map
+--- cpu0/cache/index1/shared_cpu_map    2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_map   2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++00010000
+diff -urN cpu0/cache/index1/size cpu16/cache/index1/size
+--- cpu0/cache/index1/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-32K
++64K
+diff -urN cpu0/cache/index2/shared_cpu_list cpu16/cache/index2/shared_cpu_list
+--- cpu0/cache/index2/shared_cpu_list   2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/shared_cpu_list  2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+--- cpu0/cache/index2/size      2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/size     2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-2048K
++4096K
+diff -urN cpu0/topology/cluster_cpus cpu16/topology/cluster_cpus
+--- cpu0/topology/cluster_cpus  2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++000f0000
+diff -urN cpu0/topology/cluster_cpus_list cpu16/topology/cluster_cpus_list
+--- cpu0/topology/cluster_cpus_list     2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus_list    2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+
+For acpi_cppc, the values differ between machines; it looks like nominal_perf
+is always usable:
+
+14900k:
+$ grep '' cpu8/acpi_cppc/*
+cpu8/acpi_cppc/feedback_ctrs:ref:85172004640 del:143944480100
+cpu8/acpi_cppc/highest_perf:255
+cpu8/acpi_cppc/lowest_freq:0
+cpu8/acpi_cppc/lowest_nonlinear_perf:20
+cpu8/acpi_cppc/lowest_perf:1
+cpu8/acpi_cppc/nominal_freq:3200
+cpu8/acpi_cppc/nominal_perf:40
+cpu8/acpi_cppc/reference_perf:40
+cpu8/acpi_cppc/wraparound_time:18446744073709551615
+
+$ grep '' cpu16/acpi_cppc/*
+cpu16/acpi_cppc/feedback_ctrs:ref:84153776128 del:112977352354
+cpu16/acpi_cppc/highest_perf:255
+cpu16/acpi_cppc/lowest_freq:0
+cpu16/acpi_cppc/lowest_nonlinear_perf:15
+cpu16/acpi_cppc/lowest_perf:1
+cpu16/acpi_cppc/nominal_freq:3200
+cpu16/acpi_cppc/nominal_perf:24
+cpu16/acpi_cppc/reference_perf:24
+cpu16/acpi_cppc/wraparound_time:18446744073709551615
+
+altra:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:227098452801 del:590247062111
+highest_perf:260
+lowest_freq:1000
+lowest_nonlinear_perf:200
+lowest_perf:100
+nominal_freq:2600
+nominal_perf:260
+reference_perf:100
+
+w3-2345:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:4775674480779 del:5675950973600
+highest_perf:45
+lowest_freq:0
+lowest_nonlinear_perf:8
+lowest_perf:5
+nominal_freq:0
+nominal_perf:31
+reference_perf:31
+wraparound_time:18446744073709551615
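+
+A sketch of a per-CPU performance hint based on the captures above: use
+acpi_cppc/nominal_perf when present and fall back to cpufreq/scaling_max_freq
+otherwise. read_cpu_sysfs() is just for the example.
+
+  #include <stdio.h>
+
+  /* read a single integer from /sys/devices/system/cpu/cpu<N>/<file>, -1 if absent */
+  static long read_cpu_sysfs(int cpu, const char *file)
+  {
+      char path[160];
+      long val = -1;
+      FILE *f;
+
+      snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/%s", cpu, file);
+      if ((f = fopen(path, "r"))) {
+          if (fscanf(f, "%ld", &val) != 1)
+              val = -1;
+          fclose(f);
+      }
+      return val;
+  }
+
+  /* rough performance hint for one CPU; -1 when neither source exists */
+  long cpu_perf_hint(int cpu)
+  {
+      long perf = read_cpu_sysfs(cpu, "acpi_cppc/nominal_perf");
+
+      if (perf <= 0)
+          perf = read_cpu_sysfs(cpu, "cpufreq/scaling_max_freq");
+      return perf;
+  }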
+
+Other approaches may consist in checking the CPU's max frequency via
+cpufreq, e.g on the N2:
+
+  $ grep . /sys/devices/system/cpu/cpu?/cpufreq/scaling_max_freq
+  /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2016000
+  /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2016000
+  /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:2400000
+  /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:2400000
+
+However on x86, the cores no longer all have the same frequency, like below on
+the W3-2345, so it cannot always be used to split them into groups; at best
+it may be used to sort them.
+
+  $ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
+  /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu9/cpufreq/scaling_max_freq:4500000
+  /sys/devices/system/cpu/cpu10/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu11/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu12/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu13/cpufreq/scaling_max_freq:4300000
+  /sys/devices/system/cpu/cpu14/cpufreq/scaling_max_freq:4400000
+  /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq:4300000
+
+On 14900, not cool either:
+
+  $ grep -h . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq|sort|uniq -c
+     16 4400000
+     12 5700000
+      4 6000000
+
+Considering that values that are within +/-10% of a cluster's min/max are still
+part of it would seem to work and would make a good rule of thumb.
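+
+One possible reading of that rule, as a sketch: sort the max frequencies in
+descending order and open a new class whenever a value drops more than 10%
+below the current class's highest frequency. On the samples above this gives
+2 classes on the 14900 (5.7-6.0 GHz vs 4.4 GHz) and a single class on the
+W3-2345.
+
+  #include <stdlib.h>
+
+  static int cmp_desc(const void *a, const void *b)
+  {
+      long fa = *(const long *)a, fb = *(const long *)b;
+      return (fa < fb) - (fa > fb);            /* highest frequency first */
+  }
+
+  /* count frequency classes using the +/-10% rule of thumb */
+  int count_freq_classes(long *freq, int nb)
+  {
+      int i, classes = 0;
+      long class_max = 0;
+
+      qsort(freq, nb, sizeof(*freq), cmp_desc);
+      for (i = 0; i < nb; i++) {
+          if (!classes || freq[i] < class_max - class_max / 10) {
+              classes++;                       /* >10% below: start a new class */
+              class_max = freq[i];
+          }
+      }
+      return classes;
+  }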
+
+On x86, the model number might help, here on w3-2345:
+
+  $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+     16 model           : 143
+
+But not always (here: 14900K with 8xP and 16xE):
+
+  $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+     32 model           : 183
+
+On ARM it's rather the part number:
+
+  # a9
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xc09
+  CPU part        : 0xc09
+
+  # a17
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+  CPU part        : 0xc0d
+
+  # a72
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+
+  # a53+a72
+  $ grep part /proc/cpuinfo
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd08
+  CPU part        : 0xd08
+
+  # a53+a73
+  $ grep 'part' /proc/cpuinfo
+  CPU part        : 0xd03
+  CPU part        : 0xd03
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+  CPU part        : 0xd09
+
+  # a55+a76
+  $ grep 'part' /proc/cpuinfo
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd05
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
+  CPU part        : 0xd0b
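+
+A sketch of turning this into a heterogeneity hint: count the distinct values
+of a /proc/cpuinfo field ("model" on x86, "CPU part" on arm); more than one
+distinct value suggests mixed core types. Purely illustrative parsing, with
+no attempt at robustness.
+
+  #include <stdio.h>
+  #include <string.h>
+
+  /* count distinct values of a /proc/cpuinfo field ("model", "CPU part", ...) */
+  int count_cpuinfo_values(const char *field)
+  {
+      char line[256], seen[16][64];
+      int nb = 0, i;
+      size_t flen = strlen(field);
+      FILE *f = fopen("/proc/cpuinfo", "r");
+
+      if (!f)
+          return -1;
+      while (fgets(line, sizeof(line), f)) {
+          char *colon = strchr(line, ':'), *p = line + flen;
+
+          if (!colon || strncmp(line, field, flen) != 0)
+              continue;
+          while (p < colon && (*p == ' ' || *p == '\t'))
+              p++;
+          if (p != colon)
+              continue;                /* e.g. skips "model name" for "model" */
+          line[strcspn(line, "\n")] = 0;
+          for (i = 0; i < nb; i++)
+              if (strcmp(seen[i], colon + 1) == 0)
+                  break;
+          if (i == nb && nb < 16) {
+              strncpy(seen[nb], colon + 1, sizeof(seen[nb]) - 1);
+              seen[nb][sizeof(seen[nb]) - 1] = 0;
+              nb++;
+          }
+      }
+      fclose(f);
+      return nb;   /* e.g. 1 on the w3-2345 above, 2 on an a53+a73 big.LITTLE */
+  }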
+
+
+2024-12-27
+----------
+
+Such machines with P+E cores are becoming increasingly common. Some like the
+CIX-P1 can even provide 3 levels of performance: 4 big cores (A720-2.8G), 4
+medium cores (A720-2.4G), 4 little cores (A520-1.8G). Architectures like below
+will become the norm, and can be used under different policies:
+
+                      +-----------------------------+
+                      |             L3              |
+                      +---+----------+----------+---+
+                          |          |          |
+                      +---+---+  +---+---+  +---+---+
+                      | P | P |  | E | E |  | E | E |
+                      +---+---+  +---+---+  +---+---+
+   Policy:            | P | P |  | E | E |  | E | E |
+   -------            +---+---+  +---+---+  +---+---+
+   1 group, min:         N/A         0          0
+   1 group, max:          0         N/A        N/A
+   1 group, all:          0          0          0
+   2 groups, min:        N/A         0          1
+   2 groups, full:        0          1          1
+   3 groups:              0          1          2
+
+In dual-socket or multiple dies it can even become more complicated:
+
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+       |          |          |
+   +---+----------+----------+---+
+   |            L3.0             |
+   +-----------------------------+
+
+   +-----------------------------+
+   |            L3.1             |
+   +---+----------+----------+---+
+       |          |          |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+
+
+Setting only a thread count would yield interesting things above:
+  1-4T: P.0
+  5-8T: P.0, P.1 (2 grp)
+ 9-16T: P.0, E.0, P.1, E.1 (3-4 grp)
+17-24T: PEE.0, PEE.1 (5-6 grp)
+
+With forced tgrp = 1:
+  - only fill node 0 first (P then PE, then PEE)
+
+With forced tgrp = 2:
+
+  def:  P.0, P.1
+  2-4T: P.0 only ?
+  6-8T: P.0, P.1
+ 9-24T: PEE.0, PEE.1
+
+With dual-socket, dual-die, it becomes:
+
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+       |          |          |      '      |          |          |
+   +---+----------+----------+---+  '  +---+----------+----------+---+
+   |            L3.0.0           |  '  |            L3.1.0           |
+   +-----------------------------+  '  +-----------------------------+
+                                    '
+   +-----------------------------+  '  +-----------------------------+
+   |            L3.0.1           |  '  |            L3.1.1           |
+   +---+----------+----------+---+  '  +---+----------+----------+---+
+       |          |          |      '      |          |          |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+   | P | P |  | E | E |  | E | E |  '  | P | P |  | E | E |  | E | E |
+   +---+---+  +---+---+  +---+---+  '  +---+---+  +---+---+  +---+---+
+
+In such conditions, it could make sense to first enumerate all the available
+cores with all their characteristics, and distribute them between "buckets"
+representing the thread groups (a sketch follows the list):
+
+  1. create the min number of tgrp (tgrp.min)
+  2. it's possible to automatically create more until tgrp.max
+  -> cores are sorted by performance then by proximity. They're
+     distributed in order into existing buckets, and if too distant,
+     then new groups are created. It could allow for example to use
+     all P-cores in the DSDD model above, split into 4 tgrp.
+  -> the total number of threads is then discovered at the end.
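+
+A sketch of that distribution, assuming the cores were already sorted by
+performance then proximity, and that a too_distant() placeholder implements
+the locality test (different LLC/die/socket). All names are invented for the
+illustration.
+
+  struct bucket { int first_core; int nb_cores; };
+
+  /* placeholder for the locality test (different LLC/die/socket) */
+  extern int too_distant(int core_a, int core_b);
+
+  /* NB: balancing cores across the tgrp.min buckets is left out of this sketch */
+  int distribute(const int *cores, int nb_cores, int tgrp_min, int tgrp_max,
+                 struct bucket *bkt)
+  {
+      int nbgrp = 0, i, g;
+
+      for (g = 0; g < tgrp_min; g++)            /* 1. start with tgrp.min buckets */
+          bkt[nbgrp++] = (struct bucket){ .first_core = -1, .nb_cores = 0 };
+
+      for (i = 0; i < nb_cores; i++) {
+          for (g = 0; g < nbgrp; g++)
+              if (bkt[g].first_core < 0 || !too_distant(cores[i], bkt[g].first_core))
+                  break;
+          if (g == nbgrp) {                     /* too distant from every bucket */
+              if (nbgrp >= tgrp_max)
+                  continue;                     /* no room left: skip this core  */
+              bkt[nbgrp++] = (struct bucket){ .first_core = -1, .nb_cores = 0 };
+          }
+          if (bkt[g].first_core < 0)
+              bkt[g].first_core = cores[i];
+          bkt[g].nb_cores++;                    /* total thread count found at the end */
+      }
+      return nbgrp;                             /* 2. may have grown up to tgrp.max */
+  }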
+
+
+It seems in the end that such binding policies (P, E, single/multi dies,
+single/multi sockets etc) should be made more accessible to the user. What
+we're missing in "cpu-map" is in fact the ability to apply to the whole
+process, so that it can supersede taskset. Indeed, right now, cpu-map requires
+too many details and that's why it often remains easier to deal with taskset,
+particularly when dealing with thread groups.
+
+We can revisit the situation differently. First, let's keep in mind that
+cpu-map is a restriction. It means "use no more than these", it does not
+mean "use all of these". So it totally makes sense to use it to replace
+taskset at the process level without interfering with groups detection.
+We could then have:
+
+  - "cpu-map all|process|global|? ..."  to apply to the whole process
+  - then special keywords for the CPUs designation, among:
+     - package (socket) number
+     - die number (CCD)
+     - L3 number (CCX)
+     - cluster type (big/performant, medium, little/efficient)
+     - use of SMT or not, and which ones
+  - maybe optional numbers before these to indicate (any two of them),
+    e.g. "4P" to indicate "4 performance cores".
+
+Question: how would we designate "only P cores of socket 0" ? Or
+          "only thread 0 of all P cores" ?
+
+One benefit of such a declaration method is that it can make nbthread often
+useless and automatic while still portable across a whole fleet of servers. E.g.
+if "cpu-map all S0P*T0" would designate thread 0 of all P-cores of socket 0, it
+would mean the same on all machines.
+
+Another benefit is that we can make cpu-map and automatic detection more
+exclusive:
+  - cpu-map all => equivalent of taskset, leaves auto-detection on
+  - cpu-map thr => disables auto-detection
+
+So in the end:
+  - cpu-map all restricts the CPUs the process may use
+    -> auto-detection starts from here and sorts them
+  - thread-groups offers more "buckets" to arrange distant CPUs in the
+    same process
+  - nbthread limits the number of threads we'll use
+    -> pick the most suited ones (at least thr.min, at most thr.max)
+       and distribute them optimally among the number of thread groups.
+
+One question remains: is it always possible to automatically configure
+thread-groups ? Maybe it's possible after the detection to set an optimal
+one between grp.min and grp.max ? (e.g. socket count, core types, etc).
+
+It still seems that a policy such as "optimize-for resources|performance"
+would help quite a bit.
+
+-> what defines a match between a CPU core and a group:
+   - cluster identification:
+     - either cluster_cpus if present (which is only sometimes), or:
+     - pkg+die+ccd number
+     - same LLC instance (L3 if present, L2 if no L3 etc)
+     - CPU core model ("model" on x86, "CPU part" on arm)
+     - number of SMT per core
+   - speed if known:
+     - /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf if available
+     - or /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq +/- 10%
+
+Problem: on intel P+E, clusters of E cores share the same L2+L3, but each P
+core is alone on its L2 => poor grouping.
+
+Maybe one approach could be to characterize how L3/L2 are used. E.g. on the
+14900, we have:
+  - L3 0     => all cpus there
+  - L2 0..7  => 1C2T per L2
+  - L2 8..11 => 4C4T per L2
+  => it's obvious that CPUs connected to L2 #8..11 are not the same as those
+     on L2 #0..7. We could do something with that.
+  => it does not make sense to ditch the L2 distinction due to L3 being
+     present and the same, though it doesn't make sense to use L3 either.
+     Maybe elements with a cardinality of 1 should just be ignored. E.g.
+     cores per cache == 1 => ignore L2. Probably not true per die/pkg
+     though.
+  => replace absent or irrelevant info with "?"
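+
+A sketch of the cardinality rule for one cache level, assuming cores_per_inst[]
+and inst_id[] were computed from the shared_cpu_list files (remapped from CPUs
+to cores, as noted just below):
+
+  /* if this CPU's cache instance holds a single core, the level carries no
+   * grouping information for it and is replaced with "?" (-1 here).
+   */
+  int level_id_or_unknown(int cpu, const int *cores_per_inst, const int *inst_id)
+  {
+      if (cores_per_inst[cpu] > 1)
+          return inst_id[cpu];    /* e.g. the 4-core E clusters' L2 on the 14900 */
+      return -1;                  /* "?": e.g. the per-core L2 of the P cores    */
+  }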
+
+Note that for caches we have the list of CPUs, not the list of cores, so
+we need to remap that individually to cores.
+
+Warning: die_id, core_id etc are per socket, not per system. Worse, on Altra,
+core_id has gigantic values (multiples of +1 and +256). However core_cpus_list
+indicates other threads and could be a solution to create our own global core
+ID. Also, cluster_id=-1 found on all cores for A8040 on kernel 6.1.
+
+Note that LLC is always the first discriminator. But within a same LLC we can
+have the issues above (e.g. 14900).
+
+Would an intermediate approach like this work ?
+-----------------------------------------------
+  1) first split by LLC (also test with L3-less A8040, N2800, x5-8350)
+  2) within LLC, check if we have different cores (model, perf, freq?)
+     and resplit
+  3) divide again so that no group has more than 64 CPUs
+
+  => it looks like from the beginning that's what we're trying to do:
+     preserve locality first, then possibly trim down the number of cores
+     if some don't bring sufficient benefit. It possibly avoids the need
+     to identify dies etc. It still doesn't completely solve the 14900
+     though.
+
+Multi-die CPUs worth checking:
+  - Pentium-D (Presler, Dempsey: two 65nm dies)
+  - Core2Quad Q6600/Q6700 (Kentsfield, Clovertown: two 65nm dual-core dies)
+  - Core2Quad Q8xxx/Q9xxx (Yorkfield, Harpertown, Tigerton: two 45nm dual-core dies)
+  - atom 330 ("diamondville") is really a dual-die
+  - note that atom x3-z8350 ("cherry trail"), N2800 ("cedar trail") and D510
+    ("pine trail") are single-die (verified) but have two L2 caches and no L3.
+  Note that these are apparently not identified as multi-die (Q6600 has die=0).
+
+It *seems* that in order to form groups we'll first have to sort by topology,
+and only after that sort by performance so as to choose preferred CPUs.
+Otherwise we could end up forming inter-socket CPU groups when we're forced
+to merge adjacent CPUs of the sorted list because there are too many
+candidate groups.
+
+
+2025-01-07
+----------
+
+What is needed in fact is to act along two directions:
+
+  - binding restrictions: the user doesn't want the process to run on the
+    second node, on efficient cores, or on the second thread of each core,
+    so they're indicating where (not) to bind. This is a strict choice,
+    and it overrides taskset. That's the process-wide cpu-map.
+
+  - user preferences / execution profile: the user expresses their wishes
+    about how to allocate resources. This is only a binding order strategy
+    among a few existing ones that help easily decide which cores to select.
+    In this case CPUs are not enumerated. We can imagine choices as:
+
+      - full : use all permitted cores
+      - performance: use all permitted performance cores (all sockets)
+      - single-node (like today): use all cores of a single node
+      - balanced: use a reasonable amount of perf cores (e.g. all perf
+        cores of a single socket)
+      - resources: use a single cluster of efficient cores
+      - minimal: use a single efficient core
+
+By sorting CPUs first on the performance, then applying the filtering based on
+the profile to eliminate more CPUs, then applying the limit on the desired max
+number of threads, then sorting again on the topology, it should be possible to
+draw a list of usable CPUs that can then be split in groups along the L3s.
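+
+A rough sketch of that pipeline (function and variable names are assumptions,
+not the actual implementation):
+
+    /* 1) order the candidates by decreasing performance */
+    sort_cpus(cpus, nbcpu, cmp_by_perf);
+
+    /* 2) drop the CPUs excluded by the selected profile
+     *    (e.g. second socket, efficient cores, second SMT thread)
+     */
+    nbcpu = filter_by_profile(cpus, nbcpu, profile);
+
+    /* 3) keep no more than the desired number of threads */
+    if (nbcpu > max_threads)
+        nbcpu = max_threads;
+
+    /* 4) re-order the survivors by topology so that neighbours end up
+     *    adjacent, then cut along L3 boundaries to form the groups
+     */
+    sort_cpus(cpus, nbcpu, cmp_by_topology);
+    nbgroups = split_along_l3(cpus, nbcpu, groups);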
+
+It even sounds likely that the CPU profile or allocation strategy will affect
+the first sort method. E.g:
+  - full: no sort needed though we'll use the same as perf so as to enable
+    the maximum possible high-perf threads when #threads is limited
+  - performance: we should probably invert the topology sort so as to maximize
+    memory bandwidth across multiple sockets, i.e. visit node1.core0 just
+    after node0.core0 etc, and visit their threads later.
+  - bandwidth: that could be the same as "performance" one above in fact
+  - (low-)latency: better stay local first
+  - balanced: sort by perf then sockets (i.e. P0, P1, E0, E1)
+  - resources: sort on perf first.
+  - etc
+
+The strategy will also help determine the number of threads when it's not fixed
+in the configuration.
+
+Plan:
+  1) make the profile configurable and implement the sort:
+     - option name? cpu-tuning, cpu-strategy, cpu-policy, cpu-allocation,
+       cpu-selection, cpu-priority, cpu-optimize-for, cpu-prefer, cpu-favor,
+       cpu-profile
+       => cpu-selection
+
+  2) make the process-wide cpu-map configurable
+  3) extend cpu-map to make it possible to designate symbolic groups
+     (e.g. "ht0/ht1, node 0, 3*CCD, etc)
+
+Also, offering an option to the user to see how haproxy sees the CPUs and the
+bindings for various profiles would be a nice improvement helping them make
+educated decisions instead of trying blindly.
+
+2025-01-11
+----------
+
+Configuration profile: there are multiple dimensions:
+  - preferences between core types
+  - never use a given cpu type
+  - never use a given cpu location
+
+Better use something like:
+  - ignore-XXX   -> never use XXX
+  - avoid-XXX    -> prefer not to use XXX
+  - prefer-XXX   -> prefer to use XXX
+  - restrict-XXX -> only use XXX
+
+"XXX" could be "single-threaded", "dual-threaded", "first-thread",
+"second-thread", "first-socket", "second-socket", "slowest", "fastest",
+"node-XXX" etc.
+
+We could then have:
+  - cpu-selection restrict-first-socket,ignore-slowest,...
+
+Then some of the keywords could simply be shortcuts for these.
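+
+A parsing sketch for such a "cpu-selection" directive (apply_rule() and the
+ACT_* values are invented for illustration, <string.h> is assumed):
+
+    /* split "restrict-first-socket,ignore-slowest,..." into (action, subject)
+     * pairs; returns < 0 on an unknown action prefix.
+     */
+    static int parse_cpu_selection(char *arg)
+    {
+        char *tok;
+
+        for (tok = strtok(arg, ","); tok; tok = strtok(NULL, ",")) {
+            if (!strncmp(tok, "ignore-", 7))
+                apply_rule(ACT_IGNORE, tok + 7);
+            else if (!strncmp(tok, "avoid-", 6))
+                apply_rule(ACT_AVOID, tok + 6);
+            else if (!strncmp(tok, "prefer-", 7))
+                apply_rule(ACT_PREFER, tok + 7);
+            else if (!strncmp(tok, "restrict-", 9))
+                apply_rule(ACT_RESTRICT, tok + 9);
+            else
+                return -1;
+        }
+        return 0;
+    }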
+
+2025-01-30
+----------
+Problem: we need to set the restrictions first to eliminate undesired CPUs,
+         then sort according to the desired preferences so as to pick what
+         is considered the best CPUs. So the preference really looks like
+         a different setting.
+
+More precisely, the final strategy involves multiple criteria. For example,
+let's say that the number of threads is set to 4 and we've restricted ourselves
+to using the first thread of each CPU core. We're on an EPYC 74F3, there are 3
+cores per CCX. One algorithm (resource) would create one group with 3 threads
+on the first CCX and 1 group of 1 thread on the next one, then let each of
+these threads bind to all the enabled CPU cores of their respective groups.
+Another algo (performance) would avoid sharing and would want to place one
+thread per CCX, causing the creation of 4 groups of 1 thread each. A third
+algo (balanced) would probably say that 4 threads require 2 CCX hence 2
+groups, thus there should be 2 threads per group, and it would bind 2 threads
+on all cores of the first CCX and the 2 remaining ones on the second.
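+
+The "balanced" arithmetic on this example can be sketched like this (plain
+integer math, not actual code from the tree):
+
+    /* 4 threads, 3 cores per CCX on the EPYC 74F3 */
+    groups  = (threads + cores_per_ccx - 1) / cores_per_ccx;  /* = 2 */
+    per_grp = threads / groups;                               /* = 2 */
+    /* => 2 groups of 2 threads, each bound to all cores of its own CCX */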
+
+And if the thread count is not set, these strategies will also do their best
+to figure the optimal count. Resource would probably use 1 core max, moderate
+one CCX max, balanced one node max, performance all of them.
+
+This means that these CPU selection strategies should provide multiple
+functions:
+  - how to sort CPUs
+  - how to decide how many threads are best within the imposed rules
+
+The other actions seem to only be static. This also means that "avoid" or
+"prefer" should maybe not be used in the end, even in the sorting algo ?
+
+Or maybe these are just enums or bits in a strategy and all are considered
+at the same time everywhere. For example the thread counting could consider
+the presence of "avoid-XXX" during the operations. But how to codify XXX is
+complicated then.
+
+Maybe a scoring system could work:
+  - default: all CPUs score = 1000
+  - ignore-XXX: foreach(XXX) set score to 0
+  - restrict-XXX: foreach(YYY not XXX), set score to 0
+  - avoid-XXX: foreach(XXX) score *= 0.8
+  - prefer-XXX: foreach(XXX) score *= 1.25
+
+This supports a CPU being avoided for up to ~30 different reasons before its
+score reaches zero and it gets permanently disabled, which is sufficient.
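+
+A minimal sketch of this scoring pass (struct and helper names are
+assumptions; integer math is used so that *4/5 and *5/4 stand for the 0.8 and
+1.25 factors):
+
+    /* rule_matches() is an assumed helper telling whether a CPU belongs to
+     * the set designated by a rule's "XXX" subject.
+     */
+    static void score_cpus(struct cpu_info *cpu, int nbcpu,
+                           const struct cpu_rule *rule, int nbrules)
+    {
+        int i, r;
+
+        for (i = 0; i < nbcpu; i++) {
+            cpu[i].score = 1000;
+            for (r = 0; r < nbrules; r++) {
+                int match = rule_matches(&cpu[i], &rule[r]);
+
+                switch (rule[r].action) {
+                case ACT_IGNORE:
+                    if (match)
+                        cpu[i].score = 0;
+                    break;
+                case ACT_RESTRICT:
+                    if (!match)
+                        cpu[i].score = 0;
+                    break;
+                case ACT_AVOID:
+                    if (match)
+                        cpu[i].score = cpu[i].score * 4 / 5;
+                    break;
+                case ACT_PREFER:
+                    if (match)
+                        cpu[i].score = cpu[i].score * 5 / 4;
+                    break;
+                }
+            }
+        }
+    }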
+
+Then sort according to score, pick at least min_thr CPUs, and keep adding CPUs
+as long as max_thr is not reached and the score stays >= 1000 (stop at "avoid"
+CPUs once min_thr is satisfied). This gives the thread count. It
+does not permit anything inter-CPU though. E.g. large vs medium vs small cores,
+or sort by locality or frequency. But maybe these ones would use a different
+strategy then and would use the score as a second sorting key (after which
+one?). Or maybe there would be 2 passes, one which avoids <1000 and another
+one which completes up to #min_thr including those <1000, in which case we
+never sort per score.
+
+We can do a bit better to respect the tgrp min/max as well: we can count what
+it implies in terms of number of tgrps (#LLC or clusters) and decide to refrain
+from adding threads which would exceed max_tgrp, but we'd possibly continue to
+add score<1000 CPUs until at least enough threads to reach min_tgrp.
+
+######## new captures ###########
+CIX-P1 / radxa Orion O6 (no topology exported):
+$ ~/haproxy/haproxy  -dc -f /dev/null
+grp=[1..12] thr=[1..12]
+first node = 0
+Note: threads already set to 12
+going to start with nbthread=12 nbtgroups=1
+[keep] thr=  0 -> cpu=  0 pk=00 no=-1 di=00 cl=000 ts=000 capa=1024
+[keep] thr=  1 -> cpu=  1 pk=00 no=-1 di=00 cl=000 ts=001 capa=278
+[keep] thr=  2 -> cpu=  2 pk=00 no=-1 di=00 cl=000 ts=002 capa=278
+[keep] thr=  3 -> cpu=  3 pk=00 no=-1 di=00 cl=000 ts=003 capa=278
+[keep] thr=  4 -> cpu=  4 pk=00 no=-1 di=00 cl=000 ts=004 capa=278
+[keep] thr=  5 -> cpu=  5 pk=00 no=-1 di=00 cl=000 ts=005 capa=905
+[keep] thr=  6 -> cpu=  6 pk=00 no=-1 di=00 cl=000 ts=006 capa=905
+[keep] thr=  7 -> cpu=  7 pk=00 no=-1 di=00 cl=000 ts=007 capa=866
+[keep] thr=  8 -> cpu=  8 pk=00 no=-1 di=00 cl=000 ts=008 capa=866
+[keep] thr=  9 -> cpu=  9 pk=00 no=-1 di=00 cl=000 ts=009 capa=984
+[keep] thr= 10 -> cpu= 10 pk=00 no=-1 di=00 cl=000 ts=010 capa=984
+[keep] thr= 11 -> cpu= 11 pk=00 no=-1 di=00 cl=000 ts=011 capa=1024
+########
+
+2025-02-25 - clarification on the configuration
+-----------------------------------------------
+
+The "two dimensions" above can in fact be summarized like this:
+
+  - exposing the ability for the user to perform the same as "taskset",
+    i.e. restrict the usage to a static subset of the CPUs. We could then
+    have "cpu-set only-node0", "0-39", "ignore-smt1", "ignore-little", etc.
+    => the user defines precise sets to be kept/evicted.
+
+  - then letting the user express what they want to do with the remaining
+    cores. This is a strategy/policy that is used to:
+      - count the optimal number of threads (when not forced), also keeping
+        in mind that it cannot be more than 32/64 * maxtgroups if set.
+      - sort CPUs by order of preference (for when threads are forced or
+        a thread-hard-limit is set).
+
+    It can partially overlap with the first one. For example, the default
+    strategy could be to focus on a single node. If the user has limited their
+    usage to cores of both nodes, the policy could still further limit this.
+    But this time it should only be a matter of sorting and preference, i.e.
+    nbthread and cpuset are respected. If a policy prefers the node with more
+    cores first, it will sort them according to this, and its algorithm for
+    counting cores will only be used if nbthread is not set, otherwise it may
+    very well end up on two nodes to respect the user's choice.
+
+And once all of this is done, thread groups should be formed based on the
+remaining topology. Similarly, if the number of tgroups is not set, the
+algorithm must try to propose one based on the topology and the maxtgroups
+setting (i.e. find a divisor of the #LLC that's lower than or equal to
+maxtgroups), otherwise the configured number of tgroups is respected. Then
+the number of LLCs will be divided by this number of tgroups, and as many
+threads as enabled CPUs of each LLC will be assigned to these respective
+groups.
+
+In the end we should have groups bound to cpu sets, and threads belonging
+to groups mapped to all accessible cpus of these groups.
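+
+A small sketch of the tgroup computation described above (names assumed, not
+actual code):
+
+    /* pick the number of thread groups: the largest divisor of the LLC
+     * count that does not exceed maxtgroups, unless the user forced one.
+     */
+    static int pick_nbtgroups(int nb_llc, int forced, int maxtgroups)
+    {
+        int grps;
+
+        if (forced)
+            return forced;
+
+        for (grps = maxtgroups; grps > 1; grps--)
+            if (nb_llc % grps == 0)
+                return grps;
+        return 1;
+    }
+
+    /* each group then covers nb_llc / nbtgroups LLCs and receives as many
+     * threads as there are enabled CPUs in those LLCs.
+     */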
+
+Note: clusters may be finer than LLCs because they could report finer
+information. We could have a big and a medium cluster share the same L3
+for example. However, not all boards report their cluster number (see CIX-P1
+above), whereas the capacity info still makes it possible to figure that out
+and should probably be used for it. At this point it would seem logical to say
+that the cluster number is re-adjusted based on the claimed capacity, at
+least to avoid accidentally mixing workloads on heterogeneous cores. But
+sorting by cluster number might not necessarily work if numbers are allocated
+randomly.
+So we might need a distinct metric that doesn't require to override the
+system's numbering, like a "set", "group", "team", "bond", "bunch", "club",
+"band", ... that would be first sorted based on LLC (and no finer), and
+second based on capacity, then on L2 etc. This way we should be able to
+respect topology when forming groups.
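+
+A comparator sketch for such an ordering (field names invented; "capa" as
+reported in the captures above), usable with qsort():
+
+    /* sort key: LLC first (and no finer), then claimed capacity so that
+     * heterogeneous cores behind a shared L3 are not mixed, then L2.
+     */
+    static int cmp_cpu_set(const void *pa, const void *pb)
+    {
+        const struct cpu_info *a = pa, *b = pb;
+
+        if (a->llc_id != b->llc_id)
+            return a->llc_id - b->llc_id;
+        if (a->capa != b->capa)
+            return b->capa - a->capa;   /* bigger cores first */
+        return a->l2_id - b->l2_id;
+    }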
+
+Note: we must only consider as LLC a cache level which covers more than one
+      core! Otherwise the real LLC probably exists and is unique/shared, but
+      is simply not reported.
+      => maybe this should be done very early when counting CPUs ?
+      We need to store the LLC level somewhere in the topo.