--- /dev/null
+2023-07-04 - automatic grouping for NUMA
+
+
+Xeon: (W2145)
+
+willy@debian:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+Wtap: i7-8650U
+
+willy@wtap:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+pcw: i7-6700k
+
+willy@pcw:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,4
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+nfs: N5105, v5.15
+
+willy@nfs:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+eeepc: Atom N2800, kernel 5.4 : no L3; one L2 per pair of CPUs, not shared globally.
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@eeepc:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+
+dev13: Ryzen 2700X
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+haproxy@dev13:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8-9
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+dev12: Ryzen 5800X
+
+haproxy@dev12:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,8
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-15
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+
+amd24: EPYC 74F3
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,24
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-2,24-26
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:6-8,30-32
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+willy@mt:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,24
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-47
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-47
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,24
+
+
+xeon24: Gold 6212U
+
+willy@mt01:~$ grep '' /sys/devices/system/cpu/cpu8/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,32
+/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:0-47
+/sys/devices/system/cpu/cpu8/cache/index0/type:Data
+/sys/devices/system/cpu/cpu8/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu8/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu8/cache/index3/type:Unified
+
+
+SPR 8480+
+
+$ grep -a '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-55,112-167
+/sys/devices/system/node/node1/cpulist:56-111,168-223
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0,112
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-55,112-167
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,112
+
+$ grep -a '' /sys/devices/system/cpu/cpu0/cache/*/shared_cpu_list
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,112
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-55,112-167
+
+
+UP Board - Atom x5-Z8350 : no L3, exactly like Armada8040
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu{0,1,2,3}/cache/index2/*list
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_list:2-3
+
+willy@up1:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+Atom D510 - kernel 2.6.33
+
+$ strings -fn1 sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list: 0,2
+sys/devices/system/cpu/cpu0/cache/index0/type: Data
+sys/devices/system/cpu/cpu0/cache/index1/type: Instruction
+sys/devices/system/cpu/cpu0/cache/index2/type: Unified
+
+$ strings -fn1 sys/devices/system/cpu/cpu?/topology/*list
+sys/devices/system/cpu/cpu0/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu0/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu1/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu1/topology/thread_siblings_list: 1,3
+sys/devices/system/cpu/cpu2/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu2/topology/thread_siblings_list: 0,2
+sys/devices/system/cpu/cpu3/topology/core_siblings_list: 0-3
+sys/devices/system/cpu/cpu3/topology/thread_siblings_list: 1,3
+
+mcbin: Armada 8040 : no L3; indistinguishable from an L3 that is simply not reported
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+root@lg7:~# grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/monolithic: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Ampere/Hemisphere: Ampere Altra 80-26 : L3 not reported
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@ampere:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-79
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-79
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@ampere:~$ grep '' /sys/devices/system/node/node*/cpulist
+/sys/devices/system/node/node0/cpulist:0-39
+/sys/devices/system/node/node1/cpulist:40-79
+
+
+LX2A: LX2160A => L3 not reported
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-1
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu2/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu2/cache/index0/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index1/shared_cpu_list:2
+/sys/devices/system/cpu/cpu2/cache/index2/shared_cpu_list:2-3
+/sys/devices/system/cpu/cpu2/cache/index0/type:Data
+/sys/devices/system/cpu/cpu2/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu2/cache/index2/type:Unified
+
+willy@lx2a:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-15
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-15
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+
+Rock5B: RK3588 (big-little A76+A55)
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
+
+rock@rock-5b:~$ grep '' /sys/devices/system/cpu/cpu{0,4,6}/topology/*list
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+/sys/devices/system/cpu/cpu4/topology/core_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:4-5
+/sys/devices/system/cpu/cpu4/topology/die_cpus_list:4
+/sys/devices/system/cpu/cpu4/topology/package_cpus_list:4-5
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+/sys/devices/system/cpu/cpu6/topology/core_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/core_siblings_list:6-7
+/sys/devices/system/cpu/cpu6/topology/die_cpus_list:6
+/sys/devices/system/cpu/cpu6/topology/package_cpus_list:6-7
+/sys/devices/system/cpu/cpu6/topology/thread_siblings_list:6
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:414
+/sys/devices/system/cpu/cpu1/cpu_capacity:414
+/sys/devices/system/cpu/cpu2/cpu_capacity:414
+/sys/devices/system/cpu/cpu3/cpu_capacity:414
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+/sys/devices/system/cpu/cpu6/cpu_capacity:1024
+/sys/devices/system/cpu/cpu7/cpu_capacity:1024
+
+
+Firefly: RK3399 (2xA72 + 4xA53) kernel 6.1.28
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu0/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu0/cache/index?/type: No such file or directory
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+root@firefly:~# dmesg|grep cacheinfo
+[ 0.006290] cacheinfo: Unable to detect cache hierarchy for CPU 0
+[ 0.016339] cacheinfo: Unable to detect cache hierarchy for CPU 1
+[ 0.017692] cacheinfo: Unable to detect cache hierarchy for CPU 2
+[ 0.019050] cacheinfo: Unable to detect cache hierarchy for CPU 3
+[ 0.020478] cacheinfo: Unable to detect cache hierarchy for CPU 4
+[ 0.021660] cacheinfo: Unable to detect cache hierarchy for CPU 5
+[ 1.990108] cacheinfo: Unable to detect cache hierarchy for CPU 0
+
+root@firefly:~# grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus:0f
+/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/cluster_id:0
+/sys/devices/system/cpu/cpu0/topology/core_cpus:01
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:3f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-5
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-5
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:381
+/sys/devices/system/cpu/cpu1/cpu_capacity:381
+/sys/devices/system/cpu/cpu2/cpu_capacity:381
+/sys/devices/system/cpu/cpu3/cpu_capacity:381
+/sys/devices/system/cpu/cpu4/cpu_capacity:1024
+/sys/devices/system/cpu/cpu5/cpu_capacity:1024
+
+
+VIM3L: S905D3 (4*A55), kernel 5.14.10
+
+$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_cpus:1
+/sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:f
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
+/sys/devices/system/cpu/cpu0/topology/die_cpus:1
+/sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
+/sys/devices/system/cpu/cpu0/topology/die_id:-1
+/sys/devices/system/cpu/cpu0/topology/package_cpus:f
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:1
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+$ grep '' /sys/devices/system/cpu/cpu0/cache/index?/{shared_cpu_list,type}
+/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
+/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
+/sys/devices/system/cpu/cpu0/cache/index0/type:Data
+/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
+/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
+
+$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
+/sys/devices/system/cpu/cpu0/cpu_capacity:1024
+/sys/devices/system/cpu/cpu1/cpu_capacity:1024
+/sys/devices/system/cpu/cpu2/cpu_capacity:1024
+/sys/devices/system/cpu/cpu3/cpu_capacity:1024
+
+
+Odroid-N2: S922X (4*A73 + 2*A53), kernel 4.9.254
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu*/cache/index?/{shared_cpu_list,type}
+grep: /sys/devices/system/cpu/cpu*/cache/index?/shared_cpu_list: No such file or directory
+grep: /sys/devices/system/cpu/cpu*/cache/index?/type: No such file or directory
+
+willy@n2:~$ sudo dmesg|grep -i 'cache hi'
+[ 0.649924] Unable to detect cache hierarchy for CPU 0
+
+No cpu_capacity reported.
+
+Note that it reports 2 physical packages!
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu0/topology/*
+/sys/devices/system/cpu/cpu0/topology/core_id:0
+/sys/devices/system/cpu/cpu0/topology/core_siblings:03
+/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-1
+/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
+/sys/devices/system/cpu/cpu0/topology/thread_siblings:01
+/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
+
+willy@n2:~$ grep '' /sys/devices/system/cpu/cpu4/topology/*
+/sys/devices/system/cpu/cpu4/topology/core_id:2
+/sys/devices/system/cpu/cpu4/topology/core_siblings:3c
+/sys/devices/system/cpu/cpu4/topology/core_siblings_list:2-5
+/sys/devices/system/cpu/cpu4/topology/physical_package_id:1
+/sys/devices/system/cpu/cpu4/topology/thread_siblings:10
+/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4
+
+StarFive VisionFive2 - JH7110, kernel 5.15
+
+willy@starfive:~/haproxy$ ./haproxy -c -f cps3.cfg
+thr 0 -> cpu 0 onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=000 l1=000
+thr 1 -> cpu 1 onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=001 l1=001
+thr 2 -> cpu 2 onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=002 l1=002
+thr 3 -> cpu 3 onl=1 bnd=1 pk=00 no=-1 l3=-1 cl=000 l2=000 ts=003 l1=003
+Configuration file is valid
+
+Graviton2 / Graviton3 ?
+
+
+On s390x not everything is available:
+
+ https://www.ibm.com/docs/en/linux-on-systems?topic=cpus-cpu-topology
+
+ /sys/devices/system/cpu/cpu<N>/topology/thread_siblings
+ /sys/devices/system/cpu/cpu<N>/topology/core_siblings
+ /sys/devices/system/cpu/cpu<N>/topology/book_siblings
+ /sys/devices/system/cpu/cpu<N>/topology/drawer_siblings
+
+ # lscpu -e
+ CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
+ 0 1 0 0 0 0 0:0:0:0 yes yes horizontal 0
+ 1 1 0 0 0 0 1:1:1:1 yes yes horizontal 1
+ 2 1 0 0 0 1 2:2:2:2 yes yes horizontal 2
+ 3 1 0 0 0 1 3:3:3:3 yes yes horizontal 3
+ 4 1 0 0 0 2 4:4:4:4 yes yes horizontal 4
+ 5 1 0 0 0 2 5:5:5:5 yes yes horizontal 5
+ 6 1 0 0 0 3 6:6:6:6 yes yes horizontal 6
+ 7 1 0 0 0 3 7:7:7:7 yes yes horizontal 7
+ 8 0 1 1 1 4 8:8:8:8 yes yes horizontal 8
+ ...
+
+Intel E5-2600v2/v3 has two L3:
+ https://www.enterpriseai.news/2014/09/08/intel-ups-performance-ante-haswell-xeon-chips/
+
+More info on these, and s390's "books" (mostly L4 in fact):
+ https://groups.google.com/g/fa.linux.kernel/c/qgAxjYq8ohI
+
+########################################
+Analysis:
+ - some server ARM CPUs (Altra, LX2) do not return any L3 info though they
+ DO have some. They stop at L2.
+
+ - other CPUs like Atom N2800 and Armada 8040 do not have L3.
+
+ => there's no apparent way to detect that the server CPUs do have an L3.
+  => or maybe we should consider it more likely that there is one than
+     none ? Armada works much better with groups than without. It's
+     basically the same topology as N2800.
+
+  => Do we really care then ? No L3 = same L3 for everyone. The problem is
+     that machines truly lacking an L3 will differentiate on L2 while the
+     others will not. Maybe we should consider that it does not make sense
+     to cut groups on L2 (i.e. under no circumstances will we create one
+     group per core).
+
+ => This would mean:
+       - regardless of L3, consider the LLC. If the LLC has more than one
+         core per instance, it's likely the real last level (not true on
+         LX2, but better to use 8 groups of 2 than nothing).
+
+     - otherwise, if there's a single core per instance, it's unlikely
+       to be the LLC, so we can assume the LLC is unified. Note that
+       some systems such as LX2/Armada8K (and Neoverse-N1 devices as
+       well) may have 2 cores per L2, yet this doesn't allow us to infer
+       anything regarding the absence of an L3: Core2-quad has 2 cores
+       per L2 with no L3, like Armada8K, while LX2 has 2 cores per L2
+       yet does have an L3 which is not necessarily reported.
+
+ - this needs to be done per {node,package} !
+ => core_siblings and thread_siblings seem to be the only portable
+ ones to figure packages and threads
+
+At the very least, when multiple nodes are possibly present, there is a
+"node0", "node1", etc. symlink in each cpu entry. That requires a lookup per
+cpu directory though, while reading /sys/devices/system/node/node*/cpulist is
+much cheaper.
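+
+For illustration, parsing such a cpulist string ("0-2,24-26") into a mask is
+indeed cheap; a minimal standalone sketch in C (illustrative only, CPUs >= 64
+are ignored here):
+
+  #include <stdint.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+
+  static uint64_t parse_cpulist(const char *s)
+  {
+      uint64_t mask = 0;
+
+      while (*s && *s != '\n') {
+          char *end;
+          long lo = strtol(s, &end, 10), hi = lo;
+
+          if (*end == '-')                     /* range such as "24-26" */
+              hi = strtol(end + 1, &end, 10);
+          for (long c = lo; c <= hi && c < 64; c++)
+              mask |= 1ULL << c;
+          s = (*end == ',') ? end + 1 : end;   /* next range if any */
+      }
+      return mask;
+  }
+
+  int main(void)
+  {
+      printf("%#llx\n", (unsigned long long)parse_cpulist("0-2,24-26"));
+      return 0;                                /* prints 0x7000007 */
+  }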
+
+There's some redundancy in this. Probably a better approach:
+
+1) if there is more than 1 CPU:
+ - if cache/index3 exists, use its cpulist to pre-group entries.
+ - else if topology or node exists, use (node,package,die,core_siblings) to
+ group entries
+ - else pre-create a single large group
+
+2) if there is more than 1 CPU and less than max#groups:
+  - for each group, if cache/index3 does not exist but cache/index2 does, and
+    some index2 entries contain at least two CPUs from different cores (or a
+    single one on a 2-core system), then use that to re-split the group.
+
+ - if in the end there are too many groups, remerge some of them (?) or stick
+ to the previous layout (?)
+
+ - if in the end there are too many CPUs in a group, cut as needed, if
+ possible with an integral result (/2, /3, ...)
+
+3) L1 cache / thread_siblings should be used to associate CPUs by cores in
+ the same groups.
+
+Maybe instead it should be done bottom->top, by collecting info and merging
+groups while keeping CPU lists ordered to ease later splitting (a C sketch
+follows the steps below):
+
+ 1) create a group per bound CPU
+ 2) based on thread_siblings, detect CPUs that are on the same core, merge
+ their groups. They may not always create similarly sized groups.
+ => eg: epyc keeps 24 groups such as {0,24}, ...
+ ryzen 2700x keeps 4 groups such as {0,1}, ...
+ rk3588 keeps 3 groups {0-3},{4-5},{6-7}
+ 3) based on cache index0/1, detect CPUs that are on the same L1 cache,
+ merge their groups. They may not always create similarly sized groups.
+ 4) based on cache index2, detect CPUs that are on the same L2 cache, merge
+ their groups. They may not always create similarly sized groups.
+      => eg: mcbin now keeps 2 groups {0-1},{2-3}
+  5) At this point there may possibly be too many groups (still one per CPU,
+     e.g. when no cache info was found or there are many cores with their own
+     L2 like on SPR) or groups that are too large (when all cores are indeed
+     on the same L2).
+
+ 5.1) if there are as many groups as bound CPUs, merge them all together in
+ a single one => lx2, altra, mcbin
+ 5.2) if there are still more than max#groups, merge them all together in a
+ single one since the splitting criterion is not relevant
+ 5.3) if there is a group with too many CPUs, split it in two if integral,
+ otherwise 3, etc, trying to add the least possible number of groups.
+ If too difficult (e.g. result less than half the authorized max),
+ let's just round around N/((N+63)/64).
+ 5.4) if at the end there are too many groups, warn that we can't optimize
+ the setup and are limiting ourselves to the first node or 64 CPUs.
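+
+A rough C sketch of the merge part (steps 1..4) and of the 5.3 split rule,
+assuming the shared-CPU masks were already read from sysfs and limiting the
+illustration to 64 CPUs:
+
+  #include <stdint.h>
+
+  #define MAX_CPUS 64
+
+  int grp[MAX_CPUS];                  /* per-CPU group id */
+
+  /* step 1: one group per CPU */
+  static void init_groups(void)
+  {
+      for (int c = 0; c < MAX_CPUS; c++)
+          grp[c] = c;
+  }
+
+  /* steps 2..4: merge the groups of CPUs found in the same sharing mask
+   * (thread_siblings first, then L1, then L2 shared_cpu masks).
+   */
+  static void merge_level(const uint64_t *shared /* [MAX_CPUS] */)
+  {
+      for (int c = 0; c < MAX_CPUS; c++)
+          for (int o = c + 1; o < MAX_CPUS; o++)
+              if ((shared[c] >> o) & 1ULL && grp[o] != grp[c]) {
+                  int from = grp[o], to = grp[c];
+
+                  for (int k = 0; k < MAX_CPUS; k++)
+                      if (grp[k] == from)
+                          grp[k] = to;  /* merge o's group into c's */
+              }
+  }
+
+  /* step 5.3: split an oversized group of n CPUs into the least number of
+   * roughly equal groups of at most 64: n/((n+63)/64) CPUs per group.
+   */
+  static int cpus_per_group(int n)
+  {
+      return n / ((n + 63) / 64);
+  }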
+
+Observations:
+ - lx2 definitely works better with everything bound together than by creating
+ 8 groups (~130k rps vs ~120k rps)
+ => does this mean we should assume a unified L3 if there's no L3 info, and
+ remerge everything ? Likely Altra would benefit from this as well. mcbin
+ doesn't notice any change (within noise in both directions)
+
+ - on x86 13th gen, 2 P-cores and 8 E-cores. The P-cores support HT, not the
+ E-cores. There's no cpu_capacity there, but the cluster_id is properly set.
+    => proposal: when a machine reports both single-threaded cores and SMT
+       cores, consider the SMT-capable ones to be the bigger ones and use them.
+
+Problems: how should auto-detection interfere with user settings ?
+
+- Case 1: program started with a reduced taskset
+  => current: this serves to set the thread count first, and to map default
+     threads to CPUs if they are not assigned by a cpu-map.
+
+ => we want to keep that behavior (i.e. use all these threads) but only
+ change how the thread-groups are arranged.
+
+  - example: started on the first 6c12t of an EPYC74F3, it should
+    automatically create 2 groups for the two CCXs involved.
+
+ => should we brute-force all thread-groups combinations to figure how the
+ threads will spread over cpu-map and which one is better ? Or should we
+ decide to ignore input mapping as soon as there's at least one cpu-map?
+ But then which one to use ? Or should we consider that cpu-map only works
+ with explicit thread-groups ?
+
+- Case 2: taskset not involved, but nbthread and cpu-map in the config. In
+ fact a pretty standard 2.4-2.8 config.
+ => maybe the presence of cpu-map and no thread-groups should be sufficient
+ to imply a single thread-group to stay compatible ? Or maybe start as
+ many thread-groups as are referenced in cpu-map ? Seems like cpu-map and
+ thread-groups work hand-in-hand regarding topology since cpu-map
+     designates hardware CPUs so the user knows better than haproxy. Thus
+     why should we try to do better ?
+
+- Case 3: taskset not involved, nbthread not involved, cpu-map not involved,
+ only thread-groups
+ => seems like an ideal approach. Take all online CPUs and try to cut them
+ into equitable thread groups ? Or rather, since nbthreads is not forced,
+ better sort the clusters and bind to the N first clusters only ? If too
+ many groups for the clusters, then try to refine them ?
+
+- Case 4: nothing specified at all (default config, target)
+ => current: uses only one thread-group with all threads (max 64).
+ => desired: bind only to performance cores and cut them in a few groups
+ based on l3, package, cluster etc.
+
+- Case 5: nbthread only in the config
+ => might match a docker use case. No group nor cpu-map configured. Figure
+ the best group usage respecting the thread count.
+
+- Case 6: some constraints are enforced in the config (e.g. threads-hard-limit,
+ one-thread-per-core, etc).
+ => like 3, 4 or 5 but with selection adjustment.
+
+- Case 7: thread-groups and generic cpu-map 1/all, 2/all... in the config
+ => user just wants to use cpu-map as a taskset alternative
+  => need to figure the number of threads first, then cut them into groups
+     like today, and only then are the cpu-map entries applied. Can we do
+     better ? Not sure.
+ Maybe just when cpu-map is too lax (e.g. all entries reference the same
+ CPUs). Better use a special "cpumap all/all 0-19" for this, but not
+ implemented for now.
+
+Proposal:
+ - if there is any cpu-map, disable automatic CPU assignment
+ - if there is any cpu-map, disable automatic thread group detection
+ - if taskset was forced, disable automatic CPU assignment
+
+### 2023-07-17 ###
+
+=> step 1: mark CPUs enabled at boot (cpu_detect_usable)
+// => step 2: mark CPUs referenced in cpu-map => no, no real meaning
+=> step 3: identify all CPUs topologies + NUMA (cpu_detect_topology)
+
+=> step 4: if taskset && !cpu-map, mark all non-bound CPUs as unusable (UNAVAIL ?)
+ => which is the same as saying if !cpu-map.
+=> step 5: if !cpu-map, sort usable CPUs and find the best set to use
+//=> step 6: if cpu-map, mark all non-covered CPUs as unusable => not necessarily possible if partial cpu-map
+
+=> step 7: if thread-groups && cpu-map, nothing else to do
+=> step 8: if cpu-map && !thread-groups, thread-groups=1
+=> step 9: if thread-groups && !cpu-map, use that value to cut the thread set
+=> step 10: if !cpu-map && !thread-groups, detect the optimal thread-group count
+
+=> step 11: if !cpu-map, cut the thread set into mostly fair groups and assign
+ the group numbers to CPUs; create implicit cpu-maps.
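+
+A hedged sketch of steps 7..10 above as code (all names hypothetical):
+
+  extern int cfg_cpu_map;    /* non-zero if any cpu-map is configured */
+  extern int cfg_tgroups;    /* thread-groups from the config, 0 if unset */
+  extern int detect_optimal_tgroups(void);
+
+  int resolve_tgroups(void)
+  {
+      if (cfg_cpu_map && cfg_tgroups)
+          return cfg_tgroups;              /* step 7: nothing else to do */
+      if (cfg_cpu_map)
+          return 1;                        /* step 8: single group for compat */
+      if (cfg_tgroups)
+          return cfg_tgroups;              /* step 9: cut the thread set */
+      return detect_optimal_tgroups();     /* step 10: full auto-detection */
+  }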
+
+Ideas:
+ - use minthr and maxthr.
+ If nbthread, minthr=maxthr=nbthread, else if taskset_forced, maxthr=taskset_thr,
+ minthr=1, else minthr=1, maxthr=cpus_enabled.
+
+ - use CPU_F_ALLOWED (or DISALLOWED?) and CPU_F_REFERENCED and CPU_F_EXCLUDED ?
+ Note: cpu-map doesn't exclude, it only includes. Taskset does exclude. Also,
+ cpu-map only includes the CPUs that will belong to the correct groups & threads.
+
+ - Usual startup: taskset presets the CPU sets and sets the thread count. Tgrp
+ defaults to 1, then threads indicated in cpu-map get their CPU assigned.
+ Other ones are not changed. If we say that cpu-map => tgrp==1 then it means
+ we can infer automatic grouping for group 1 only ?
+       => it could be said that the CPUs of all enabled groups mentioned in
+          cpu-map are considered usable, but we don't know how many of these
+          will really have threads started on them.
+
+ => maybe completely ignore cpu-map instead (i.e. fall back to thread-groups 1) ?
+ => automatic detection would mean:
+ - if !cpu-map && !nbthrgrp => must automatically detect thgrp
+ - if !cpu-map => must automatically detect binding
+ - otherwise nothing
+
+Examples of problems:
+
+ thread-groups 4
+ nbthreads 128
+ cpu-map 1/all 0-63
+ cpu-map 2/all 128-191
+
+ => 32 threads per group, hence grp 1 uses 0-63 and grp 2 128-191,
+ grp 3 and grp 4 unknown, in practice on boot CPUs.
+
+ => could we demand that if one cpu-map is specified, then all groups
+      are covered ? Do we really need this after all ? i.e. let's just not
+      bind the other threads and that's all (which is what is documented).
+
+
+Calls from haproxy.c:
+
+ cpu_detect_usable()
+ cpu_detect_topology()
+
++ thread_detect_count()
+ => compute nbtgroups
+ => compute nbthreads
+
+ thread_assign_cpus() ?
+
+ check_config_validity()
+
+
+BUGS:
+ - cpu_map[0].proc still used for the whole process in daemon mode (though not
+ in foreground mode)
+ -> whole process bound to thread group 1
+ -> binding not working in foreground
+
+  - cpu_map[x].proc ANDed with the thread's map despite the thread's map
+    apparently never being set
+ -> group binding ignored ?
+
+2023-09-05
+----------
+Remember to distinguish between sorting (used for grouping) and preference.
+We should avoid simply selecting the first CPUs, as it encourages the use of
+wrong grouping criteria. E.g. CPU capacity has no business being used for
+grouping, it's used for selecting. SMT support, however, does matter for
+grouping because it allows packing together threads of the same core.
+
+We should also have an option to enable/disable SMT (e.g. max threads per core)
+so that we can skip siblings of cores already assigned. This can be convenient
+with the network stack running on the other sibling.
+
+
+2024-12-26
+----------
+
+Some interesting cases about intel 14900. The CPU has 8 P-cores and 16 E-cores.
+Experiments in the lab show excellent performance by binding the network to E
+cores and haproxy to P cores. Here's how the clusters are made:
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/package_cpus | sort |uniq -c
+ 32 ffffffff
+
+ => expected
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/die_cpus | sort |uniq -c
+ 32 ffffffff
+
+ => all CPUs on the same die
+
+$ grep -h . /sys/devices/system/cpu/cpu*/topology/cluster_cpus | sort |uniq -c
+ 2 00000003
+ 2 0000000c
+ 2 00000030
+ 2 000000c0
+ 2 00000300
+ 2 00000c00
+ 2 00003000
+ 2 0000c000
+ 4 000f0000
+ 4 00f00000
+ 4 0f000000
+ 4 f0000000
+
+ => 1 "cluster" per core on each P-core (2 threads, 8 clusters total)
+ => 1 "cluster" per 4 E-cores (4 clusters total)
+ => It can be difficult to split that into groups by just using this topology.
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort |uniq -c
+ 32 0-31
+
+ => everyone shares a uniform L3 cache
+
+$ grep -h . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_map | sort |uniq -c
+ 2 00000003
+ 2 0000000c
+ 2 00000030
+ 2 000000c0
+ 2 00000300
+ 2 00000c00
+ 2 00003000
+ 2 0000c000
+ 4 000f0000
+ 4 00f00000
+ 4 0f000000
+ 4 f0000000
+
+ => L2 is split like the respective "clusters" above.
+
+Seems like one would like to split them into 12 groups :-/ Maybe it still
+remains relevant to consider L3 for grouping, and core performance for
+selection (e.g. evict/prefer E-cores depending on policy).
+
+Differences between P and E cores on 14900:
+
+- acpi_cppc/*perf : pretty useful but not always there (e.g. aloha)
+- cache index0 (L1d): 48 vs 32k (the bigger core has the bigger cache)
+- cache index1 (L1i): 32 vs 64k (the smaller core has the bigger cache)
+- cache index2: 2 vs 4M, but dedicated per core vs shared per cluster (4 cores)
+
+=> the presence of a larger "cluster" with less cache per core on average
+   is probably an indication of a set of smaller CPUs. Warning however, some
+ CPUs (e.g. S922X) have a large (4) cluster of big cores and a small (2)
+ cluster of little cores.
+
+
+diff -urN cpu0/acpi_cppc/lowest_nonlinear_perf cpu16/acpi_cppc/lowest_nonlinear_perf
+--- cpu0/acpi_cppc/lowest_nonlinear_perf 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/lowest_nonlinear_perf 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-20
++15
+diff -urN cpu0/acpi_cppc/nominal_perf cpu16/acpi_cppc/nominal_perf
+--- cpu0/acpi_cppc/nominal_perf 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/nominal_perf 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/acpi_cppc/reference_perf cpu16/acpi_cppc/reference_perf
+--- cpu0/acpi_cppc/reference_perf 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/acpi_cppc/reference_perf 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-40
++24
+diff -urN cpu0/cache/index0/size cpu16/cache/index0/size
+--- cpu0/cache/index0/size 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index0/size 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-48K
++32K
+diff -urN cpu0/cache/index1/shared_cpu_list cpu16/cache/index1/shared_cpu_list
+--- cpu0/cache/index1/shared_cpu_list 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_list 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16
+diff -urN cpu0/cache/index1/shared_cpu_map cpu16/cache/index1/shared_cpu_map
+--- cpu0/cache/index1/shared_cpu_map 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/shared_cpu_map 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++00010000
+diff -urN cpu0/cache/index1/size cpu16/cache/index1/size
+--- cpu0/cache/index1/size 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index1/size 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-32K
++64K
+diff -urN cpu0/cache/index2/shared_cpu_list cpu16/cache/index2/shared_cpu_list
+--- cpu0/cache/index2/shared_cpu_list 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/shared_cpu_list 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+--- cpu0/cache/index2/size 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/cache/index2/size 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-2048K
++4096K
+diff -urN cpu0/topology/cluster_cpus cpu16/topology/cluster_cpus
+--- cpu0/topology/cluster_cpus 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-00000003
++000f0000
+diff -urN cpu0/topology/cluster_cpus_list cpu16/topology/cluster_cpus_list
+--- cpu0/topology/cluster_cpus_list 2024-12-26 18:39:27.563410317 +0100
++++ cpu16/topology/cluster_cpus_list 2024-12-26 18:40:39.531408186 +0100
+@@ -1 +1 @@
+-0-1
++16-19
+
+For acpi_cppc, the values differ between machines, but it looks like
+nominal_perf is always usable:
+
+14900k:
+$ grep '' cpu8/acpi_cppc/*
+cpu8/acpi_cppc/feedback_ctrs:ref:85172004640 del:143944480100
+cpu8/acpi_cppc/highest_perf:255
+cpu8/acpi_cppc/lowest_freq:0
+cpu8/acpi_cppc/lowest_nonlinear_perf:20
+cpu8/acpi_cppc/lowest_perf:1
+cpu8/acpi_cppc/nominal_freq:3200
+cpu8/acpi_cppc/nominal_perf:40
+cpu8/acpi_cppc/reference_perf:40
+cpu8/acpi_cppc/wraparound_time:18446744073709551615
+
+$ grep '' cpu16/acpi_cppc/*
+cpu16/acpi_cppc/feedback_ctrs:ref:84153776128 del:112977352354
+cpu16/acpi_cppc/highest_perf:255
+cpu16/acpi_cppc/lowest_freq:0
+cpu16/acpi_cppc/lowest_nonlinear_perf:15
+cpu16/acpi_cppc/lowest_perf:1
+cpu16/acpi_cppc/nominal_freq:3200
+cpu16/acpi_cppc/nominal_perf:24
+cpu16/acpi_cppc/reference_perf:24
+cpu16/acpi_cppc/wraparound_time:18446744073709551615
+
+altra:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:227098452801 del:590247062111
+highest_perf:260
+lowest_freq:1000
+lowest_nonlinear_perf:200
+lowest_perf:100
+nominal_freq:2600
+nominal_perf:260
+reference_perf:100
+
+w3-2345:
+$ grep '' /sys/devices/system/cpu/cpu0/acpi_cppc/*
+feedback_ctrs:ref:4775674480779 del:5675950973600
+highest_perf:45
+lowest_freq:0
+lowest_nonlinear_perf:8
+lowest_perf:5
+nominal_freq:0
+nominal_perf:31
+reference_perf:31
+wraparound_time:18446744073709551615
+
+Other approaches may consist in checking the CPU's max frequency via
+cpufreq, e.g. on the N2:
+
+ $ grep . /sys/devices/system/cpu/cpu?/cpufreq/scaling_max_freq
+ /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2016000
+ /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2016000
+ /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:2400000
+ /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:2400000
+ /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:2400000
+ /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:2400000
+
+However, on x86 the cores no longer all have the same frequency, as seen below
+on the W3-2345, so frequency cannot always be used to split them into groups;
+at best it may be used to sort them.
+
+ $ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
+ /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:4500000
+ /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:4500000
+ /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:4400000
+ /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq:4400000
+ /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq:4500000
+ /sys/devices/system/cpu/cpu9/cpufreq/scaling_max_freq:4500000
+ /sys/devices/system/cpu/cpu10/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu11/cpufreq/scaling_max_freq:4400000
+ /sys/devices/system/cpu/cpu12/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu13/cpufreq/scaling_max_freq:4300000
+ /sys/devices/system/cpu/cpu14/cpufreq/scaling_max_freq:4400000
+ /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq:4300000
+
+On 14900, not cool either:
+
+ $ grep -h . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq|sort|uniq -c
+ 16 4400000
+ 12 5700000
+ 4 6000000
+
+Considering that values that are within +/-10% of a cluster's min/max are still
+part of it would seem to work and would make a good rule of thumb.
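+
+A quick standalone C illustration of that rule of thumb (the sample
+frequencies mimic the 14900 values above):
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+      int freq[] = { 4400, 5700, 5700, 4400, 6000, 5700 }; /* MHz */
+      int ncpu = 6, lo[6], hi[6], ncls = 0;
+
+      for (int c = 0; c < ncpu; c++) {
+          int k;
+
+          for (k = 0; k < ncls; k++)
+              if (freq[c] * 10 >= lo[k] * 9 && freq[c] * 10 <= hi[k] * 11)
+                  break;               /* within +/-10% of class k's range */
+          if (k == ncls) {             /* no match: open a new class */
+              lo[k] = hi[k] = freq[c];
+              ncls++;
+          }
+          if (freq[c] < lo[k]) lo[k] = freq[c];
+          if (freq[c] > hi[k]) hi[k] = freq[c];
+          printf("cpu%d: class %d\n", c, k);
+      }
+      return 0;  /* 4400 lands in class 0; 5700 and 6000 merge in class 1 */
+  }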
+
+On x86, the model number might help, here on w3-2345:
+
+ $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+ 16 model : 143
+
+But not always (here: 14900K with 8xP and 16xE):
+
+ $ grep '^model\s\s' /proc/cpuinfo |sort|uniq -c
+ 32 model : 183
+
+On ARM it's rather the part number:
+
+ # a9
+ $ grep part /proc/cpuinfo
+ CPU part : 0xc09
+ CPU part : 0xc09
+
+ # a17
+ $ grep part /proc/cpuinfo
+ CPU part : 0xc0d
+ CPU part : 0xc0d
+ CPU part : 0xc0d
+ CPU part : 0xc0d
+
+ # a72
+ $ grep part /proc/cpuinfo
+ CPU part : 0xd08
+ CPU part : 0xd08
+ CPU part : 0xd08
+ CPU part : 0xd08
+
+ # a53+a72
+ $ grep part /proc/cpuinfo
+ CPU part : 0xd03
+ CPU part : 0xd03
+ CPU part : 0xd03
+ CPU part : 0xd03
+ CPU part : 0xd08
+ CPU part : 0xd08
+
+ # a53+a73
+ $ grep 'part' /proc/cpuinfo
+ CPU part : 0xd03
+ CPU part : 0xd03
+ CPU part : 0xd09
+ CPU part : 0xd09
+ CPU part : 0xd09
+ CPU part : 0xd09
+
+ # a55+a76
+ $ grep 'part' /proc/cpuinfo
+ CPU part : 0xd05
+ CPU part : 0xd05
+ CPU part : 0xd05
+ CPU part : 0xd05
+ CPU part : 0xd0b
+ CPU part : 0xd0b
+ CPU part : 0xd0b
+ CPU part : 0xd0b
+
+
+2024-12-27
+----------
+
+Such machines with P+E cores are becoming increasingly common. Some like the
+CIX-P1 can even provide 3 levels of performance: 4 big cores (A720-2.8G), 4
+medium cores (A720-2.4G), 4 little cores (A520-1.8G). Architectures like below
+will become the norm, and can be used under different policies:
+
+ +-----------------------------+
+ | L3 |
+ +---+----------+----------+---+
+ | | |
+ +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+
+ Policy: | P | P | | E | E | | E | E |
+ ------- +---+---+ +---+---+ +---+---+
+ 1 group, min: N/A 0 0
+ 1 group, max: 0 N/A N/A
+ 1 group, all: 0 0 0
+ 2 groups, min: N/A 0 1
+ 2 groups, full: 0 1 1
+ 3 groups: 0 1 2
+
+In dual-socket or multiple dies it can even become more complicated:
+
+ +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+
+ | | |
+ +---+----------+----------+---+
+ | L3.0 |
+ +-----------------------------+
+
+ +-----------------------------+
+ | L3.1 |
+ +---+----------+----------+---+
+ | | |
+ +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+
+
+Setting only a thread count would yield interesting things above:
+ 1-4T: P.0
+ 5-8T: P.0, P.1 (2 grp)
+ 9-16T: P.0, E.0, P.1, E.1 (3-4 grp)
+17-24T: PEE.0, PEE.1 (5-6 grp)
+
+With forced tgrp = 1:
+ - only fill node 0 first (P then PE, then PEE)
+
+With forced tgrp = 2:
+
+ def: P.0, P.1
+ 2-4T: P.0 only ?
+ 6-8T: P.0, P.1
+ 9-24T: PEE.0, PEE.1
+
+With dual-socket, dual-die, it becomes:
+
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E | ' | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E | ' | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+ | | | ' | | |
+ +---+----------+----------+---+ ' +---+----------+----------+---+
+ | L3.0.0 | ' | L3.1.0 |
+ +-----------------------------+ ' +-----------------------------+
+ '
+ +-----------------------------+ ' +-----------------------------+
+ | L3.0.1 | ' | L3.1.1 |
+ +---+----------+----------+---+ ' +---+----------+----------+---+
+ | | | ' | | |
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E | ' | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+ | P | P | | E | E | | E | E | ' | P | P | | E | E | | E | E |
+ +---+---+ +---+---+ +---+---+ ' +---+---+ +---+---+ +---+---+
+
+In such conditions, it could make sense to first enumerate all the available
+cores with all their characteristics, and distribute them between "buckets"
+representing the thread groups:
+
+ 1. create the min number of tgrp (tgrp.min)
+ 2. it's possible to automatically create more until tgrp.max
+ -> cores are sorted by performance then by proximity. They're
+ distributed in order into existing buckets, and if too distant,
+ then new groups are created. It could allow for example to use
+ all P-cores in the DSDD model above, split into 4 tgrp.
+ -> the total number of threads is then discovered at the end.
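+
+A rough sketch of that distribution in C, with "too distant" simplified down
+to "different LLC" and all names hypothetical:
+
+  struct core_info { int llc_id; };
+
+  /* cores[] assumed already sorted by performance then proximity; fills
+   * bucket[i] with the group of core i and returns the number of groups.
+   */
+  static int distribute(const struct core_info *cores, int nbcores,
+                        int *bucket, int tgrp_max)
+  {
+      int cur = 0, nbgrp = 1;
+
+      for (int i = 0; i < nbcores; i++) {
+          if (i && cores[i].llc_id != cores[i - 1].llc_id) {
+              cur++;                   /* too distant: open the next bucket */
+              if (cur >= tgrp_max)
+                  cur = 0;             /* out of buckets: wrap around */
+          }
+          bucket[i] = cur;
+          if (cur + 1 > nbgrp)
+              nbgrp = cur + 1;
+      }
+      return nbgrp;
+  }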
+
+
+It seems in the end that such binding policies (P, E, single/multi dies,
+single/multi sockets etc) should be made more accessible to the user. What
+we're missing in "cpu-map" is the ability to apply to the whole process in
+fact, so that it can supersede taskset. Indeed, right now, cpu-map requires
+too many details and that's why it often remains easier to deal with taskset,
+particularly when dealing with thread groups.
+
+We can revisit the situation differently. First, let's keep in mind that
+cpu-map is a restriction. It means "use no more than these", it does not
+mean "use all of these". So it totally makes sense to use it to replace
+taskset at the process level without interfering with groups detection.
+We could then have:
+
+ - "cpu-map all|process|global|? ..." to apply to the whole process
+ - then special keywords for the CPUs designation, among:
+ - package (socket) number
+ - die number (CCD)
+ - L3 number (CCX)
+ - cluster type (big/performant, medium, little/efficient)
+ - use of SMT or not, and which ones
+ - maybe optional numbers before these to indicate (any two of them),
+ e.g. "4P" to indicate "4 performance cores".
+
+Question: how would we designate "only P cores of socket 0" ? Or
+ "only thread 0 of all P cores" ?
+
+One benefit of such a declaration method is that it can often make nbthread
+unnecessary and automatic while remaining portable across a whole fleet of
+servers. E.g.
+if "cpu-map all S0P*T0" would designate thread 0 of all P-cores of socket 0, it
+would mean the same on all machines.
+
+Another benefit is that we can make cpu-map and automatic detection more
+exclusive:
+ - cpu-map all => equivalent of taskset, leaves auto-detection on
+ - cpu-map thr => disables auto-detection
+
+So in the end:
+ - cpu-map all restricts the CPUs the process may use
+ -> auto-detection starts from here and sorts them
+ - thread-groups offers more "buckets" to arrange distant CPUs in the
+ same process
+ - nbthread limits the number of threads we'll use
+ -> pick the most suited ones (at least thr.min, at most thr.max)
+ and distribute them optimally among the number of thread groups.
+
+One question remains: is it always possible to automatically configure
+thread-groups ? Maybe it's possible after the detection to set an optimal
+one between grp.min and grp.max ? (e.g. socket count, core types, etc).
+
+It still seems that a policy such as "optimize-for resources|performance"
+would help quite a bit.
+
+-> what defines a match between a CPU core and a group:
+ - cluster identification:
+     - either cluster_cpus if present (which is only sometimes the case), or:
+ - pkg+die+ccd number
+ - same LLC instance (L3 if present, L2 if no L3 etc)
+ - CPU core model ("model" on x86, "CPU part" on arm)
+ - number of SMT per core
+ - speed if known:
+ - /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf if available
+ - or /sys/devices/system/cpu/cpu15/cpufreq/scaling_max_freq +/- 10%
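+
+A possible C representation of that matching key (a sketch only; the fields
+are assumptions taken from the list above):
+
+  struct cpu_group_key {
+      int pkg, die, ccd;      /* physical location */
+      int llc_id;             /* instance of the effective last cache level */
+      int model;              /* x86 "model" / ARM "CPU part" */
+      int nb_smt;             /* threads per core */
+      int perf;               /* nominal_perf or max freq */
+  };
+
+  static int same_group(const struct cpu_group_key *a,
+                        const struct cpu_group_key *b)
+  {
+      return a->pkg == b->pkg && a->die == b->die && a->ccd == b->ccd &&
+             a->llc_id == b->llc_id && a->model == b->model &&
+             a->nb_smt == b->nb_smt &&
+             /* speeds match within +/-10% of each other */
+             a->perf * 10 >= b->perf * 9 && b->perf * 10 >= a->perf * 9;
+  }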
+
+PB: on intel P+E, clusters of E cores share the same L2+L3, but P cores are
+alone on their L2 => poor grouping.
+
+Maybe one approach could be to characterize how L3/L2 are used. E.g. on the
+14900, we have:
+ - L3 0 => all cpus there
+ - L2 0..7 => 1C2T per L2
+ - L2 8..11 => 4C4T per L2
+ => it's obvious that CPUs connected to L2 #8..11 are not the same as those
+ on L2 #0..7. We could make something with them.
+ => it does not make sense to ditch the L2 distinction due to L3 being
+ present and the same, though it doesn't make sense to use L3 either.
+ Maybe elements with a cardinality of 1 should just be ignored. E.g.
+ cores per cache == 1 => ignore L2. Probably not true per die/pkg
+ though.
+ => replace absent or irrelevant info with "?"
+
+Note that for caches we have the list of CPUs, not the list of cores, so
+we need to remap that individually to cores.
+
+Warning: die_id, core_id etc are per socket, not per system. Worse, on Altra,
+core_id has gigantic values (increasing in steps of +1 and +256). However
+core_cpus_list indicates a core's threads and could be a solution to create
+our own global core
+ID. Also, cluster_id=-1 found on all cores for A8040 on kernel 6.1.
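+
+A trivial sketch of that idea, assuming the core_cpus_list contents were
+already read into a string:
+
+  #include <stdlib.h>
+
+  /* The lowest CPU number of core_cpus_list is listed first and is unique
+   * system-wide, making it usable as a global core id, unlike core_id.
+   */
+  static int global_core_id(const char *core_cpus_list)
+  {
+      return atoi(core_cpus_list);   /* "8,32" -> 8 ; "0,112" -> 0 */
+  }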
+
+Note that LLC is always the first discriminator. But within a same LLC we can
+have the issues above (e.g. 14900).
+
+Would an intermediate approach like this work ?
+-----------------------------------------------
+ 1) first split by LLC (also test with L3-less A8040, N2800, x5-8350)
+ 2) within LLC, check if we have different cores (model, perf, freq?)
+ and resplit
+ 3) divide again so that no group has more than 64 CPUs
+
+ => it looks like from the beginning that's what we're trying to do:
+ preserve locality first, then possibly trim down the number of cores
+ if some don't bring sufficient benefit. It possibly avoids the need
+ to identify dies etc. It still doesn't completely solve the 14900
+ though.
+
+Multi-die CPUs worth checking:
+ Pentium-D (Presler, Dempsey: two 65nm dies)
+  Core2Quad Q6600/Q6700 (Kentsfield, Clovertown: two 65nm dual-core dies)
+ Core2Quad Q8xxx/Q9xxx (Yorkfield, Harpertown, Tigerton: two 45nm dual-core dies)
+ - atom 330 ("diamondville") is really a dual-die
+  - note that atom x5-z8350 ("cherry trail"), N2800 ("cedar trail") and D510
+ ("pine trail") are single-die (verified) but have two L2 caches and no L3.
+ Note that these are apparently not identified as multi-die (Q6600 has die=0).
+
+It *seems* that in order to form groups we'll first have to sort by topology,
+and only after that sort by performance so as to choose preferred CPUs.
+Otherwise we could end up trying to form inter-socket CPU groups first in
+case we're forced to mix adjacent CPUs due to too many groups.
+
+
+2025-01-07
+----------
+
+What is needed in fact is to act along two axes:
+
+ - binding restrictions: the user doesn't want the process to run on the
+   second node, on efficient cores, or on the second thread of each core,
+   so they're indicating where (not) to bind. This is a strict choice,
+   and it overrides taskset. That's the process-wide cpu-map.
+
+ - user preferences / execution profile: the user expresses their wishes
+ about how to allocate resources. This is only a binding order strategy
+ among a few existing ones that help easily decide which cores to select.
+ In this case CPUs are not enumerated. We can imagine choices as:
+
+ - full : use all permitted cores
+ - performance: use all permitted performance cores (all sockets)
+ - single-node: (like today): use all cores of a single node
+ - balanced: use a reasonable amount of perf cores (e.g. all perf
+ cores of a single socket)
+ - resources: use a single cluster of efficient cores
+ - minimal: use a single efficient core
+
+By sorting CPUs first on the performance, then applying the filtering based on
+the profile to eliminate more CPUs, then applying the limit on the desired max
+number of threads, then sorting again on the topology, it should be possible to
+draw a list of usable CPUs that can then be split in groups along the L3s.
+
+It even sounds likely that the CPU profile or allocation strategy will affect
+the first sort method. E.g.:
+ - full: no sort needed though we'll use the same as perf so as to enable
+   the maximum possible high-perf threads when #threads is limited
+ - performance: we should probably invert the topology so as to maximize
+   memory bandwidth across multiple sockets, i.e. visit node1.core0 just
+   after node0.core0 etc, and visit their threads later.
+ - bandwidth: that could be the same as the "performance" one above in fact
+ - (low-)latency: better stay local first
+ - balanced: sort by perf then sockets (i.e. P0, P1, E0, E1)
+ - resources: sort on perf first.
+ - etc
+
+The strategy will also help determine the number of threads when it's not fixed
+in the configuration.
+
+Plan:
+ 1) make the profile configurable and implement the sort:
+ - option name? cpu-tuning, cpu-strategy, cpu-policy, cpu-allocation,
+ cpu-selection, cpu-priority, cpu-optimize-for, cpu-prefer, cpu-favor,
+ cpu-profile
+ => cpu-selection
+
+ 2) make the process-wide cpu-map configurable
+ 3) extend cpu-map to make it possible to designate symbolic groups
+ (e.g. "ht0/ht1, node 0, 3*CCD, etc)
+
+Also, offering an option to the user to see how haproxy sees the CPUs and the
+bindings for various profiles would be a nice improvement helping them make
+educated decisions instead of trying blindly.
+
+2025-01-11
+----------
+
+Configuration profile: there are multiple dimensions:
+  - preferences between core types
+ - never use a given cpu type
+ - never use a given cpu location
+
+Better use something like:
+ - ignore-XXX -> never use XXX
+ - avoid-XXX -> prefer not to use XXX
+ - prefer-XXX -> prefer to use XXX
+ - restrict-XXX -> only use XXX
+
+"XXX" could be "single-threaded", "dual-threaded", "first-thread",
+"second-thread", "first-socket", "second-socket", "slowest", "fastest",
+"node-XXX" etc.
+
+We could then have:
+ - cpu-selection restrict-first-socket,ignore-slowest,...
+
+Then some of the keywords could simply be shortcuts for these.
+
+2025-01-30
+----------
+Problem: we need to set the restrictions first to eliminate undesired CPUs,
+ then sort according to the desired preferences so as to pick what
+ is considered the best CPUs. So the preference really looks like
+ a different setting.
+
+More precisely, the final strategy involves multiple criteria. For example,
+let's say that the number of threads is set to 4 and we've restricted ourselves
+to using the first thread of each CPU core. We're on an EPYC74F3, there are 3
+cores per CCX. One algorithm (resource) would create one group with 3 threads
+on the first CCX and 1 group of 1 thread on the next one, then let each of
+these threads bind to all the enabled CPU cores of their respective groups.
+Another algo (performance) would avoid sharing and would want to place one
+thread per CCX, causing the creation of 4 groups of 1 thread each. A third
+algo (balanced) would probably say that 4 threads require 2 CCX hence 2
+groups, thus there should be 2 threads per group, and it would bind 2 threads
+on all cores of the first CCX and the 2 remaining ones on the second.
+
+And if the thread count is not set, these strategies will also do their best
+to figure the optimal count. Resource would probably use 1 core max, moderate
+one CCX max, balanced one node max, performance all of them.
+
+This means that these CPU selection strategies should provide multiple
+functions:
+ - how to sort CPUs
+ - how to count how many is best within imposed rules
+
+The other actions seem to only be static. This also means that "avoid" or
+"prefer" should maybe not be used in the end, even in the sorting algo ?
+
+Or maybe these are just enums or bits in a strategy and all are considered
+at the same time everywhere. For example the thread counting could consider
+the presence of "avoid-XXX" during the operations. But how to codify XXX is
+complicated then.
+
+Maybe a scoring system could work:
+ - default: all CPUs score = 1000
+ - ignore-XXX: foreach(XXX) set score to 0
+ - restrict-XXX: foreach(YYY not XXX), set score to 0
+ - avoid-XXX: foreach(XXX) score *= 0.8
+ - prefer-XXX: foreach(XXX) score *= 1.25
+
+This supports being ignored for up to 30 different reasons before being
+permanently disabled, which is sufficient.
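+
+A minimal sketch of that scoring pass (cpu_matches() is a hypothetical
+membership test and the set names are just examples from above):
+
+  #define NB_CPUS 64
+
+  extern int cpu_matches(int cpu, const char *set);  /* hypothetical */
+
+  unsigned score[NB_CPUS];
+
+  void apply_selection(void)
+  {
+      for (int c = 0; c < NB_CPUS; c++) {
+          score[c] = 1000;                       /* default */
+          if (cpu_matches(c, "second-thread"))   /* ignore-XXX */
+              score[c] = 0;
+          if (!cpu_matches(c, "first-socket"))   /* restrict-XXX */
+              score[c] = 0;
+          if (cpu_matches(c, "slowest"))         /* avoid-XXX */
+              score[c] = score[c] * 8 / 10;      /* *= 0.8 */
+          if (cpu_matches(c, "fastest"))         /* prefer-XXX */
+              score[c] = score[c] * 125 / 100;   /* *= 1.25 */
+      }
+  }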
+
+Then sort according to score, and pick at least min_thr CPUs and continue as
+long as not max_thr or score < 1000 ("avoid"). This gives the thread count. It
+does not permit anything inter-CPU though. E.g. large vs medium vs small cores,
+or sort by locality or frequency. But maybe these ones would use a different
+strategy then and would use the score as a second sorting key (after which
+one?). Or maybe there would be 2 passes, one which avoids <1000 and another
+one which completes up to #min_thr including those <1000, in which case we
+never sort per score.
+
+We can do a bit better to respect the tgrp min/max as well: we can count what
+it implies in terms of number of tgrps (#LLC or clusters) and decide to refrain
+from adding threads which would exceed max_tgrp, but we'd possibly continue to
+add score<1000 CPUs until at least enough threads to reach min_tgrp.
+
+######## new captures ###########
+CIX-P1 / radxa Orion O6 (no topology exported):
+$ ~/haproxy/haproxy -dc -f /dev/null
+grp=[1..12] thr=[1..12]
+first node = 0
+Note: threads already set to 12
+going to start with nbthread=12 nbtgroups=1
+[keep] thr= 0 -> cpu= 0 pk=00 no=-1 di=00 cl=000 ts=000 capa=1024
+[keep] thr= 1 -> cpu= 1 pk=00 no=-1 di=00 cl=000 ts=001 capa=278
+[keep] thr= 2 -> cpu= 2 pk=00 no=-1 di=00 cl=000 ts=002 capa=278
+[keep] thr= 3 -> cpu= 3 pk=00 no=-1 di=00 cl=000 ts=003 capa=278
+[keep] thr= 4 -> cpu= 4 pk=00 no=-1 di=00 cl=000 ts=004 capa=278
+[keep] thr= 5 -> cpu= 5 pk=00 no=-1 di=00 cl=000 ts=005 capa=905
+[keep] thr= 6 -> cpu= 6 pk=00 no=-1 di=00 cl=000 ts=006 capa=905
+[keep] thr= 7 -> cpu= 7 pk=00 no=-1 di=00 cl=000 ts=007 capa=866
+[keep] thr= 8 -> cpu= 8 pk=00 no=-1 di=00 cl=000 ts=008 capa=866
+[keep] thr= 9 -> cpu= 9 pk=00 no=-1 di=00 cl=000 ts=009 capa=984
+[keep] thr= 10 -> cpu= 10 pk=00 no=-1 di=00 cl=000 ts=010 capa=984
+[keep] thr= 11 -> cpu= 11 pk=00 no=-1 di=00 cl=000 ts=011 capa=1024
+########
+
+2025-02-25 - clarification on the configuration
+-----------------------------------------------
+
+The "two dimensions" above can in fact be summarized like this:
+
+ - exposing the ability for the user to perform the same as "taskset",
+ i.e. restrict the usage to a static subset of the CPUs. We could then
+ have "cpu-set only-node0", "0-39", "ignore-smt1", "ignore-little", etc.
+ => the user defines precise sets to be kept/evicted.
+
+ - then letting the user express what they want to do with the remaining
+ cores. This is a strategy/policy that is used to:
+ - count the optimal number of threads (when not forced), also keeping
+ in mind that it cannot be more than 32/64 * maxtgroups if set.
+ - sort CPUs by order of preference (for when threads are forced or
+ a thread-hard-limit is set).
+
+   It can partially overlap with the first one. For example, the default
+   strategy could be to focus on a single node. If the user has limited the
+ usage to cores of both nodes, the policy could still further limit this.
+ But this time it should only be a matter of sorting and preference, i.e.
+ nbthread and cpuset are respected. If a policy prefers the node with more
+ cores first, it will sort them according to this, and its algorithm for
+ counting cores will only be used if nbthread is not set, otherwise it may
+ very well end up on two nodes to respect the user's choice.
+
+And once all of this is done, thread groups should be formed based on the
+remaining topology. Similarly, if the number of tgroups is not set, the
+algorithm must try to propose one based on the topology and the maxtgroups
+setting (i.e. find a divider of the #LLC that's lower than or equal to
+maxtgroups), otherwise the configured number of tgroups is respected. Then
+the number of LLCs will be divided by this number of tgroups, and as many
+threads as enabled CPUs of each LLC will be assigned to these respective
+groups.
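+
+A minimal sketch of that divisor search (hypothetical function name):
+
+  /* largest divisor of nb_llc not exceeding maxtgroups, 1 as fallback */
+  static int pick_tgroups(int nb_llc, int maxtgroups)
+  {
+      for (int d = maxtgroups; d > 1; d--)
+          if (nb_llc % d == 0)
+              return d;
+      return 1;
+  }
+
+E.g. 8 LLCs with maxtgroups=4 gives 4 groups of 2 LLCs each, while 3 LLCs
+with maxtgroups=2 falls back to a single group.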
+
+In the end we should have groups bound to cpu sets, and threads belonging
+to groups mapped to all accessible cpus of these groups.
+
+Note: clusters may be finer than LLCs because they could report finer
+information. We could have a big and a medium cluster share the same L3
+for example. However not all boards report their cluster number (see CIX-P1
+above), though the capacity info still allows us to figure that out and
+should probably be used for it. At this point it would seem logical to say
+that the cluster number is re-adjusted based on the claimed capacity, at
+least to avoid accidentally mixing workloads on heterogeneous cores. But
+sorting by cluster number might not necessarily work if numbers are allocated
+randomly. So we might need a distinct metric that doesn't require overriding the
+system's numbering, like a "set", "group", "team", "bond", "bunch", "club",
+"band", ... that would be first sorted based on LLC (and no finer), and
+second based on capacity, then on L2 etc. This way we should be able to
+respect topology when forming groups.
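+
+A sketch of that sort as a qsort() comparator (the struct fields are
+assumptions reflecting the ordering described above):
+
+  #include <stdlib.h>
+
+  struct cpu_topo { int llc_id, capa, l2_id; };
+
+  static int cmp_cpu(const void *pa, const void *pb)
+  {
+      const struct cpu_topo *a = pa, *b = pb;
+
+      if (a->llc_id != b->llc_id)
+          return a->llc_id - b->llc_id;   /* LLC first (and no finer) */
+      if (a->capa != b->capa)
+          return b->capa - a->capa;       /* then bigger cores first */
+      return a->l2_id - b->l2_id;         /* then L2 instance */
+  }
+  /* usage: qsort(cpus, nbcpus, sizeof(*cpus), cmp_cpu); */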
+
+Note: We need to consider as LLC a level which has more than one core!
+      Otherwise the LLC is assumed to exist and be unique/shared, just not reported.
+ => maybe this should be done very early when counting CPUs ?
+ We need to store the LLC level somewhere in the topo.