drm/xe: Convert GT stats to per-cpu counters
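
GT statistics are currently kept in atomic64_t counters. Every atomic
update needs exclusive ownership of the counter's cache line, so
concurrent updates from several CPUs bounce that line between cores.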

Switch to per-cpu counters allocated with alloc_percpu(). Stats are then
incremented with this_cpu_add(), which on x86 compiles to a single
instruction without a lock prefix. Hot-path updates stay local to the
CPU and avoid cross-core cache invalidation traffic.

Walk all possible CPUs with for_each_possible_cpu() when aggregating
and clearing, so counts accumulated on a CPU that has since gone
offline are still summed and reset.
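
For illustration only, the scheme can be modelled in plain userspace C.
This is a sketch, not the driver code: the stats_model_* names and the
MODEL_* sizes are hypothetical, and the CPU index is passed explicitly,
whereas this_cpu_add() selects the current CPU's slot implicitly.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_CPUS   4  /* hypothetical: number of possible CPUs */
#define MODEL_NR_STATS  3  /* hypothetical: number of stat IDs */

/* One private row of counters per CPU, like an alloc_percpu() area. */
struct stats_model {
	uint64_t (*percpu)[MODEL_NR_STATS];
};

static int stats_model_init(struct stats_model *s)
{
	s->percpu = calloc(MODEL_NR_CPUS, sizeof(*s->percpu));
	return s->percpu ? 0 : -1;
}

/* Hot path: touch only the given CPU's row, no atomic operation. */
static void stats_model_add(struct stats_model *s, int cpu, int id,
			    uint64_t incr)
{
	s->percpu[cpu][id] += incr;
}

/* Slow path: sum every possible CPU's row, online or not. */
static uint64_t stats_model_read(const struct stats_model *s, int id)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += s->percpu[cpu][id];
	return sum;
}

/* Clear walks the same full set, so no stale row survives a reset. */
static void stats_model_clear(struct stats_model *s)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		for (int id = 0; id < MODEL_NR_STATS; id++)
			s->percpu[cpu][id] = 0;
}

static void stats_model_fini(struct stats_model *s)
{
	free(s->percpu);
	s->percpu = NULL;
}
```

In this model the cost moves from the writer to the reader: an update
touches one private slot, while a read or clear visits every slot. The
sum is also not an atomic snapshot, since updates that race with a read
may or may not be included.
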
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20260217200552.596718-1-matthew.brost@intel.com