gh-146527: Heap-allocate gc_stats to avoid bloating PyInterpreterState (#148057)
The gc_stats struct contains ring buffers of gc_generation_stats
entries (11 young + 3×2 old on default builds). Embedding it inline
in _gc_runtime_state, which is itself inline in PyInterpreterState,
pushed fields like _gil.locked and threads.head to offsets beyond
what out-of-process profilers and debuggers can reasonably read in
a single buffer (e.g. offset 9384 for _gil.locked vs an 8 KiB read
buffer).
Heap-allocate generation_stats via PyMem_RawCalloc in _PyGC_Init and
free it in _PyGC_Fini. This shrinks PyInterpreterState by ~1.6 KiB
and keeps the GIL, thread-list, and other frequently-inspected fields
at stable, low offsets.