From 0e3b74e26280f2cf8753717a950b97d424da6046 Mon Sep 17 00:00:00 2001
From: Kim Phillips <kim.phillips@amd.com>
Date: Thu, 2 May 2019 15:29:47 +0000
Subject: perf/x86/amd: Update generic hardware cache events for Family 17h
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Kim Phillips <kim.phillips@amd.com>

commit 0e3b74e26280f2cf8753717a950b97d424da6046 upstream.

Add a new amd_hw_cache_event_ids_f17h assignment structure set
for AMD families 17h and above, since a lot has changed. Specifically:

L1 Data Cache

The data cache access counter remains the same on Family 17h.

For DC misses, PMCx041's definition changes with Family 17h,
so instead we use the L2 cache accesses from L1 data cache
misses counter (PMCx060,umask=0xc8).

For DC hardware prefetch events, Family 17h breaks compatibility
for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
Prefetch DC Fills."
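
For reference, the amd_hw_cache_event_ids_f17h values introduced below
pack the unit mask into bits 15:8 and the event select into bits 7:0 of
each table entry. A minimal sketch of that packing (the AMD_PMC_EVENT()
helper here is illustrative only, not something this patch adds):

    /* Illustrative packing of AMD event select + unit mask, matching
     * the hw_cache_event_ids encoding used in the table below. */
    #define AMD_PMC_EVENT(event, umask)  (((umask) << 8) | (event))

    /* PMCx060, umask 0xc8 -> 0xc860 (L2 cache accesses from DC misses) */
    /* PMCx05a, umask 0xff -> 0xff5a (hardware prefetch DC fills)       */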

L1 Instruction Cache

PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
compatible on Family 17h.

For prefetches, we remove the erroneous PMCx04B assignment which
counts how many software data cache prefetch load instructions were
dispatched.

LL - Last Level Cache

Remove the PMCx07D, 7E, and 7F assignments, as those counters do not
exist on Family 17h, where the last level cache is L3. L3 counters
can be accessed using the existing AMD Uncore driver.

Data TLB

On Intel machines, data TLB accesses ("dTLB-loads") are assigned
to counters that count load/store instructions retired. This
is inconsistent with instruction TLB accesses, where Intel
implementations report iTLB misses that hit in the STLB.

Ideally, dTLB-loads would count higher level dTLB misses that hit
in lower level TLBs, and dTLB-load-misses would report those
that also missed in those lower-level TLBs, therefore causing
a page table walk. That would be consistent with instruction
TLB operation, remove the redundancy between dTLB-loads and
L1-dcache-loads, and prevent perf from producing artificially
low percentage ratios, i.e. the "0.01%" below:

   42,550,869      L1-dcache-loads
   41,591,860      dTLB-loads
        4,802      dTLB-load-misses       # 0.01% of all dTLB cache hits
    7,283,682      L1-dcache-stores
    7,912,392      dTLB-stores
          310      dTLB-store-misses

On AMD Families prior to 17h, the "Data Cache Accesses" counter is
used, which is slightly better than load/store instructions retired,
but still counts in terms of individual load/store operations
instead of TLB operations.

So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
to a counter for L1 dTLB misses that hit in the L2 dTLB, and
"dTLB-load-misses" to a counter for L1 DTLB misses that caused
L2 DTLB misses and therefore also caused page table walks. This
results in a much more accurate view of data TLB performance:

   60,961,781      L1-dcache-loads
        4,601      dTLB-loads
          963      dTLB-load-misses       # 20.93% of all dTLB cache hits

Note that for all AMD families, data loads and stores are combined
in a single accesses counter, so no 'L1-dcache-stores' are reported
separately, and stores are counted with loads in 'L1-dcache-loads'.

Also note that the "% of all dTLB cache hits" string is misleading
because (a) "dTLB cache": although TLBs can be considered caches for
page tables, in this context it can be misread as data cache hits,
since the figures are similar (at least on Intel), and (b) not all
of those loads (more precisely, accesses) actually "hit" at that
hardware level. "% of all dTLB accesses" would be clearer and more
accurate.

Instruction TLB

On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
the STLB and completed a page table walk.

For AMD Family 17h and above, for 'iTLB-loads' we replace the
erroneous instruction cache fetches counter with PMCx084
"L1 ITLB Miss, L2 ITLB Hit".

For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
L2 ITLB Miss", but set a 0xff umask because without it the event
does not get counted.
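
As an illustration of how a user reaches this remapped event through
the generic cache interface, here is a sketch of the standard
perf_event_open(2) attribute encoding for 'iTLB-load-misses'
(user-space illustration, not code added by this patch):

    struct perf_event_attr attr = {
            .type   = PERF_TYPE_HW_CACHE,
            .size   = sizeof(attr),
            /* (cache id) | (op id << 8) | (result id << 16) */
            .config = PERF_COUNT_HW_CACHE_ITLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
    };
    /* With this patch, Family 17h resolves the above to PMCx085,
     * umask 0xff (table entry 0xff85). */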

Branch Predictor (BPU)

PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.

Node Level Events

Family 17h does not have a PMCx0e9 counter, and corresponding counters
have not been made available publicly, so for now, we mark them as
unsupported for Families 17h and above.
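
For context, the 0 and -1 entries in the new table keep their usual
meaning in the generic x86 cache-event lookup: 0 reports the
combination as unsupported and -1 rejects it as invalid. A simplified
sketch of that lookup, modelled loosely on set_ext_hw_attr() in
arch/x86/events/core.c (the helper name below is illustrative; this
patch does not touch that code):

    static int lookup_cache_event(unsigned int type, unsigned int op,
                                  unsigned int result, u64 *config)
    {
            u64 val = hw_cache_event_ids[type][op][result];

            if (val == 0)           /* e.g. the Family 17h NODE reads */
                    return -ENOENT; /* combination not supported */
            if (val == -1)
                    return -EINVAL; /* combination not valid */

            *config |= val;         /* event select + unit mask */
            return 0;
    }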

Reference:

"Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
Released 7/17/2018, Publication #56255, Revision 3.03:
https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf

[ mingo: tidied up the line breaks. ]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Cc: <stable@vger.kernel.org> # v4.9+
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Liška <mliska@suse.cz>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/events/amd/core.c | 111 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 108 insertions(+), 3 deletions(-)

--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -116,6 +116,110 @@ static __initconst const u64 amd_hw_cach
 },
 };
 
+static __initconst const u64 amd_hw_cache_event_ids_f17h
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX] = {
+[C(L1D)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0x0040, /* Data Cache Accesses */
+		[C(RESULT_MISS)] = 0xc860, /* L2$ access from DC Miss */
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = 0xff5a, /* h/w prefetch DC Fills */
+		[C(RESULT_MISS)] = 0,
+	},
+},
+[C(L1I)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0x0080, /* Instruction cache fetches */
+		[C(RESULT_MISS)] = 0x0081, /* Instruction cache misses */
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+},
+[C(LL)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+},
+[C(DTLB)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0xff45, /* All L2 DTLB accesses */
+		[C(RESULT_MISS)] = 0xf045, /* L2 DTLB misses (PT walks) */
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+},
+[C(ITLB)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0x0084, /* L1 ITLB misses, L2 ITLB hits */
+		[C(RESULT_MISS)] = 0xff85, /* L1 ITLB misses, L2 misses */
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+},
+[C(BPU)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0x00c2, /* Retired Branch Instr. */
+		[C(RESULT_MISS)] = 0x00c3, /* Retired Mispredicted BI */
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+},
+[C(NODE)] = {
+	[C(OP_READ)] = {
+		[C(RESULT_ACCESS)] = 0,
+		[C(RESULT_MISS)] = 0,
+	},
+	[C(OP_WRITE)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+	[C(OP_PREFETCH)] = {
+		[C(RESULT_ACCESS)] = -1,
+		[C(RESULT_MISS)] = -1,
+	},
+},
+};
+
 /*
  * AMD Performance Monitor K7 and later, up to and including Family 16h:
  */
@@ -861,9 +965,10 @@ __init int amd_pmu_init(void)
 		x86_pmu.amd_nb_constraints = 0;
 	}
 
-	/* Events are common for all AMDs */
-	memcpy(hw_cache_event_ids, amd_hw_cache_event_ids,
-	       sizeof(hw_cache_event_ids));
+	if (boot_cpu_data.x86 >= 0x17)
+		memcpy(hw_cache_event_ids, amd_hw_cache_event_ids_f17h, sizeof(hw_cache_event_ids));
+	else
+		memcpy(hw_cache_event_ids, amd_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 
 	return 0;
 }