Commit | Line | Data |
---|---|---|
088b31d9 GKH |
1 | From ce02ef06fcf7a399a6276adb83f37373d10cbbe1 Mon Sep 17 00:00:00 2001 |
2 | From: Daniel Borkmann <daniel@iogearbox.net> | |
3 | Date: Thu, 21 Feb 2019 23:19:41 +0100 | |
4 | Subject: x86, retpolines: Raise limit for generating indirect calls from switch-case | |
5 | MIME-Version: 1.0 | |
6 | Content-Type: text/plain; charset=UTF-8 | |
7 | Content-Transfer-Encoding: 8bit | |
8 | ||
9 | From: Daniel Borkmann <daniel@iogearbox.net> | |
10 | ||
11 | commit ce02ef06fcf7a399a6276adb83f37373d10cbbe1 upstream. | |
12 | ||
13 | On the networking side, there have been numerous attempts to get rid of | |
14 | indirect calls in the fast path wherever feasible in order to avoid the | |
15 | cost of retpolines, for example, just to name a few: | |
16 | ||
17 | * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin") | |
18 | * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer") | |
19 | * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer") | |
20 | * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct") | |
21 | * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps") | |
22 | * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions") | |
23 | [...] | |
24 | ||
25 | Recent work on XDP from Björn and Magnus additionally found that manually | |
26 | transforming the XDP return code switch statement with more than 5 cases | |
27 | into an if-else combination would result in a considerable speedup in the | |
28 | XDP layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled | |
29 | builds. On the i40e driver with an XDP prog attached, a 20-26% speedup has | |
30 | been observed [0]. Aside from XDP, there are many other places later in | |
31 | the networking stack's critical path with similar switch-case processing. | |
32 | Rather than fixing every XDP-enabled driver and every such location in the | |
33 | stack by hand, it would be better to raise the limit at which gcc emits | |
34 | expensive indirect calls from the switch under retpolines, and to stick | |
35 | with the default as-is for !retpoline configured kernels. This also has | |
36 | the advantage that for archs where this is not necessary, we let the | |
37 | compiler select the underlying target optimization for these constructs | |
38 | and avoid potential slow-downs from an if-else hand-rewrite. | |
39 | ||
40 | In case of gcc, this setting is controlled by case-values-threshold, which | |
41 | has an architecture-global default of 4 or 5 (the latter if the target | |
42 | does not have a case insn that compares the bounds). Some arch back ends | |
43 | like arm64 or s390 override it with their own target hooks; for example, | |
44 | in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect | |
45 | branches") the threshold pretty much disables jump tables with a limit of | |
46 | 20 under retpoline builds. Comparing gcc's and clang's default code | |
47 | generation on x86-64 at the O2 level with a retpoline build results in the | |
48 | following outcome for 5 switch cases: | |
49 | ||
50 | * gcc with -mindirect-branch=thunk-inline -mindirect-branch-register: | |
51 | ||
52 | # gdb -batch -ex 'disassemble dispatch' ./c-switch | |
53 | Dump of assembler code for function dispatch: | |
54 | 0x0000000000400be0 <+0>: cmp $0x4,%edi | |
55 | 0x0000000000400be3 <+3>: ja 0x400c35 <dispatch+85> | |
56 | 0x0000000000400be5 <+5>: lea 0x915f8(%rip),%rdx # 0x4921e4 | |
57 | 0x0000000000400bec <+12>: mov %edi,%edi | |
58 | 0x0000000000400bee <+14>: movslq (%rdx,%rdi,4),%rax | |
59 | 0x0000000000400bf2 <+18>: add %rdx,%rax | |
60 | 0x0000000000400bf5 <+21>: callq 0x400c01 <dispatch+33> | |
61 | 0x0000000000400bfa <+26>: pause | |
62 | 0x0000000000400bfc <+28>: lfence | |
63 | 0x0000000000400bff <+31>: jmp 0x400bfa <dispatch+26> | |
64 | 0x0000000000400c01 <+33>: mov %rax,(%rsp) | |
65 | 0x0000000000400c05 <+37>: retq | |
66 | 0x0000000000400c06 <+38>: nopw %cs:0x0(%rax,%rax,1) | |
67 | 0x0000000000400c10 <+48>: jmpq 0x400c90 <fn_3> | |
68 | 0x0000000000400c15 <+53>: nopl (%rax) | |
69 | 0x0000000000400c18 <+56>: jmpq 0x400c70 <fn_2> | |
70 | 0x0000000000400c1d <+61>: nopl (%rax) | |
71 | 0x0000000000400c20 <+64>: jmpq 0x400c50 <fn_1> | |
72 | 0x0000000000400c25 <+69>: nopl (%rax) | |
73 | 0x0000000000400c28 <+72>: jmpq 0x400c40 <fn_0> | |
74 | 0x0000000000400c2d <+77>: nopl (%rax) | |
75 | 0x0000000000400c30 <+80>: jmpq 0x400cb0 <fn_4> | |
76 | 0x0000000000400c35 <+85>: push %rax | |
77 | 0x0000000000400c36 <+86>: callq 0x40dd80 <abort> | |
78 | End of assembler dump. | |
79 | ||
80 | * clang with -mretpoline emitting search tree: | |
81 | ||
82 | # gdb -batch -ex 'disassemble dispatch' ./c-switch | |
83 | Dump of assembler code for function dispatch: | |
84 | 0x0000000000400b30 <+0>: cmp $0x1,%edi | |
85 | 0x0000000000400b33 <+3>: jle 0x400b44 <dispatch+20> | |
86 | 0x0000000000400b35 <+5>: cmp $0x2,%edi | |
87 | 0x0000000000400b38 <+8>: je 0x400b4d <dispatch+29> | |
88 | 0x0000000000400b3a <+10>: cmp $0x3,%edi | |
89 | 0x0000000000400b3d <+13>: jne 0x400b52 <dispatch+34> | |
90 | 0x0000000000400b3f <+15>: jmpq 0x400c50 <fn_3> | |
91 | 0x0000000000400b44 <+20>: test %edi,%edi | |
92 | 0x0000000000400b46 <+22>: jne 0x400b5c <dispatch+44> | |
93 | 0x0000000000400b48 <+24>: jmpq 0x400c20 <fn_0> | |
94 | 0x0000000000400b4d <+29>: jmpq 0x400c40 <fn_2> | |
95 | 0x0000000000400b52 <+34>: cmp $0x4,%edi | |
96 | 0x0000000000400b55 <+37>: jne 0x400b66 <dispatch+54> | |
97 | 0x0000000000400b57 <+39>: jmpq 0x400c60 <fn_4> | |
98 | 0x0000000000400b5c <+44>: cmp $0x1,%edi | |
99 | 0x0000000000400b5f <+47>: jne 0x400b66 <dispatch+54> | |
100 | 0x0000000000400b61 <+49>: jmpq 0x400c30 <fn_1> | |
101 | 0x0000000000400b66 <+54>: push %rax | |
102 | 0x0000000000400b67 <+55>: callq 0x40dd20 <abort> | |
103 | End of assembler dump. | |
104 | ||
105 | For sake of comparison, clang without -mretpoline: | |
106 | ||
107 | # gdb -batch -ex 'disassemble dispatch' ./c-switch | |
108 | Dump of assembler code for function dispatch: | |
109 | 0x0000000000400b30 <+0>: cmp $0x4,%edi | |
110 | 0x0000000000400b33 <+3>: ja 0x400b57 <dispatch+39> | |
111 | 0x0000000000400b35 <+5>: mov %edi,%eax | |
112 | 0x0000000000400b37 <+7>: jmpq *0x492148(,%rax,8) | |
113 | 0x0000000000400b3e <+14>: jmpq 0x400bf0 <fn_0> | |
114 | 0x0000000000400b43 <+19>: jmpq 0x400c30 <fn_4> | |
115 | 0x0000000000400b48 <+24>: jmpq 0x400c10 <fn_2> | |
116 | 0x0000000000400b4d <+29>: jmpq 0x400c20 <fn_3> | |
117 | 0x0000000000400b52 <+34>: jmpq 0x400c00 <fn_1> | |
118 | 0x0000000000400b57 <+39>: push %rax | |
119 | 0x0000000000400b58 <+40>: callq 0x40dcf0 <abort> | |
120 | End of assembler dump. | |
121 | ||
122 | Raising the cases to a high number (e.g. 100) will still result in a similar | |
123 | code generation pattern with clang and gcc as above. In other words, clang | |
124 | generally turns off jump table emission by having an extra expansion pass | |
125 | under retpoline build to turn indirectbr instructions from their IR into | |
126 | switch instructions as a built-in -mno-jump-table lowering of a switch (in | |
127 | this case, even if IR input already contained an indirect branch). | |
128 | ||
129 | For gcc, adding --param=case-values-threshold=20 in similar fashion to | |
130 | s390 in order to raise the limit for x86 retpoline enabled builds results | |
131 | in a small vmlinux size increase of only 0.13% (before=18,027,528 | |
132 | after=18,051,192). For clang this option is ignored due to i) not being | |
133 | needed as mentioned and ii) clang not having the above cmdline parameter. | |
134 | Non-retpoline-enabled builds with gcc continue to use the default | |
135 | case-values-threshold setting, so nothing changes here. | |
136 | ||
137 | [0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/ | |
138 | and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track: | |
139 | - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf | |
140 | - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf | |
141 | ||
142 | Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> | |
143 | Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | |
144 | Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> | |
145 | Acked-by: Björn Töpel <bjorn.topel@intel.com> | |
146 | Acked-by: Linus Torvalds <torvalds@linux-foundation.org> | |
147 | Cc: netdev@vger.kernel.org | |
148 | Cc: David S. Miller <davem@davemloft.net> | |
149 | Cc: Magnus Karlsson <magnus.karlsson@intel.com> | |
150 | Cc: Alexei Starovoitov <ast@kernel.org> | |
151 | Cc: Peter Zijlstra <peterz@infradead.org> | |
152 | Cc: David Woodhouse <dwmw2@infradead.org> | |
153 | Cc: Andy Lutomirski <luto@kernel.org> | |
154 | Cc: Borislav Petkov <bp@alien8.de> | |
155 | Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.net | |
156 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
157 | ||
158 | --- | |
159 | arch/x86/Makefile | 5 +++++ | |
160 | 1 file changed, 5 insertions(+) | |
161 | ||
162 | --- a/arch/x86/Makefile | |
163 | +++ b/arch/x86/Makefile | |
164 | @@ -217,6 +217,11 @@ KBUILD_CFLAGS += -fno-asynchronous-unwin | |
165 | # Avoid indirect branches in kernel to deal with Spectre | |
166 | ifdef CONFIG_RETPOLINE | |
167 | KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) | |
168 | + # Additionally, avoid generating expensive indirect jumps which | |
169 | + # are subject to retpolines for small number of switch cases. | |
170 | + # clang turns off jump table generation by default when under | |
171 | + # retpoline builds, however, gcc does not for x86. | |
172 | + KBUILD_CFLAGS += $(call cc-option,--param=case-values-threshold=20) | |
173 | endif | |
174 | ||
175 | archscripts: scripts_basic |
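
The commit message disassembles a `dispatch` function from a test binary called `c-switch`, but the source of that program is not part of the patch. The following is a minimal sketch of what such a test could look like, assuming five trivial handlers and an `abort()` default. The handler bodies, the names `dispatch_switch`, `dispatch_ifelse`, and `check_equivalence`, and the argument type are all hypothetical; only `fn_0` through `fn_4` and `abort` appear in the dumps. It contrasts the dense switch, which a compiler may lower to a jump table, with the hand-written if-else chain that the patch makes unnecessary.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical handlers standing in for the fn_0..fn_4 symbols in the dumps. */
static int fn_0(void) { return 0; }
static int fn_1(void) { return 1; }
static int fn_2(void) { return 2; }
static int fn_3(void) { return 3; }
static int fn_4(void) { return 4; }

/* Dense 5-case switch: the shape a compiler may lower to a jump table,
 * which becomes an indirect jump (and thus a retpoline) in such builds. */
int dispatch_switch(unsigned int v)
{
	switch (v) {
	case 0: return fn_0();
	case 1: return fn_1();
	case 2: return fn_2();
	case 3: return fn_3();
	case 4: return fn_4();
	default: abort();
	}
}

/* Hand-written compare chain: only direct branches, no jump table. */
int dispatch_ifelse(unsigned int v)
{
	if (v == 0)
		return fn_0();
	else if (v == 1)
		return fn_1();
	else if (v == 2)
		return fn_2();
	else if (v == 3)
		return fn_3();
	else if (v == 4)
		return fn_4();
	abort();
}

/* Both variants must agree for every valid case value. */
void check_equivalence(void)
{
	for (unsigned int v = 0; v < 5; v++)
		assert(dispatch_switch(v) == dispatch_ifelse(v));
}
```

Building a file like this with the gcc options named in the commit message, once with and once without `--param=case-values-threshold=20`, and then disassembling `dispatch_switch` is one way to check which lowering a given compiler version picks. The exact output depends on the compiler version and is not guaranteed to match the dumps above.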