[thirdparty/kernel/stable-queue.git] / releases / 5.0.11 / x86-retpolines-raise-limit-for-generating-indirect-calls-from-switch-case.patch
From ce02ef06fcf7a399a6276adb83f37373d10cbbe1 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 21 Feb 2019 23:19:41 +0100
Subject: x86, retpolines: Raise limit for generating indirect calls from switch-case
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Daniel Borkmann <daniel@iogearbox.net>

commit ce02ef06fcf7a399a6276adb83f37373d10cbbe1 upstream.

On the networking side, there have been numerous attempts to get rid of
indirect calls in the fast path wherever feasible in order to avoid the
cost of retpolines. To name just a few:

 * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
 * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
 * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
 * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
 * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
 * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
 [...]

Recent work on XDP from Björn and Magnus additionally found that manually
transforming the XDP return code switch statement with more than 5 cases
into an if-else chain results in a considerable speedup in the XDP layer,
because it avoids indirect calls in CONFIG_RETPOLINE enabled builds. On the
i40e driver with an XDP prog attached, a 20-26% speedup has been observed
[0]. Aside from XDP, there are many other places later in the networking
stack's critical path with similar switch-case processing. Rather than
fixing every XDP-enabled driver and every such location in the stack by
hand, it would be better to raise the threshold at which gcc starts to emit
expensive indirect calls from a switch under retpolines, and to stick with
the default as-is for kernels configured without retpolines. This also has
the advantage that on archs where this is not necessary, we let the
compiler select the underlying target optimization for these constructs and
avoid potential slow-downs from if-else hand-rewrites.

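To make the hand-rewrite mentioned above concrete, here is a minimal sketch
in C. It is not the i40e or XDP code: the verdict names, the stats struct
and both function names are hypothetical stand-ins chosen for illustration.
The two functions behave identically; the difference is only in how a
compiler is likely to lower them.

```c
/* Hypothetical verdict codes; the names are illustrative only. */
enum verdict { V_ABORTED, V_DROP, V_PASS, V_TX, V_REDIRECT };

/* Hypothetical per-verdict counters. */
struct stats {
	unsigned long aborted, dropped, passed, sent, redirected, unknown;
};

/*
 * Dense switch: the compiler is free to lower this to a jump table,
 * i.e. an indirect jump, which is routed through a retpoline when
 * retpolines are enabled.
 */
void account_switch(struct stats *s, int v)
{
	switch (v) {
	case V_ABORTED:  s->aborted++;    break;
	case V_DROP:     s->dropped++;    break;
	case V_PASS:     s->passed++;     break;
	case V_TX:       s->sent++;       break;
	case V_REDIRECT: s->redirected++; break;
	default:         s->unknown++;    break;
	}
}

/*
 * Hand-written if-else chain with the same behaviour. This is usually
 * compiled to a sequence of compares and direct conditional branches.
 */
void account_ifelse(struct stats *s, int v)
{
	if (v == V_PASS)
		s->passed++;
	else if (v == V_DROP)
		s->dropped++;
	else if (v == V_TX)
		s->sent++;
	else if (v == V_REDIRECT)
		s->redirected++;
	else if (v == V_ABORTED)
		s->aborted++;
	else
		s->unknown++;
}
```

Note that the if-else form also fixes the order in which cases are tested,
which is a choice the compiler would otherwise make per target.
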
In the case of gcc, this setting is controlled by case-values-threshold,
which has an architecture-global default of 4 or 5 (the latter if the
target does not have a case insn that compares the bounds). Some arch back
ends like arm64 or s390 override it with their own target hooks. For
example, in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect
branches") the threshold is raised to 20 under retpoline builds, which
pretty much disables jump tables. Comparing gcc's and clang's default code
generation on x86-64 at -O2 with a retpoline build results in the following
outcome for 5 switch cases:

* gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400be0 <+0>:     cmp    $0x4,%edi
   0x0000000000400be3 <+3>:     ja     0x400c35 <dispatch+85>
   0x0000000000400be5 <+5>:     lea    0x915f8(%rip),%rdx        # 0x4921e4
   0x0000000000400bec <+12>:    mov    %edi,%edi
   0x0000000000400bee <+14>:    movslq (%rdx,%rdi,4),%rax
   0x0000000000400bf2 <+18>:    add    %rdx,%rax
   0x0000000000400bf5 <+21>:    callq  0x400c01 <dispatch+33>
   0x0000000000400bfa <+26>:    pause
   0x0000000000400bfc <+28>:    lfence
   0x0000000000400bff <+31>:    jmp    0x400bfa <dispatch+26>
   0x0000000000400c01 <+33>:    mov    %rax,(%rsp)
   0x0000000000400c05 <+37>:    retq
   0x0000000000400c06 <+38>:    nopw   %cs:0x0(%rax,%rax,1)
   0x0000000000400c10 <+48>:    jmpq   0x400c90 <fn_3>
   0x0000000000400c15 <+53>:    nopl   (%rax)
   0x0000000000400c18 <+56>:    jmpq   0x400c70 <fn_2>
   0x0000000000400c1d <+61>:    nopl   (%rax)
   0x0000000000400c20 <+64>:    jmpq   0x400c50 <fn_1>
   0x0000000000400c25 <+69>:    nopl   (%rax)
   0x0000000000400c28 <+72>:    jmpq   0x400c40 <fn_0>
   0x0000000000400c2d <+77>:    nopl   (%rax)
   0x0000000000400c30 <+80>:    jmpq   0x400cb0 <fn_4>
   0x0000000000400c35 <+85>:    push   %rax
   0x0000000000400c36 <+86>:    callq  0x40dd80 <abort>
  End of assembler dump.

* clang with -mretpoline emitting search tree:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400b30 <+0>:     cmp    $0x1,%edi
   0x0000000000400b33 <+3>:     jle    0x400b44 <dispatch+20>
   0x0000000000400b35 <+5>:     cmp    $0x2,%edi
   0x0000000000400b38 <+8>:     je     0x400b4d <dispatch+29>
   0x0000000000400b3a <+10>:    cmp    $0x3,%edi
   0x0000000000400b3d <+13>:    jne    0x400b52 <dispatch+34>
   0x0000000000400b3f <+15>:    jmpq   0x400c50 <fn_3>
   0x0000000000400b44 <+20>:    test   %edi,%edi
   0x0000000000400b46 <+22>:    jne    0x400b5c <dispatch+44>
   0x0000000000400b48 <+24>:    jmpq   0x400c20 <fn_0>
   0x0000000000400b4d <+29>:    jmpq   0x400c40 <fn_2>
   0x0000000000400b52 <+34>:    cmp    $0x4,%edi
   0x0000000000400b55 <+37>:    jne    0x400b66 <dispatch+54>
   0x0000000000400b57 <+39>:    jmpq   0x400c60 <fn_4>
   0x0000000000400b5c <+44>:    cmp    $0x1,%edi
   0x0000000000400b5f <+47>:    jne    0x400b66 <dispatch+54>
   0x0000000000400b61 <+49>:    jmpq   0x400c30 <fn_1>
   0x0000000000400b66 <+54>:    push   %rax
   0x0000000000400b67 <+55>:    callq  0x40dd20 <abort>
  End of assembler dump.

  For the sake of comparison, clang without -mretpoline:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400b30 <+0>:     cmp    $0x4,%edi
   0x0000000000400b33 <+3>:     ja     0x400b57 <dispatch+39>
   0x0000000000400b35 <+5>:     mov    %edi,%eax
   0x0000000000400b37 <+7>:     jmpq   *0x492148(,%rax,8)
   0x0000000000400b3e <+14>:    jmpq   0x400bf0 <fn_0>
   0x0000000000400b43 <+19>:    jmpq   0x400c30 <fn_4>
   0x0000000000400b48 <+24>:    jmpq   0x400c10 <fn_2>
   0x0000000000400b4d <+29>:    jmpq   0x400c20 <fn_3>
   0x0000000000400b52 <+34>:    jmpq   0x400c00 <fn_1>
   0x0000000000400b57 <+39>:    push   %rax
   0x0000000000400b58 <+40>:    callq  0x40dcf0 <abort>
  End of assembler dump.

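The source of the c-switch test program is not part of this message. A
dispatch function of roughly the following shape would be consistent with
the dumps above: five cases that tail-call fn_0 to fn_4, and abort() for
anything else. This is a sketch under those assumptions only; the handler
bodies, return values and the parameter name are made up for illustration.

```c
#include <stdlib.h>

/* Placeholder handlers; noinline keeps them as separate call targets. */
__attribute__((noinline)) int fn_0(void) { return 10; }
__attribute__((noinline)) int fn_1(void) { return 11; }
__attribute__((noinline)) int fn_2(void) { return 12; }
__attribute__((noinline)) int fn_3(void) { return 13; }
__attribute__((noinline)) int fn_4(void) { return 14; }

/* Five dense cases, each ending in a call the compiler can tail-call. */
int dispatch(int nr)
{
	switch (nr) {
	case 0: return fn_0();
	case 1: return fn_1();
	case 2: return fn_2();
	case 3: return fn_3();
	case 4: return fn_4();
	default: abort();
	}
}
```

Building such a file at -O2 with the gcc or clang options quoted above and
disassembling dispatch is one way to check how a given compiler version
lowers the switch; the exact instructions will differ between versions.
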
Raising the number of cases to a high value (e.g. 100) still results in a
similar code generation pattern with clang and gcc as above. In other
words, clang generally turns off jump table emission under retpoline
builds: an extra expansion pass turns indirectbr instructions from its IR
into switch instructions, acting as a built-in -fno-jump-tables lowering of
a switch (in this case, even if the IR input already contained an indirect
branch).

For gcc, adding --param=case-values-threshold=20, in similar fashion to
s390, in order to raise the limit for x86 retpoline enabled builds results
in a small vmlinux size increase of only 0.13% (before=18,027,528
after=18,051,192). For clang this option is ignored since i) it is not
needed, as mentioned, and ii) clang does not have the above cmdline
parameter. Non-retpoline-enabled builds with gcc continue to use the
default case-values-threshold setting, so nothing changes there.

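The 0.13% figure follows directly from the two vmlinux sizes quoted above.
As a small worked check (the helper name is hypothetical):

```c
/* Relative size increase in percent for two sizes in the same unit. */
double size_increase_pct(unsigned long before, unsigned long after)
{
	return 100.0 * (double)(after - before) / (double)before;
}
```

With before=18027528 and after=18051192 the difference is 23,664, which
gives roughly 0.131%.
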
[0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/
    and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
  - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
  - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/Makefile |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -217,6 +217,11 @@ KBUILD_CFLAGS += -fno-asynchronous-unwin
 # Avoid indirect branches in kernel to deal with Spectre
 ifdef CONFIG_RETPOLINE
   KBUILD_CFLAGS += $(RETPOLINE_CFLAGS)
+  # Additionally, avoid generating expensive indirect jumps which
+  # are subject to retpolines for small number of switch cases.
+  # clang turns off jump table generation by default when under
+  # retpoline builds, however, gcc does not for x86.
+  KBUILD_CFLAGS += $(call cc-option,--param=case-values-threshold=20)
 endif
 
 archscripts: scripts_basic