[thirdparty/kernel/stable-queue.git] / releases / 6.6.26 / i40e-enforce-software-interrupt-during-busy-poll-exi.patch
1 From 0be173a5b3d5ff5e347c7d981627fc71640129ed Mon Sep 17 00:00:00 2001
2 From: Sasha Levin <sashal@kernel.org>
3 Date: Sat, 16 Mar 2024 12:38:29 +0100
4 Subject: i40e: Enforce software interrupt during busy-poll exit
5
6 From: Ivan Vecera <ivecera@redhat.com>
7
8 [ Upstream commit ea558de7238bb12c3435c47f0631e9d17bf4a09f ]
9
10 As with the ice bug fixed by commit b7306b42beaf ("ice: manage interrupts
11 during poll exit") followed by commit 23be7075b318 ("ice: fix software
12 generating extra interrupts"), I'm seeing a similar issue with the
13 i40e driver.
14
15 In certain situations, when busy-loop is enabled together with adaptive
16 coalescing, the driver occasionally misses that there are outstanding
17 descriptors to clean when exiting busy poll.
18
19 Try to catch the remaining work by triggering a software interrupt
20 when exiting busy poll. No extra interrupts will be generated when
21 busy polling is not used.
22
23 The issue was found when running the sockperf ping-pong TCP test with
24 adaptive coalescing and busy poll enabled (value 50 for the busy_poll
25 and busy_read sysctl knobs); it results in huge latency spikes
26 of more than 100000us.
27
28 The fix is inspired by the ice driver and does the following:
29 1) During napi poll exit in case of busy-poll (napi_complete_done()
30 returns false), records in the q_vector that we were in busy
31 loop.
32 2) Extends i40e_buildreg_itr() to be able to add an enforced software
33 interrupt into the built value.
34 3) In i40e_update_enable_itr(), enforces a software interrupt trigger
35 if we are exiting busy poll to catch any pending clean-ups.
36 4) Reuses the unused 3rd ITR (interrupt throttle) index and sets it to
37 20K interrupts per second to limit the number of these sw interrupts.
38
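The steps above can be sketched in plain userspace C. This is only an illustration, not the driver code: all sk_* names are hypothetical, the INTENA (bit 0) and SWINT_TRIG (bit 2) positions and the ITR index values are assumptions not shown in this patch, and the remaining shifts and masks are copied from the i40e_register.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register field layout used by this sketch.
 * ITR_INDX, INTERVAL, SW_ITR_INDX_ENA and SW_ITR_INDX follow the patch;
 * INTENA (bit 0) and SWINT_TRIG (bit 2) are assumed positions.
 */
#define SK_INTENA            (1u << 0)
#define SK_SWINT_TRIG        (1u << 2)
#define SK_ITR_INDX_SHIFT    3
#define SK_INTERVAL_SHIFT    5
#define SK_INTERVAL_MAX      0xFFFu
#define SK_SW_ITR_INDX_ENA   (1u << 24)
#define SK_SW_ITR_INDX_SHIFT 25

/* ITR indexes; the numeric values are assumptions for this sketch. */
enum sk_itr_idx { SK_RX_ITR = 0, SK_TX_ITR = 1, SK_SW_ITR = 2, SK_ITR_NONE = 3 };

/* Hypothetical stand-in for the per-vector state (step 1). */
struct sk_q_vector {
	bool in_busy_poll;
};

/* Step 2: build the control value, optionally forcing a software interrupt
 * that is throttled by the third ITR index (step 4).
 */
static uint32_t sk_buildreg_itr(enum sk_itr_idx idx, uint16_t interval_usecs,
				bool force_swint)
{
	uint32_t val;

	assert((unsigned int)idx <= SK_ITR_NONE);

	/* The interval is given in usecs; the register counts 2 usec units. */
	val = SK_INTENA |
	      ((uint32_t)idx << SK_ITR_INDX_SHIFT) |
	      ((((uint32_t)interval_usecs >> 1) & SK_INTERVAL_MAX)
	       << SK_INTERVAL_SHIFT);

	if (force_swint)
		val |= SK_SWINT_TRIG | SK_SW_ITR_INDX_ENA |
		       ((uint32_t)SK_SW_ITR << SK_SW_ITR_INDX_SHIFT);

	return val;
}

/* Steps 1 and 3: remember a busy-poll exit, then force one software
 * interrupt on the next normal completion. Returns true when a register
 * value was produced in *reg_val.
 */
static bool sk_poll_exit(struct sk_q_vector *qv, bool napi_completed,
			 uint32_t *reg_val)
{
	if (!napi_completed) {
		qv->in_busy_poll = true;
		return false;
	}

	*reg_val = sk_buildreg_itr(SK_ITR_NONE, 0, qv->in_busy_poll);
	qv->in_busy_poll = false;
	return true;
}
```

Under these assumptions, a completion that follows a busy-poll exit yields a value with the software trigger bits set, while later completions only re-enable the vector interrupt, which matches the claim that no extra interrupts are generated when busy polling is not used.
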
39 Test results
40 ============
41 Prior:
42 [root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120
43 sockperf: == version #3.10-no.git ==
44 sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
45
46 [ 0] IP = 10.9.9.1 PORT = 11111 # TCP
47 sockperf: Warmup stage (sending a few dummy messages)...
48 sockperf: Starting test...
49 sockperf: Test end (interrupted by timer)
50 sockperf: Test ended
51 sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2438563; ReceivedMessages=2438562
52 sockperf: ========= Printing statistics for Server No: 0
53 sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2429473; ReceivedMessages=2429473
54 sockperf: ====> avg-latency=24.571 (std-dev=93.297, mean-ad=4.904, median-ad=1.510, siqr=1.063, cv=3.797, std-error=0.060, 99.0% ci=[24.417, 24.725])
55 sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
56 sockperf: Summary: Latency is 24.571 usec
57 sockperf: Total 2429473 observations; each percentile contains 24294.73 observations
58 sockperf: ---> <MAX> observation = 103294.331
59 sockperf: ---> percentile 99.999 = 45.633
60 sockperf: ---> percentile 99.990 = 37.013
61 sockperf: ---> percentile 99.900 = 35.910
62 sockperf: ---> percentile 99.000 = 33.390
63 sockperf: ---> percentile 90.000 = 28.626
64 sockperf: ---> percentile 75.000 = 27.741
65 sockperf: ---> percentile 50.000 = 26.743
66 sockperf: ---> percentile 25.000 = 25.614
67 sockperf: ---> <MIN> observation = 12.220
68
69 After:
70 [root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120
71 sockperf: == version #3.10-no.git ==
72 sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
73
74 [ 0] IP = 10.9.9.1 PORT = 11111 # TCP
75 sockperf: Warmup stage (sending a few dummy messages)...
76 sockperf: Starting test...
77 sockperf: Test end (interrupted by timer)
78 sockperf: Test ended
79 sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2400055; ReceivedMessages=2400054
80 sockperf: ========= Printing statistics for Server No: 0
81 sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2391186; ReceivedMessages=2391186
82 sockperf: ====> avg-latency=24.965 (std-dev=5.934, mean-ad=4.642, median-ad=1.485, siqr=1.067, cv=0.238, std-error=0.004, 99.0% ci=[24.955, 24.975])
83 sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
84 sockperf: Summary: Latency is 24.965 usec
85 sockperf: Total 2391186 observations; each percentile contains 23911.86 observations
86 sockperf: ---> <MAX> observation = 195.841
87 sockperf: ---> percentile 99.999 = 45.026
88 sockperf: ---> percentile 99.990 = 39.009
89 sockperf: ---> percentile 99.900 = 35.922
90 sockperf: ---> percentile 99.000 = 33.482
91 sockperf: ---> percentile 90.000 = 28.902
92 sockperf: ---> percentile 75.000 = 27.821
93 sockperf: ---> percentile 50.000 = 26.860
94 sockperf: ---> percentile 25.000 = 25.685
95 sockperf: ---> <MIN> observation = 12.277
96
97 Fixes: 0bcd952feec7 ("ethernet/intel: consolidate NAPI and NAPI exit")
98 Reported-by: Hugo Ferreira <hferreir@redhat.com>
99 Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
100 Signed-off-by: Ivan Vecera <ivecera@redhat.com>
101 Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
102 Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
103 Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
104 Signed-off-by: Sasha Levin <sashal@kernel.org>
105 ---
106 drivers/net/ethernet/intel/i40e/i40e.h | 1 +
107 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 ++
108 .../net/ethernet/intel/i40e/i40e_register.h | 3 +
109 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 82 ++++++++++++++-----
110 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 1 +
111 5 files changed, 72 insertions(+), 21 deletions(-)
112
113 diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
114 index bc353da3ed41d..3cc0b87def3fa 100644
115 --- a/drivers/net/ethernet/intel/i40e/i40e.h
116 +++ b/drivers/net/ethernet/intel/i40e/i40e.h
117 @@ -992,6 +992,7 @@ struct i40e_q_vector {
118 struct rcu_head rcu; /* to avoid race with update stats on free */
119 char name[I40E_INT_NAME_STR_LEN];
120 bool arm_wb_state;
121 + bool in_busy_poll;
122 int irq_num; /* IRQ assigned to this q_vector */
123 } ____cacheline_internodealigned_in_smp;
124
125 diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
126 index fd4e86b6b4c1f..8bfecf81d26f6 100644
127 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
128 +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
129 @@ -3908,6 +3908,12 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
130 q_vector->tx.target_itr >> 1);
131 q_vector->tx.current_itr = q_vector->tx.target_itr;
132
133 + /* Set ITR for software interrupts triggered after exiting
134 + * busy-loop polling.
135 + */
136 + wr32(hw, I40E_PFINT_ITRN(I40E_SW_ITR, vector - 1),
137 + I40E_ITR_20K);
138 +
139 wr32(hw, I40E_PFINT_RATEN(vector - 1),
140 i40e_intrl_usec_to_reg(vsi->int_rate_limit));
141
142 diff --git a/drivers/net/ethernet/intel/i40e/i40e_register.h b/drivers/net/ethernet/intel/i40e/i40e_register.h
143 index 7339003aa17cd..694cb3e45c1ec 100644
144 --- a/drivers/net/ethernet/intel/i40e/i40e_register.h
145 +++ b/drivers/net/ethernet/intel/i40e/i40e_register.h
146 @@ -328,8 +328,11 @@
147 #define I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT 3
148 #define I40E_PFINT_DYN_CTLN_ITR_INDX_MASK I40E_MASK(0x3, I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT)
149 #define I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT 5
150 +#define I40E_PFINT_DYN_CTLN_INTERVAL_MASK I40E_MASK(0xFFF, I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT)
151 #define I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_SHIFT 24
152 #define I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_MASK I40E_MASK(0x1, I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_SHIFT)
153 +#define I40E_PFINT_DYN_CTLN_SW_ITR_INDX_SHIFT 25
154 +#define I40E_PFINT_DYN_CTLN_SW_ITR_INDX_MASK I40E_MASK(0x3, I40E_PFINT_DYN_CTLN_SW_ITR_INDX_SHIFT)
155 #define I40E_PFINT_ICR0 0x00038780 /* Reset: CORER */
156 #define I40E_PFINT_ICR0_INTEVENT_SHIFT 0
157 #define I40E_PFINT_ICR0_INTEVENT_MASK I40E_MASK(0x1, I40E_PFINT_ICR0_INTEVENT_SHIFT)
158 diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
159 index 1df2f93388128..f703646622d9a 100644
160 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
161 +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
162 @@ -2644,7 +2644,22 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget,
163 return failure ? budget : (int)total_rx_packets;
164 }
165
166 -static inline u32 i40e_buildreg_itr(const int type, u16 itr)
167 +/**
168 + * i40e_buildreg_itr - build a value for writing to I40E_PFINT_DYN_CTLN register
169 + * @itr_idx: interrupt throttling index
170 + * @interval: interrupt throttling interval value in usecs
171 + * @force_swint: force software interrupt
172 + *
173 + * The function builds a value for I40E_PFINT_DYN_CTLN register that
174 + * is used to update interrupt throttling interval for specified ITR index
175 + * and optionally enforces a software interrupt. If the @itr_idx is equal
176 + * to I40E_ITR_NONE then no interval change is applied and only @force_swint
177 + * parameter is taken into account. If the interval change and enforced
178 + * software interrupt are not requested then the built value just enables
179 + * appropriate vector interrupt.
180 + **/
181 +static u32 i40e_buildreg_itr(enum i40e_dyn_idx itr_idx, u16 interval,
182 + bool force_swint)
183 {
184 u32 val;
185
186 @@ -2658,23 +2673,33 @@ static inline u32 i40e_buildreg_itr(const int type, u16 itr)
187 * an event in the PBA anyway so we need to rely on the automask
188 * to hold pending events for us until the interrupt is re-enabled
189 *
190 - * The itr value is reported in microseconds, and the register
191 - * value is recorded in 2 microsecond units. For this reason we
192 - * only need to shift by the interval shift - 1 instead of the
193 - * full value.
194 + * We have to shift the given value as it is reported in microseconds
195 + * and the register value is recorded in 2 microsecond units.
196 */
197 - itr &= I40E_ITR_MASK;
198 + interval >>= 1;
199
200 + /* 1. Enable vector interrupt
201 + * 2. Update the interval for the specified ITR index
202 + * (I40E_ITR_NONE in the register is used to indicate that
203 + * no interval update is requested)
204 + */
205 val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
206 - (type << I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) |
207 - (itr << (I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT - 1));
208 + FIELD_PREP(I40E_PFINT_DYN_CTLN_ITR_INDX_MASK, itr_idx) |
209 + FIELD_PREP(I40E_PFINT_DYN_CTLN_INTERVAL_MASK, interval);
210 +
211 + /* 3. Enforce software interrupt trigger if requested
212 + * (The rate of these software interrupts is limited by ITR2, which is
213 + * set to 20K interrupts per second)
214 + */
215 + if (force_swint)
216 + val |= I40E_PFINT_DYN_CTLN_SWINT_TRIG_MASK |
217 + I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_MASK |
218 + FIELD_PREP(I40E_PFINT_DYN_CTLN_SW_ITR_INDX_MASK,
219 + I40E_SW_ITR);
220
221 return val;
222 }
223
224 -/* a small macro to shorten up some long lines */
225 -#define INTREG I40E_PFINT_DYN_CTLN
226 -
227 /* The act of updating the ITR will cause it to immediately trigger. In order
228 * to prevent this from throwing off adaptive update statistics we defer the
229 * update so that it can only happen so often. So after either Tx or Rx are
230 @@ -2693,8 +2718,10 @@ static inline u32 i40e_buildreg_itr(const int type, u16 itr)
231 static inline void i40e_update_enable_itr(struct i40e_vsi *vsi,
232 struct i40e_q_vector *q_vector)
233 {
234 + enum i40e_dyn_idx itr_idx = I40E_ITR_NONE;
235 struct i40e_hw *hw = &vsi->back->hw;
236 - u32 intval;
237 + u16 interval = 0;
238 + u32 itr_val;
239
240 /* If we don't have MSIX, then we only need to re-enable icr0 */
241 if (!(vsi->back->flags & I40E_FLAG_MSIX_ENABLED)) {
242 @@ -2716,8 +2743,8 @@ static inline void i40e_update_enable_itr(struct i40e_vsi *vsi,
243 */
244 if (q_vector->rx.target_itr < q_vector->rx.current_itr) {
245 /* Rx ITR needs to be reduced, this is highest priority */
246 - intval = i40e_buildreg_itr(I40E_RX_ITR,
247 - q_vector->rx.target_itr);
248 + itr_idx = I40E_RX_ITR;
249 + interval = q_vector->rx.target_itr;
250 q_vector->rx.current_itr = q_vector->rx.target_itr;
251 q_vector->itr_countdown = ITR_COUNTDOWN_START;
252 } else if ((q_vector->tx.target_itr < q_vector->tx.current_itr) ||
253 @@ -2726,25 +2753,36 @@ static inline void i40e_update_enable_itr(struct i40e_vsi *vsi,
254 /* Tx ITR needs to be reduced, this is second priority
255 * Tx ITR needs to be increased more than Rx, fourth priority
256 */
257 - intval = i40e_buildreg_itr(I40E_TX_ITR,
258 - q_vector->tx.target_itr);
259 + itr_idx = I40E_TX_ITR;
260 + interval = q_vector->tx.target_itr;
261 q_vector->tx.current_itr = q_vector->tx.target_itr;
262 q_vector->itr_countdown = ITR_COUNTDOWN_START;
263 } else if (q_vector->rx.current_itr != q_vector->rx.target_itr) {
264 /* Rx ITR needs to be increased, third priority */
265 - intval = i40e_buildreg_itr(I40E_RX_ITR,
266 - q_vector->rx.target_itr);
267 + itr_idx = I40E_RX_ITR;
268 + interval = q_vector->rx.target_itr;
269 q_vector->rx.current_itr = q_vector->rx.target_itr;
270 q_vector->itr_countdown = ITR_COUNTDOWN_START;
271 } else {
272 /* No ITR update, lowest priority */
273 - intval = i40e_buildreg_itr(I40E_ITR_NONE, 0);
274 if (q_vector->itr_countdown)
275 q_vector->itr_countdown--;
276 }
277
278 - if (!test_bit(__I40E_VSI_DOWN, vsi->state))
279 - wr32(hw, INTREG(q_vector->reg_idx), intval);
280 + /* Do not update interrupt control register if VSI is down */
281 + if (test_bit(__I40E_VSI_DOWN, vsi->state))
282 + return;
283 +
284 + /* Update ITR interval if necessary and enforce software interrupt
285 + * if we are exiting busy poll.
286 + */
287 + if (q_vector->in_busy_poll) {
288 + itr_val = i40e_buildreg_itr(itr_idx, interval, true);
289 + q_vector->in_busy_poll = false;
290 + } else {
291 + itr_val = i40e_buildreg_itr(itr_idx, interval, false);
292 + }
293 + wr32(hw, I40E_PFINT_DYN_CTLN(q_vector->reg_idx), itr_val);
294 }
295
296 /**
297 @@ -2859,6 +2897,8 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
298 */
299 if (likely(napi_complete_done(napi, work_done)))
300 i40e_update_enable_itr(vsi, q_vector);
301 + else
302 + q_vector->in_busy_poll = true;
303
304 return min(work_done, budget - 1);
305 }
306 diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
307 index 84e4dacde6f58..81f6a991bfb73 100644
308 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
309 +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
310 @@ -67,6 +67,7 @@ enum i40e_dyn_idx {
311 /* these are indexes into ITRN registers */
312 #define I40E_RX_ITR I40E_IDX_ITR0
313 #define I40E_TX_ITR I40E_IDX_ITR1
314 +#define I40E_SW_ITR I40E_IDX_ITR2
315
316 /* Supported RSS offloads */
317 #define I40E_DEFAULT_RSS_HENA ( \
318 --
319 2.43.0
320