From 50dbe9581bfed964d259f0938d8f7b8f3def3c7b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?H=C3=A5kon=20Bugge?= <haakon.bugge@oracle.com>
Date: Sun, 17 Feb 2019 15:45:12 +0100
Subject: IB/mlx4: Increase the timeout for CM cache
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[ Upstream commit 2612d723aadcf8281f9bf8305657129bd9f3cd57 ]

Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VF drivers have separate namespaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.

Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to be evicted from the cache. But life is not perfect:
remote peers may die or be rebooted. Hence, there is a timeout to wipe
out a cache entry, after which the PF driver assumes the remote peer
has gone.

During workloads where a high number of QPs are destroyed concurrently,
an excessive number of CM DREQ retries has been observed.

The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This setup uses dual-ported HCAs,
so we have 16 vPorts per physical server.

64 processes are associated with each vPort and create and destroy
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, 16K QPs in all. The QPs are created/destroyed using the
CM.

When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe, as a sum over the 16 vPorts on one of
the nodes:

cm_rx_duplicates:
      dreq  2102
cm_rx_msgs:
      drep  1989
      dreq  6195
       rep  3968
       req  4224
       rtu  4224
cm_tx_msgs:
      drep  4093
      dreq 27568
       rep  4224
       req  3968
       rtu  3968
cm_tx_retries:
      dreq 23469

Note that the active/passive side is equally distributed between the
two nodes.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
      dreq  2460
[]
cm_tx_retries:
      dreq  3010
       req    47

Increasing the timeout further didn't help, as these duplicates and
retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell to about 10 for both of them.

Adjustment of the CMA timeout is not part of this commit.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/infiniband/hw/mlx4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index 39a488889fc7..5dc920fe1326 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -39,7 +39,7 @@
 
 #include "mlx4_ib.h"
 
-#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
+#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
 
 struct id_map_entry {
 	struct rb_node node;
-- 
2.19.1