RDMA/rtrs-srv: Add check and closure for possible zombie paths
During several network incidents, a number of RTRS paths for a session
went through disconnect and reconnect phase. However, some of those did
not auto-reconnect successfully. Instead they failed with the following
logs,
On client,
kernel: rtrs_client L1991: <sess-name>: Connect rejected: status 28
(consumer defined), rtrs errno -104
kernel: rtrs_client L2698: <sess-name>: init_conns() failed: err=-104
path=gid:<gid1>@gid:<gid2> [mlx4_0:1]
On server, (log a)
kernel: ibtrs_server L1868: <>: Connection already exists: 0
When the misbehaving path was removed, and add_path was called to re-add
the path, the log on client side changed to, (log b)
kernel: rtrs_client L1991: <sess-name>: Connect rejected: status 28
(consumer defined), rtrs errno -17
There was no log on the server side for this, which is expected since
there is no logging in that path,
if (unlikely(__is_path_w_addr_exists(srv, &cm_id->route.addr))) {
err = -EEXIST;
goto err;
Because of the following check on server side,
if (unlikely(sess->state != IBTRS_SRV_CONNECTING)) {
ibtrs_err(s, "Session in wrong state: %s\n",
.. we know that the path in (log a) was in CONNECTING state.
The above state of the path persists for as long as we leave the session
be. This means that the path is in some zombie state, probably waiting
for the info_req packet to arrive, which never does.
The changes in this commits does 2 things.
1) Add logs at places where we see the errors happening. The logs would
shed more light at the state and lifetime of such zombie paths.
2) Close such zombie sessions, only if they are in CONNECTING state, and
after an inactivity period of 30 seconds.
i) The state check prevents closure of paths which are CONNECTED.
Also, from the above logs and code, we already know that the path could
only be on CONNECTING state, so we play safe and narrow our impact surface
area by closing only CONNECTING paths.
ii) The inactivity period is to allow requests for other cid to finish
processing, or for any stray packets to arrive/fail.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>