Clear hca_devcom_comp in device's private data after unregistering it in
LAG teardown. Otherwise a slightly lagging second pass through
mlx5_unload_one() might try to unregister it again and trip over
use-after-free.
On s390 almost all PCI level recovery events trigger two passes through
mxl5_unload_one() - one through the poll_health() method and one through
mlx5_pci_err_detected() as callback from generic PCI error recovery.
While testing PCI error recovery paths with more kernel debug features
enabled, this issue reproducibly led to kernel panics with the following
call chain:
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address:
6b6b6b6b6b6b6000 TEID:
6b6b6b6b6b6b6803 ESOP-2 FSI
Fault in home space mode while using kernel ASCE.
AS:
00000000705c4007 R3:
0000000000000024
Oops: 0038 ilc:3 [#1]SMP
CPU: 14 UID: 0 PID: 156 Comm: kmcheck Kdump: loaded Not tainted
6.18.0-
20251130.rc7.git0.
16131a59cab1.300.fc43.s390x+debug #1 PREEMPT
Krnl PSW :
0404e00180000000 0000020fc86aa1dc (__lock_acquire+0x5c/0x15f0)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS:
0000000000000000 0000020f00000001 6b6b6b6b6b6b6c33 0000000000000000
0000000000000000 0000000000000000 0000000000000001 0000000000000000
0000000000000000 0000020fca28b820 0000000000000000 0000010a1ced8100
0000010a1ced8100 0000020fc9775068 0000018fce14f8b8 0000018fce14f7f8
Krnl Code:
0000020fc86aa1cc:
e3b003400004 lg %r11,832
0000020fc86aa1d2:
a7840211 brc 8,
0000020fc86aa5f4
*
0000020fc86aa1d6:
c09000df0b25 larl %r9,
0000020fca28b820
>
0000020fc86aa1dc:
d50790002000 clc 0(8,%r9),0(%r2)
0000020fc86aa1e2:
a7840209 brc 8,
0000020fc86aa5f4
0000020fc86aa1e6:
c0e001100401 larl %r14,
0000020fca8aa9e8
0000020fc86aa1ec:
c01000e25a00 larl %r1,
0000020fca2f55ec
0000020fc86aa1f2:
a7eb00e8 aghi %r14,232
Call Trace:
__lock_acquire+0x5c/0x15f0
lock_acquire.part.0+0xf8/0x270
lock_acquire+0xb0/0x1b0
down_write+0x5a/0x250
mlx5_detach_device+0x42/0x110 [mlx5_core]
mlx5_unload_one_devl_locked+0x50/0xc0 [mlx5_core]
mlx5_unload_one+0x42/0x60 [mlx5_core]
mlx5_pci_err_detected+0x94/0x150 [mlx5_core]
zpci_event_attempt_error_recovery+0xcc/0x388
Fixes: 5a977b5833b7 ("net/mlx5: Lag, move devcom registration to LAG layer")
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20251202-fix_lag-v1-1-59e8177ffce0@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>