]> git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blob - releases/4.9.52/scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch
4.9-stable patches
[thirdparty/kernel/stable-queue.git] / releases / 4.9.52 / scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch
1 From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001
2 From: Steffen Maier <maier@linux.vnet.ibm.com>
3 Date: Fri, 28 Jul 2017 12:30:57 +0200
4 Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
5
6 From: Steffen Maier <maier@linux.vnet.ibm.com>
7
8 commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.
9
10 At the default trace level, we only trace unsuccessful events including
11 FSF responses.
12
13 zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
14 decide on an unsuccessful response. However, this is only one of multiple
15 possible sources determining a failed struct zfcp_fsf_req.
16
17 An FSF request can also "fail" if its response runs into an ERP timeout
18 or if it gets dismissed because a higher level recovery was triggered
19 [trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
20 FSF requests with ERP timeout are:
21 FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
22 FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
23 FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
24 FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
25 One example is slow queue processing which can cause follow-on errors,
26 e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
27 In order to see the root cause, we need to see late responses even if the
28 channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
29 Example trace records formatted with zfcpdbf from the s390-tools package:
30
31 Timestamp : ...
32 Area : REC
33 Subarea : 00
34 Level : 1
35 Exception : -
36 CPU ID : ..
37 Caller : ...
38 Record ID : 1
39 Tag : fcegpf1
40 LUN : 0xffffffffffffffff
41 WWPN : 0x<WWPN>
42 D_ID : 0x00<D_ID>
43 Adapter status : 0x5400050b
44 Port status : 0x41200000
45 LUN status : 0x00000000
46 Ready count : 0x00000001
47 Running count : 0x...
48 ERP want : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
49 ERP need : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
50 |
51 Timestamp : ... 30 seconds later
52 Area : REC
53 Subarea : 00
54 Level : 1
55 Exception : -
56 CPU ID : ..
57 Caller : ...
58 Record ID : 2
59 Tag : erscf_2
60 LUN : 0xffffffffffffffff
61 WWPN : 0x<WWPN>
62 D_ID : 0x00<D_ID>
63 Adapter status : 0x5400050b
64 Port status : 0x41200000
65 LUN status : 0x00000000
66 Request ID : 0x<request_ID>
67 ERP status : 0x10000000 ZFCP_STATUS_ERP_TIMEDOUT
68 ERP step : 0x0800 ZFCP_ERP_STEP_PORT_OPENING
69 ERP action : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
70 ERP count : 0x00
71 |
72 Timestamp : ... later than previous record
73 Area : HBA
74 Subarea : 00
75 Level : 5 > default level => 3 <= default level
76 Exception : -
77 CPU ID : 00
78 Caller : ...
79 Record ID : 1
80 Tag : fs_qtcb => fs_rerr
81 Request ID : 0x<request_ID>
82 Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
83 | ZFCP_STATUS_FSFREQ_CLEANUP
84 FSF cmnd : 0x00000005
85 FSF sequence no: 0x...
86 FSF issued : ... > 30 seconds ago
87 FSF stat : 0x00000000 FSF_GOOD
88 FSF stat qual : 00000000 00000000 00000000 00000000
89 Prot stat : 0x00000001 FSF_PROT_GOOD
90 Prot stat qual : 00000000 00000000 00000000 00000000
91 Port handle : 0x...
92 LUN handle : 0x00000000
93 QTCB log length: ...
94 QTCB log info : ...
95
96 In case of problems detecting that new responses are waiting on the input
97 queue, we sooner or later trigger adapter recovery due to an FSF request
98 timeout (trace tag "fsrth_1").
99 FSF requests with FSF request timeout are:
100 typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
101 FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
102 FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
103 FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
104 One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
105 because the channel filled in the response via DMA into the request's QTCB.
106
107 In a theroretical case, inject code can create an erroneous FSF request
108 on purpose. If data router is enabled, it uses deferred error reporting.
109 A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
110 SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
111 it can still fail, e.g. if an intentionally wrong scatter list does not
112 provide enough space. Rather than getting an unsuccessful response,
113 we get a QDIO activate check which in turn triggers adapter recovery.
114 One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
115 because the channel filled in the response via DMA into the request's QTCB.
116 Example trace records formatted with zfcpdbf from the s390-tools package:
117
118 Timestamp : ...
119 Area : HBA
120 Subarea : 00
121 Level : 6 > default level => 3 <= default level
122 Exception : -
123 CPU ID : ..
124 Caller : ...
125 Record ID : 1
126 Tag : fs_norm => fs_rerr
127 Request ID : 0x<request_ID2>
128 Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
129 | ZFCP_STATUS_FSFREQ_CLEANUP
130 FSF cmnd : 0x00000001
131 FSF sequence no: 0x...
132 FSF issued : ...
133 FSF stat : 0x00000000 FSF_GOOD
134 FSF stat qual : 00000000 00000000 00000000 00000000
135 Prot stat : 0x00000001 FSF_PROT_GOOD
136 Prot stat qual : ........ ........ 00000000 00000000
137 Port handle : 0x...
138 LUN handle : 0x...
139 |
140 Timestamp : ...
141 Area : SCSI
142 Subarea : 00
143 Level : 3
144 Exception : -
145 CPU ID : ..
146 Caller : ...
147 Record ID : 1
148 Tag : rsl_err
149 Request ID : 0x<request_ID2>
150 SCSI ID : 0x...
151 SCSI LUN : 0x...
152 SCSI result : 0x000e0000 DID_TRANSPORT_DISRUPTED
153 SCSI retries : 0x00
154 SCSI allowed : 0x05
155 SCSI scribble : 0x<request_ID2>
156 SCSI opcode : 28... Read(10)
157 FCP rsp inf cod: 0x00
158 FCP rsp IU : 00000000 00000000 00000000 00000000
159 ^^ SAM_STAT_GOOD
160 00000000 00000000
161
162 Only with luck in both above cases, we could see a follow-on trace record
163 of an unsuccesful event following a successful but late FSF response with
164 FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
165 resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
166 [On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
167 ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
168 However, the reason for this follow-on trace was invisible because the
169 corresponding HBA trace record was missing at the default trace level
170 (by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").
171
172 On adapter recovery, after we had shut down the QDIO queues, we perform
173 unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
174 for each pending FSF request in zfcp_fsf_req_dismiss_all().
175 In order to find the root cause, we need to see all pseudo responses even
176 if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
177
178 Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
179 or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".
180
181 It does not matter that there are numerous places which set
182 ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
183 early. These cases are based on protocol status != FSF_PROT_GOOD or
184 == FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
185 as trace tag "fs_perr" or "fs_ferr" respectively.
186
187 NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
188 all remains. zfcp_fsf_req_complete() handles this and returns early.
189 All other FSF request types are handled separately and as described above.
190
191 Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
192 Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
193 Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
194 Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
195 Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
196 Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
197 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
198
199 ---
200 drivers/s390/scsi/zfcp_dbf.h | 6 +++++-
201 1 file changed, 5 insertions(+), 1 deletion(-)
202
203 --- a/drivers/s390/scsi/zfcp_dbf.h
204 +++ b/drivers/s390/scsi/zfcp_dbf.h
205 @@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf
206 {
207 struct fsf_qtcb *qtcb = req->qtcb;
208
209 - if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
210 + if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED |
211 + ZFCP_STATUS_FSFREQ_ERROR))) {
212 + zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req);
213 +
214 + } else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
215 (qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) {
216 zfcp_dbf_hba_fsf_resp("fs_perr", 1, req);
217