]> git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blame - releases/4.9.52/scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch
4.9-stable patches
[thirdparty/kernel/stable-queue.git] / releases / 4.9.52 / scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch
CommitLineData
721db3cc
GKH
1From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001
2From: Steffen Maier <maier@linux.vnet.ibm.com>
3Date: Fri, 28 Jul 2017 12:30:57 +0200
4Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
5
6From: Steffen Maier <maier@linux.vnet.ibm.com>
7
8commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.
9
10At the default trace level, we only trace unsuccessful events including
11FSF responses.
12
13zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
14decide on an unsuccessful response. However, this is only one of multiple
15possible sources determining a failed struct zfcp_fsf_req.
16
17An FSF request can also "fail" if its response runs into an ERP timeout
18or if it gets dismissed because a higher level recovery was triggered
19[trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
20FSF requests with ERP timeout are:
21FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
22FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
23FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
24FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
25One example is slow queue processing which can cause follow-on errors,
26e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
27In order to see the root cause, we need to see late responses even if the
28channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
29Example trace records formatted with zfcpdbf from the s390-tools package:
30
31Timestamp : ...
32Area : REC
33Subarea : 00
34Level : 1
35Exception : -
36CPU ID : ..
37Caller : ...
38Record ID : 1
39Tag : fcegpf1
40LUN : 0xffffffffffffffff
41WWPN : 0x<WWPN>
42D_ID : 0x00<D_ID>
43Adapter status : 0x5400050b
44Port status : 0x41200000
45LUN status : 0x00000000
46Ready count : 0x00000001
47Running count : 0x...
48ERP want : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
49ERP need : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
50|
51Timestamp : ... 30 seconds later
52Area : REC
53Subarea : 00
54Level : 1
55Exception : -
56CPU ID : ..
57Caller : ...
58Record ID : 2
59Tag : erscf_2
60LUN : 0xffffffffffffffff
61WWPN : 0x<WWPN>
62D_ID : 0x00<D_ID>
63Adapter status : 0x5400050b
64Port status : 0x41200000
65LUN status : 0x00000000
66Request ID : 0x<request_ID>
67ERP status : 0x10000000 ZFCP_STATUS_ERP_TIMEDOUT
68ERP step : 0x0800 ZFCP_ERP_STEP_PORT_OPENING
69ERP action : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
70ERP count : 0x00
71|
72Timestamp : ... later than previous record
73Area : HBA
74Subarea : 00
75Level : 5 > default level => 3 <= default level
76Exception : -
77CPU ID : 00
78Caller : ...
79Record ID : 1
80Tag : fs_qtcb => fs_rerr
81Request ID : 0x<request_ID>
82Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
83 | ZFCP_STATUS_FSFREQ_CLEANUP
84FSF cmnd : 0x00000005
85FSF sequence no: 0x...
86FSF issued : ... > 30 seconds ago
87FSF stat : 0x00000000 FSF_GOOD
88FSF stat qual : 00000000 00000000 00000000 00000000
89Prot stat : 0x00000001 FSF_PROT_GOOD
90Prot stat qual : 00000000 00000000 00000000 00000000
91Port handle : 0x...
92LUN handle : 0x00000000
93QTCB log length: ...
94QTCB log info : ...
95
96In case of problems detecting that new responses are waiting on the input
97queue, we sooner or later trigger adapter recovery due to an FSF request
98timeout (trace tag "fsrth_1").
99FSF requests with FSF request timeout are:
100typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
101FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
102FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
103FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
104One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
105because the channel filled in the response via DMA into the request's QTCB.
106
107In a theroretical case, inject code can create an erroneous FSF request
108on purpose. If data router is enabled, it uses deferred error reporting.
109A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
110SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
111it can still fail, e.g. if an intentionally wrong scatter list does not
112provide enough space. Rather than getting an unsuccessful response,
113we get a QDIO activate check which in turn triggers adapter recovery.
114One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
115because the channel filled in the response via DMA into the request's QTCB.
116Example trace records formatted with zfcpdbf from the s390-tools package:
117
118Timestamp : ...
119Area : HBA
120Subarea : 00
121Level : 6 > default level => 3 <= default level
122Exception : -
123CPU ID : ..
124Caller : ...
125Record ID : 1
126Tag : fs_norm => fs_rerr
127Request ID : 0x<request_ID2>
128Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
129 | ZFCP_STATUS_FSFREQ_CLEANUP
130FSF cmnd : 0x00000001
131FSF sequence no: 0x...
132FSF issued : ...
133FSF stat : 0x00000000 FSF_GOOD
134FSF stat qual : 00000000 00000000 00000000 00000000
135Prot stat : 0x00000001 FSF_PROT_GOOD
136Prot stat qual : ........ ........ 00000000 00000000
137Port handle : 0x...
138LUN handle : 0x...
139|
140Timestamp : ...
141Area : SCSI
142Subarea : 00
143Level : 3
144Exception : -
145CPU ID : ..
146Caller : ...
147Record ID : 1
148Tag : rsl_err
149Request ID : 0x<request_ID2>
150SCSI ID : 0x...
151SCSI LUN : 0x...
152SCSI result : 0x000e0000 DID_TRANSPORT_DISRUPTED
153SCSI retries : 0x00
154SCSI allowed : 0x05
155SCSI scribble : 0x<request_ID2>
156SCSI opcode : 28... Read(10)
157FCP rsp inf cod: 0x00
158FCP rsp IU : 00000000 00000000 00000000 00000000
159 ^^ SAM_STAT_GOOD
160 00000000 00000000
161
162Only with luck in both above cases, we could see a follow-on trace record
163of an unsuccesful event following a successful but late FSF response with
164FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
165resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
166[On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
167ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
168However, the reason for this follow-on trace was invisible because the
169corresponding HBA trace record was missing at the default trace level
170(by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").
171
172On adapter recovery, after we had shut down the QDIO queues, we perform
173unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
174for each pending FSF request in zfcp_fsf_req_dismiss_all().
175In order to find the root cause, we need to see all pseudo responses even
176if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
177
178Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
179or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".
180
181It does not matter that there are numerous places which set
182ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
183early. These cases are based on protocol status != FSF_PROT_GOOD or
184== FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
185as trace tag "fs_perr" or "fs_ferr" respectively.
186
187NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
188all remains. zfcp_fsf_req_complete() handles this and returns early.
189All other FSF request types are handled separately and as described above.
190
191Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
192Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
193Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
194Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
195Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
196Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
197Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
198
199---
200 drivers/s390/scsi/zfcp_dbf.h | 6 +++++-
201 1 file changed, 5 insertions(+), 1 deletion(-)
202
203--- a/drivers/s390/scsi/zfcp_dbf.h
204+++ b/drivers/s390/scsi/zfcp_dbf.h
205@@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf
206 {
207 struct fsf_qtcb *qtcb = req->qtcb;
208
209- if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
210+ if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED |
211+ ZFCP_STATUS_FSFREQ_ERROR))) {
212+ zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req);
213+
214+ } else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
215 (qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) {
216 zfcp_dbf_hba_fsf_resp("fs_perr", 1, req);
217