]>
Commit | Line | Data |
---|---|---|
721db3cc GKH |
1 | From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001 |
2 | From: Steffen Maier <maier@linux.vnet.ibm.com> | |
3 | Date: Fri, 28 Jul 2017 12:30:57 +0200 | |
4 | Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response | |
5 | ||
6 | From: Steffen Maier <maier@linux.vnet.ibm.com> | |
7 | ||
8 | commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream. | |
9 | ||
10 | At the default trace level, we only trace unsuccessful events including | |
11 | FSF responses. | |
12 | ||
13 | zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to | |
14 | decide on an unsuccessful response. However, this is only one of multiple | |
15 | possible sources determining a failed struct zfcp_fsf_req. | |
16 | ||
17 | An FSF request can also "fail" if its response runs into an ERP timeout | |
18 | or if it gets dismissed because a higher level recovery was triggered | |
19 | [trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()]. | |
20 | FSF requests with ERP timeout are: | |
21 | FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA, | |
22 | FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or | |
23 | FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports, | |
24 | FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN. | |
25 | One example is slow queue processing which can cause follow-on errors, | |
26 | e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out. | |
27 | In order to see the root cause, we need to see late responses even if the | |
28 | channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD. | |
29 | Example trace records formatted with zfcpdbf from the s390-tools package: | |
30 | ||
31 | Timestamp : ... | |
32 | Area : REC | |
33 | Subarea : 00 | |
34 | Level : 1 | |
35 | Exception : - | |
36 | CPU ID : .. | |
37 | Caller : ... | |
38 | Record ID : 1 | |
39 | Tag : fcegpf1 | |
40 | LUN : 0xffffffffffffffff | |
41 | WWPN : 0x<WWPN> | |
42 | D_ID : 0x00<D_ID> | |
43 | Adapter status : 0x5400050b | |
44 | Port status : 0x41200000 | |
45 | LUN status : 0x00000000 | |
46 | Ready count : 0x00000001 | |
47 | Running count : 0x... | |
48 | ERP want : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT | |
49 | ERP need : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT | |
50 | | | |
51 | Timestamp : ... 30 seconds later | |
52 | Area : REC | |
53 | Subarea : 00 | |
54 | Level : 1 | |
55 | Exception : - | |
56 | CPU ID : .. | |
57 | Caller : ... | |
58 | Record ID : 2 | |
59 | Tag : erscf_2 | |
60 | LUN : 0xffffffffffffffff | |
61 | WWPN : 0x<WWPN> | |
62 | D_ID : 0x00<D_ID> | |
63 | Adapter status : 0x5400050b | |
64 | Port status : 0x41200000 | |
65 | LUN status : 0x00000000 | |
66 | Request ID : 0x<request_ID> | |
67 | ERP status : 0x10000000 ZFCP_STATUS_ERP_TIMEDOUT | |
68 | ERP step : 0x0800 ZFCP_ERP_STEP_PORT_OPENING | |
69 | ERP action : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT | |
70 | ERP count : 0x00 | |
71 | | | |
72 | Timestamp : ... later than previous record | |
73 | Area : HBA | |
74 | Subarea : 00 | |
75 | Level : 5 > default level => 3 <= default level | |
76 | Exception : - | |
77 | CPU ID : 00 | |
78 | Caller : ... | |
79 | Record ID : 1 | |
80 | Tag : fs_qtcb => fs_rerr | |
81 | Request ID : 0x<request_ID> | |
82 | Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED | |
83 | | ZFCP_STATUS_FSFREQ_CLEANUP | |
84 | FSF cmnd : 0x00000005 | |
85 | FSF sequence no: 0x... | |
86 | FSF issued : ... > 30 seconds ago | |
87 | FSF stat : 0x00000000 FSF_GOOD | |
88 | FSF stat qual : 00000000 00000000 00000000 00000000 | |
89 | Prot stat : 0x00000001 FSF_PROT_GOOD | |
90 | Prot stat qual : 00000000 00000000 00000000 00000000 | |
91 | Port handle : 0x... | |
92 | LUN handle : 0x00000000 | |
93 | QTCB log length: ... | |
94 | QTCB log info : ... | |
95 | ||
96 | In case of problems detecting that new responses are waiting on the input | |
97 | queue, we sooner or later trigger adapter recovery due to an FSF request | |
98 | timeout (trace tag "fsrth_1"). | |
99 | FSF requests with FSF request timeout are: | |
100 | typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also | |
101 | FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs, | |
102 | FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports, | |
103 | FSF_QTCB_FCP_CMND for task management function (LUN / target reset). | |
104 | One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD | |
105 | because the channel filled in the response via DMA into the request's QTCB. | |
106 | ||
107 | In a theroretical case, inject code can create an erroneous FSF request | |
108 | on purpose. If data router is enabled, it uses deferred error reporting. | |
109 | A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and | |
110 | SAM_STAT_GOOD. But on writing the read data to host memory via DMA, | |
111 | it can still fail, e.g. if an intentionally wrong scatter list does not | |
112 | provide enough space. Rather than getting an unsuccessful response, | |
113 | we get a QDIO activate check which in turn triggers adapter recovery. | |
114 | One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD | |
115 | because the channel filled in the response via DMA into the request's QTCB. | |
116 | Example trace records formatted with zfcpdbf from the s390-tools package: | |
117 | ||
118 | Timestamp : ... | |
119 | Area : HBA | |
120 | Subarea : 00 | |
121 | Level : 6 > default level => 3 <= default level | |
122 | Exception : - | |
123 | CPU ID : .. | |
124 | Caller : ... | |
125 | Record ID : 1 | |
126 | Tag : fs_norm => fs_rerr | |
127 | Request ID : 0x<request_ID2> | |
128 | Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED | |
129 | | ZFCP_STATUS_FSFREQ_CLEANUP | |
130 | FSF cmnd : 0x00000001 | |
131 | FSF sequence no: 0x... | |
132 | FSF issued : ... | |
133 | FSF stat : 0x00000000 FSF_GOOD | |
134 | FSF stat qual : 00000000 00000000 00000000 00000000 | |
135 | Prot stat : 0x00000001 FSF_PROT_GOOD | |
136 | Prot stat qual : ........ ........ 00000000 00000000 | |
137 | Port handle : 0x... | |
138 | LUN handle : 0x... | |
139 | | | |
140 | Timestamp : ... | |
141 | Area : SCSI | |
142 | Subarea : 00 | |
143 | Level : 3 | |
144 | Exception : - | |
145 | CPU ID : .. | |
146 | Caller : ... | |
147 | Record ID : 1 | |
148 | Tag : rsl_err | |
149 | Request ID : 0x<request_ID2> | |
150 | SCSI ID : 0x... | |
151 | SCSI LUN : 0x... | |
152 | SCSI result : 0x000e0000 DID_TRANSPORT_DISRUPTED | |
153 | SCSI retries : 0x00 | |
154 | SCSI allowed : 0x05 | |
155 | SCSI scribble : 0x<request_ID2> | |
156 | SCSI opcode : 28... Read(10) | |
157 | FCP rsp inf cod: 0x00 | |
158 | FCP rsp IU : 00000000 00000000 00000000 00000000 | |
159 | ^^ SAM_STAT_GOOD | |
160 | 00000000 00000000 | |
161 | ||
162 | Only with luck in both above cases, we could see a follow-on trace record | |
163 | of an unsuccesful event following a successful but late FSF response with | |
164 | FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests | |
165 | resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED | |
166 | [On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets | |
167 | ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure]. | |
168 | However, the reason for this follow-on trace was invisible because the | |
169 | corresponding HBA trace record was missing at the default trace level | |
170 | (by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open"). | |
171 | ||
172 | On adapter recovery, after we had shut down the QDIO queues, we perform | |
173 | unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED | |
174 | for each pending FSF request in zfcp_fsf_req_dismiss_all(). | |
175 | In order to find the root cause, we need to see all pseudo responses even | |
176 | if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD. | |
177 | ||
178 | Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED | |
179 | or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr". | |
180 | ||
181 | It does not matter that there are numerous places which set | |
182 | ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response | |
183 | early. These cases are based on protocol status != FSF_PROT_GOOD or | |
184 | == FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default | |
185 | as trace tag "fs_perr" or "fs_ferr" respectively. | |
186 | ||
187 | NB: The trace record with tag "fssrh_1" for status read buffers on dismiss | |
188 | all remains. zfcp_fsf_req_complete() handles this and returns early. | |
189 | All other FSF request types are handled separately and as described above. | |
190 | ||
191 | Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> | |
192 | Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features") | |
193 | Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels") | |
194 | Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com> | |
195 | Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com> | |
196 | Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> | |
197 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
198 | ||
199 | --- | |
200 | drivers/s390/scsi/zfcp_dbf.h | 6 +++++- | |
201 | 1 file changed, 5 insertions(+), 1 deletion(-) | |
202 | ||
203 | --- a/drivers/s390/scsi/zfcp_dbf.h | |
204 | +++ b/drivers/s390/scsi/zfcp_dbf.h | |
205 | @@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf | |
206 | { | |
207 | struct fsf_qtcb *qtcb = req->qtcb; | |
208 | ||
209 | - if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) && | |
210 | + if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED | | |
211 | + ZFCP_STATUS_FSFREQ_ERROR))) { | |
212 | + zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req); | |
213 | + | |
214 | + } else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) && | |
215 | (qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) { | |
216 | zfcp_dbf_hba_fsf_resp("fs_perr", 1, req); | |
217 |