[thirdparty/kernel/stable-queue.git] / releases / 4.9.52 / scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch

From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001
From: Steffen Maier <maier@linux.vnet.ibm.com>
Date: Fri, 28 Jul 2017 12:30:57 +0200
Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response

From: Steffen Maier <maier@linux.vnet.ibm.com>

commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.

At the default trace level, we only trace unsuccessful events including
FSF responses.

zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
decide on an unsuccessful response. However, this is only one of multiple
possible sources determining a failed struct zfcp_fsf_req.

An FSF request can also "fail" if its response runs into an ERP timeout
or if it gets dismissed because a higher level recovery was triggered
[trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
FSF requests with ERP timeout are:
FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
One example is slow queue processing which can cause follow-on errors,
e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
In order to see the root cause, we need to see late responses even if the
channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fcegpf1
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Ready count    : 0x00000001
Running count  : 0x...
ERP want       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
ERP need       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
|
Timestamp      : ...				30 seconds later
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 2
Tag            : erscf_2
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Request ID     : 0x<request_ID>
ERP status     : 0x10000000			ZFCP_STATUS_ERP_TIMEDOUT
ERP step       : 0x0800				ZFCP_ERP_STEP_PORT_OPENING
ERP action     : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
ERP count      : 0x00
|
Timestamp      : ...				later than previous record
Area           : HBA
Subarea        : 00
Level          : 5	> default level		=> 3	<= default level
Exception      : -
CPU ID         : 00
Caller         : ...
Record ID      : 1
Tag            : fs_qtcb			=> fs_rerr
Request ID     : 0x<request_ID>
Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
						| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000005
FSF sequence no: 0x...
FSF issued     : ...				> 30 seconds ago
FSF stat       : 0x00000000			FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001			FSF_PROT_GOOD
Prot stat qual : 00000000 00000000 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x00000000
QTCB log length: ...
QTCB log info  : ...

In case of problems detecting that new responses are waiting on the input
queue, we sooner or later trigger adapter recovery due to an FSF request
timeout (trace tag "fsrth_1").
FSF requests with FSF request timeout are:
typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.

In a theroretical case, inject code can create an erroneous FSF request
on purpose. If data router is enabled, it uses deferred error reporting.
A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
it can still fail, e.g. if an intentionally wrong scatter list does not
provide enough space. Rather than getting an unsuccessful response,
we get a QDIO activate check which in turn triggers adapter recovery.
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : HBA
Subarea        : 00
Level          : 6	> default level		=> 3	<= default level
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fs_norm			=> fs_rerr
Request ID     : 0x<request_ID2>
Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
						| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000001
FSF sequence no: 0x...
FSF issued     : ...
FSF stat       : 0x00000000			FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001			FSF_PROT_GOOD
Prot stat qual : ........ ........ 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x...
|
Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : rsl_err
Request ID     : 0x<request_ID2>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x000e0000			DID_TRANSPORT_DISRUPTED
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x<request_ID2>
SCSI opcode    : 28...				Read(10)
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000000 00000000
                                         ^^	SAM_STAT_GOOD
                 00000000 00000000

Only with luck in both above cases, we could see a follow-on trace record
of an unsuccesful event following a successful but late FSF response with
FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
[On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
However, the reason for this follow-on trace was invisible because the
corresponding HBA trace record was missing at the default trace level
(by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").

On adapter recovery, after we had shut down the QDIO queues, we perform
unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
for each pending FSF request in zfcp_fsf_req_dismiss_all().
In order to find the root cause, we need to see all pseudo responses even
if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.

Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".

It does not matter that there are numerous places which set
ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
early. These cases are based on protocol status != FSF_PROT_GOOD or
== FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
as trace tag "fs_perr" or "fs_ferr" respectively.

NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
all remains. zfcp_fsf_req_complete() handles this and returns early.
All other FSF request types are handled separately and as described above.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/s390/scsi/zfcp_dbf.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- a/drivers/s390/scsi/zfcp_dbf.h
+++ b/drivers/s390/scsi/zfcp_dbf.h
@@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf
 {
 	struct fsf_qtcb *qtcb = req->qtcb;
 
-	if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
+	if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED |
+				    ZFCP_STATUS_FSFREQ_ERROR))) {
+		zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req);
+
+	} else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
 	    (qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) {
 		zfcp_dbf_hba_fsf_resp("fs_perr", 1, req);
Commit	Line	Data
721db3cc GKH	1	From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001
	2	From: Steffen Maier <maier@linux.vnet.ibm.com>
	3	Date: Fri, 28 Jul 2017 12:30:57 +0200
	4	Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
	5
	6	From: Steffen Maier <maier@linux.vnet.ibm.com>
	7
	8	commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.
	9
	10	At the default trace level, we only trace unsuccessful events including
	11	FSF responses.
	12
	13	zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
	14	decide on an unsuccessful response. However, this is only one of multiple
	15	possible sources determining a failed struct zfcp_fsf_req.
	16
	17	An FSF request can also "fail" if its response runs into an ERP timeout
	18	or if it gets dismissed because a higher level recovery was triggered
	19	[trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
	20	FSF requests with ERP timeout are:
	21	FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
	22	FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
	23	FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
	24	FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
	25	One example is slow queue processing which can cause follow-on errors,
	26	e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
	27	In order to see the root cause, we need to see late responses even if the
	28	channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
	29	Example trace records formatted with zfcpdbf from the s390-tools package:
	30
	31	Timestamp : ...
	32	Area : REC
	33	Subarea : 00
	34	Level : 1
	35	Exception : -
	36	CPU ID : ..
	37	Caller : ...
	38	Record ID : 1
	39	Tag : fcegpf1
	40	LUN : 0xffffffffffffffff
	41	WWPN : 0x<WWPN>
	42	D_ID : 0x00<D_ID>
	43	Adapter status : 0x5400050b
	44	Port status : 0x41200000
	45	LUN status : 0x00000000
	46	Ready count : 0x00000001
	47	Running count : 0x...
	48	ERP want : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
	49	ERP need : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
	50	\|
	51	Timestamp : ... 30 seconds later
	52	Area : REC
	53	Subarea : 00
	54	Level : 1
	55	Exception : -
	56	CPU ID : ..
	57	Caller : ...
	58	Record ID : 2
	59	Tag : erscf_2
	60	LUN : 0xffffffffffffffff
	61	WWPN : 0x<WWPN>
	62	D_ID : 0x00<D_ID>
	63	Adapter status : 0x5400050b
	64	Port status : 0x41200000
65	LUN status : 0x00000000
66	Request ID : 0x<request_ID>
67	ERP status : 0x10000000 ZFCP_STATUS_ERP_TIMEDOUT
68	ERP step : 0x0800 ZFCP_ERP_STEP_PORT_OPENING
69	ERP action : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
70	ERP count : 0x00
71	\|
72	Timestamp : ... later than previous record
73	Area : HBA
74	Subarea : 00
75	Level : 5 > default level => 3 <= default level
76	Exception : -
77	CPU ID : 00
78	Caller : ...
79	Record ID : 1
80	Tag : fs_qtcb => fs_rerr
81	Request ID : 0x<request_ID>
82	Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
83	\| ZFCP_STATUS_FSFREQ_CLEANUP
84	FSF cmnd : 0x00000005
85	FSF sequence no: 0x...
86	FSF issued : ... > 30 seconds ago
87	FSF stat : 0x00000000 FSF_GOOD
88	FSF stat qual : 00000000 00000000 00000000 00000000
89	Prot stat : 0x00000001 FSF_PROT_GOOD
90	Prot stat qual : 00000000 00000000 00000000 00000000
91	Port handle : 0x...
92	LUN handle : 0x00000000
93	QTCB log length: ...
94	QTCB log info : ...
95
96	In case of problems detecting that new responses are waiting on the input
97	queue, we sooner or later trigger adapter recovery due to an FSF request
98	timeout (trace tag "fsrth_1").
99	FSF requests with FSF request timeout are:
100	typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
101	FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
102	FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
103	FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
104	One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
105	because the channel filled in the response via DMA into the request's QTCB.
106
107	In a theroretical case, inject code can create an erroneous FSF request
108	on purpose. If data router is enabled, it uses deferred error reporting.
109	A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
110	SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
111	it can still fail, e.g. if an intentionally wrong scatter list does not
112	provide enough space. Rather than getting an unsuccessful response,
113	we get a QDIO activate check which in turn triggers adapter recovery.
114	One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
115	because the channel filled in the response via DMA into the request's QTCB.
116	Example trace records formatted with zfcpdbf from the s390-tools package:
117
118	Timestamp : ...
119	Area : HBA
120	Subarea : 00
121	Level : 6 > default level => 3 <= default level
122	Exception : -
123	CPU ID : ..
124	Caller : ...
125	Record ID : 1
126	Tag : fs_norm => fs_rerr
127	Request ID : 0x<request_ID2>
128	Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
129	\| ZFCP_STATUS_FSFREQ_CLEANUP
130	FSF cmnd : 0x00000001
131	FSF sequence no: 0x...
132	FSF issued : ...
133	FSF stat : 0x00000000 FSF_GOOD
134	FSF stat qual : 00000000 00000000 00000000 00000000
135	Prot stat : 0x00000001 FSF_PROT_GOOD
136	Prot stat qual : ........ ........ 00000000 00000000
137	Port handle : 0x...
138	LUN handle : 0x...
139	\|
140	Timestamp : ...
141	Area : SCSI
142	Subarea : 00
143	Level : 3
144	Exception : -
145	CPU ID : ..
146	Caller : ...
147	Record ID : 1
148	Tag : rsl_err
149	Request ID : 0x<request_ID2>
150	SCSI ID : 0x...
151	SCSI LUN : 0x...
152	SCSI result : 0x000e0000 DID_TRANSPORT_DISRUPTED
153	SCSI retries : 0x00
154	SCSI allowed : 0x05
155	SCSI scribble : 0x<request_ID2>
156	SCSI opcode : 28... Read(10)
157	FCP rsp inf cod: 0x00
158	FCP rsp IU : 00000000 00000000 00000000 00000000
159	^^ SAM_STAT_GOOD
160	00000000 00000000
161
162	Only with luck in both above cases, we could see a follow-on trace record
163	of an unsuccesful event following a successful but late FSF response with
164	FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
165	resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
166	[On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
167	ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
168	However, the reason for this follow-on trace was invisible because the
169	corresponding HBA trace record was missing at the default trace level
170	(by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").
171
172	On adapter recovery, after we had shut down the QDIO queues, we perform
173	unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
174	for each pending FSF request in zfcp_fsf_req_dismiss_all().
175	In order to find the root cause, we need to see all pseudo responses even
176	if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
177
178	Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
179	or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".
180
181	It does not matter that there are numerous places which set
182	ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
183	early. These cases are based on protocol status != FSF_PROT_GOOD or
184	== FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
185	as trace tag "fs_perr" or "fs_ferr" respectively.
186
187	NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
188	all remains. zfcp_fsf_req_complete() handles this and returns early.
189	All other FSF request types are handled separately and as described above.
190
191	Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
192	Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
193	Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
194	Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
195	Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
196	Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
197	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
198
199	---
200	drivers/s390/scsi/zfcp_dbf.h \| 6 +++++-
201	1 file changed, 5 insertions(+), 1 deletion(-)
202
203	--- a/drivers/s390/scsi/zfcp_dbf.h
204	+++ b/drivers/s390/scsi/zfcp_dbf.h
205	@@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf
206	{
207	struct fsf_qtcb *qtcb = req->qtcb;
208
209	- if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
210	+ if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED \|
211	+ ZFCP_STATUS_FSFREQ_ERROR))) {
212	+ zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req);
213	+
214	+ } else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
215	(qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) {
216	zfcp_dbf_hba_fsf_resp("fs_perr", 1, req);
217