releases/4.9.52/scsi-zfcp-trace-hba-fsf-response-by-default-on-dismiss-or-timedout-late-response.patch

   1 From fdb7cee3b9e3c561502e58137a837341f10cbf8b Mon Sep 17 00:00:00 2001
   2 From: Steffen Maier <maier@linux.vnet.ibm.com>
   3 Date: Fri, 28 Jul 2017 12:30:57 +0200
   4 Subject: scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
   5
   6 From: Steffen Maier <maier@linux.vnet.ibm.com>
   7
   8 commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.
   9
  10 At the default trace level, we only trace unsuccessful events including
  11 FSF responses.
  12
  13 zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
  14 decide on an unsuccessful response. However, this is only one of multiple
  15 possible sources determining a failed struct zfcp_fsf_req.
  16
  17 An FSF request can also "fail" if its response runs into an ERP timeout
  18 or if it gets dismissed because a higher level recovery was triggered
  19 [trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
  20 FSF requests with ERP timeout are:
  21 FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
  22 FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
  23 FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
  24 FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
  25 One example is slow queue processing which can cause follow-on errors,
  26 e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
  27 In order to see the root cause, we need to see late responses even if the
  28 channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
  29 Example trace records formatted with zfcpdbf from the s390-tools package:
  30
  31 Timestamp      : ...
  32 Area           : REC
  33 Subarea        : 00
  34 Level          : 1
  35 Exception      : -
  36 CPU ID         : ..
  37 Caller         : ...
  38 Record ID      : 1
  39 Tag            : fcegpf1
  40 LUN            : 0xffffffffffffffff
  41 WWPN           : 0x<WWPN>
  42 D_ID           : 0x00<D_ID>
  43 Adapter status : 0x5400050b
  44 Port status    : 0x41200000
  45 LUN status     : 0x00000000
  46 Ready count    : 0x00000001
  47 Running count  : 0x...
  48 ERP want       : 0x02                           ZFCP_ERP_ACTION_REOPEN_PORT
  49 ERP need       : 0x02                           ZFCP_ERP_ACTION_REOPEN_PORT
  50 |
  51 Timestamp      : ...                            30 seconds later
  52 Area           : REC
  53 Subarea        : 00
  54 Level          : 1
  55 Exception      : -
  56 CPU ID         : ..
  57 Caller         : ...
  58 Record ID      : 2
  59 Tag            : erscf_2
  60 LUN            : 0xffffffffffffffff
  61 WWPN           : 0x<WWPN>
  62 D_ID           : 0x00<D_ID>
  63 Adapter status : 0x5400050b
  64 Port status    : 0x41200000
  65 LUN status     : 0x00000000
  66 Request ID     : 0x<request_ID>
  67 ERP status     : 0x10000000                     ZFCP_STATUS_ERP_TIMEDOUT
  68 ERP step       : 0x0800                         ZFCP_ERP_STEP_PORT_OPENING
  69 ERP action     : 0x02                           ZFCP_ERP_ACTION_REOPEN_PORT
  70 ERP count      : 0x00
  71 |
  72 Timestamp      : ...                            later than previous record
  73 Area           : HBA
  74 Subarea        : 00
  75 Level          : 5      > default level         => 3    <= default level
  76 Exception      : -
  77 CPU ID         : 00
  78 Caller         : ...
  79 Record ID      : 1
  80 Tag            : fs_qtcb                        => fs_rerr
  81 Request ID     : 0x<request_ID>
  82 Request status : 0x00001010                     ZFCP_STATUS_FSFREQ_DISMISSED
  83                                                 | ZFCP_STATUS_FSFREQ_CLEANUP
  84 FSF cmnd       : 0x00000005
  85 FSF sequence no: 0x...
  86 FSF issued     : ...                            > 30 seconds ago
  87 FSF stat       : 0x00000000                     FSF_GOOD
  88 FSF stat qual  : 00000000 00000000 00000000 00000000
  89 Prot stat      : 0x00000001                     FSF_PROT_GOOD
  90 Prot stat qual : 00000000 00000000 00000000 00000000
  91 Port handle    : 0x...
  92 LUN handle     : 0x00000000
  93 QTCB log length: ...
  94 QTCB log info  : ...
  95
  96 In case of problems detecting that new responses are waiting on the input
  97 queue, we sooner or later trigger adapter recovery due to an FSF request
  98 timeout (trace tag "fsrth_1").
  99 FSF requests with FSF request timeout are:
 100 typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
 101 FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
 102 FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
 103 FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
 104 One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
 105 because the channel filled in the response via DMA into the request's QTCB.
 106
 107 In a theroretical case, inject code can create an erroneous FSF request
 108 on purpose. If data router is enabled, it uses deferred error reporting.
 109 A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
 110 SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
 111 it can still fail, e.g. if an intentionally wrong scatter list does not
 112 provide enough space. Rather than getting an unsuccessful response,
 113 we get a QDIO activate check which in turn triggers adapter recovery.
 114 One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
 115 because the channel filled in the response via DMA into the request's QTCB.
 116 Example trace records formatted with zfcpdbf from the s390-tools package:
 117
 118 Timestamp      : ...
 119 Area           : HBA
 120 Subarea        : 00
 121 Level          : 6      > default level         => 3    <= default level
 122 Exception      : -
 123 CPU ID         : ..
 124 Caller         : ...
 125 Record ID      : 1
 126 Tag            : fs_norm                        => fs_rerr
 127 Request ID     : 0x<request_ID2>
 128 Request status : 0x00001010                     ZFCP_STATUS_FSFREQ_DISMISSED
 129                                                 | ZFCP_STATUS_FSFREQ_CLEANUP
 130 FSF cmnd       : 0x00000001
 131 FSF sequence no: 0x...
 132 FSF issued     : ...
 133 FSF stat       : 0x00000000                     FSF_GOOD
 134 FSF stat qual  : 00000000 00000000 00000000 00000000
 135 Prot stat      : 0x00000001                     FSF_PROT_GOOD
 136 Prot stat qual : ........ ........ 00000000 00000000
 137 Port handle    : 0x...
 138 LUN handle     : 0x...
 139 |
 140 Timestamp      : ...
 141 Area           : SCSI
 142 Subarea        : 00
 143 Level          : 3
 144 Exception      : -
 145 CPU ID         : ..
 146 Caller         : ...
 147 Record ID      : 1
 148 Tag            : rsl_err
 149 Request ID     : 0x<request_ID2>
 150 SCSI ID        : 0x...
 151 SCSI LUN       : 0x...
 152 SCSI result    : 0x000e0000                     DID_TRANSPORT_DISRUPTED
 153 SCSI retries   : 0x00
 154 SCSI allowed   : 0x05
 155 SCSI scribble  : 0x<request_ID2>
 156 SCSI opcode    : 28...                          Read(10)
 157 FCP rsp inf cod: 0x00
 158 FCP rsp IU     : 00000000 00000000 00000000 00000000
 159                                          ^^     SAM_STAT_GOOD
 160                  00000000 00000000
 161
 162 Only with luck in both above cases, we could see a follow-on trace record
 163 of an unsuccesful event following a successful but late FSF response with
 164 FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
 165 resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
 166 [On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
 167 ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
 168 However, the reason for this follow-on trace was invisible because the
 169 corresponding HBA trace record was missing at the default trace level
 170 (by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").
 171
 172 On adapter recovery, after we had shut down the QDIO queues, we perform
 173 unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
 174 for each pending FSF request in zfcp_fsf_req_dismiss_all().
 175 In order to find the root cause, we need to see all pseudo responses even
 176 if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
 177
 178 Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
 179 or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".
 180
 181 It does not matter that there are numerous places which set
 182 ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
 183 early. These cases are based on protocol status != FSF_PROT_GOOD or
 184 == FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
 185 as trace tag "fs_perr" or "fs_ferr" respectively.
 186
 187 NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
 188 all remains. zfcp_fsf_req_complete() handles this and returns early.
 189 All other FSF request types are handled separately and as described above.
 190
 191 Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
 192 Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
 193 Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
 194 Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
 195 Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
 196 Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
 197 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 198
 199 ---
 200  drivers/s390/scsi/zfcp_dbf.h |    6 +++++-
 201  1 file changed, 5 insertions(+), 1 deletion(-)
 202
 203 --- a/drivers/s390/scsi/zfcp_dbf.h
 204 +++ b/drivers/s390/scsi/zfcp_dbf.h
 205 @@ -323,7 +323,11 @@ void zfcp_dbf_hba_fsf_response(struct zf
 206  {
 207         struct fsf_qtcb *qtcb = req->qtcb;
 208
 209 -       if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
 210 +       if (unlikely(req->status & (ZFCP_STATUS_FSFREQ_DISMISSED |
 211 +                                   ZFCP_STATUS_FSFREQ_ERROR))) {
 212 +               zfcp_dbf_hba_fsf_resp("fs_rerr", 3, req);
 213 +
 214 +       } else if ((qtcb->prefix.prot_status != FSF_PROT_GOOD) &&
 215             (qtcb->prefix.prot_status != FSF_PROT_FSF_STATUS_PRESENTED)) {
 216                 zfcp_dbf_hba_fsf_resp("fs_perr", 1, req);
 217