How do I interpret scsi status messages in RHEL like "sd 2:0:0:243: SCSI error: return code = 0x08000002"?

Issue

Environment

  • Red Hat Enterprise Linux (RHEL) 2.1, 3, 4, 5, 6

Resolution

Each return code consists of four parts and is returned by the scsi mid-layer. Note that the return code can sometimes appear truncated, that is, have fewer than four parts, because leading zeros are sometimes suppressed on output.

0x08  00  00  02
   D   C   B   A
               A: status_byte = set from target device,                     e.g. SAM_STAT_CHECK_CONDITION
           B    : msg_byte    = return status from host adapter itself,     e.g. COMMAND_COMPLETE
       C        : host_byte   = set by low-level driver to indicate status, e.g. DID_RESET
   D            : driver_byte = set by mid-level,                           e.g. DRIVER_SENSE


kernel/kernel-2.6.9/linux-2.6.9/include/scsi/scsi.h:
#define status_byte(result)     (((result) >>  1) & 0x1f)  {note: see scsi.h -- this is NOT the preferred method. See [A] below}
#define msg_byte(result)        (((result) >>  8) & 0xff)
#define host_byte(result)       (((result) >> 16) & 0xff)
#define driver_byte(result)     (((result) >> 24) & 0xff)

The following sections list the code values for each of the above four parts.


[A] SCSI Status Byte

The scsi status byte almost always comes from the storage target/device itself. The one exception is the DID_ERROR host status, which is set and returned by the driver when it detects an anomalous condition in the returned status. One common such condition is when the returned status from storage is success but the residual (untransferred data) byte count is non-zero: the command is simultaneously saying "it's all good" and "I didn't do everything you asked me to do". These contradict each other, and since the driver doesn't know what to make of the information it returns DID_ERROR, a driver internally detected error. The cause is still the same: something at the storage target/device itself.

Note that a value of 00 within this byte of the return code can mean either success or that the io was never sent to storage and the error is described elsewhere in the status word. For example, a host status byte of `DID_NO_CONNECT` means the host had no route/path to the specified storage target and was therefore unable to send the requested command. In that case the scsi status byte will hold the default value of 00, but that does not indicate a successful scsi command status.

From scsi.h:

  /*
   * SCSI Architecture Model (SAM) Status codes. Taken from SAM-3 draft
   * T10/1561-D Revision 4 Draft dated 7th November 2002.
   */

#define SAM_STAT_GOOD                        0x00
#define SAM_STAT_CHECK_CONDITION             0x02
#define SAM_STAT_CONDITION_MET               0x04
#define SAM_STAT_BUSY                        0x08
#define SAM_STAT_INTERMEDIATE                0x10
#define SAM_STAT_INTERMEDIATE_CONDITION_MET  0x14
#define SAM_STAT_RESERVATION_CONFLICT        0x18
#define SAM_STAT_COMMAND_TERMINATED          0x22  /* obsolete in SAM-3 */
#define SAM_STAT_TASK_SET_FULL               0x28
#define SAM_STAT_ACA_ACTIVE                  0x30
#define SAM_STAT_TASK_ABORTED                0x40

The above are defined more fully within the SCSI standards. The following is a synopsis of what these statuses mean:

Status                      Hex  Description
GOOD                        00   Target has successfully completed the command.
CHECK CONDITION             02   A contingent allegiance condition has occurred (see the
                                 sense buffer for more details).
CONDITION MET               04   Requested operation is satisfied.
BUSY                        08   The target is busy. Returned whenever a target is unable
                                 to accept a command from an otherwise acceptable initiator.
INTERMEDIATE                10   Shall be returned for every successfully completed command
                                 in a series of linked commands (except the last command).
INTERMEDIATE-CONDITION MET  14   Combination of the CONDITION MET and INTERMEDIATE statuses.
RESERVATION CONFLICT        18   The logical unit, or an extent (portion) within the logical
                                 unit, is reserved for another device.
COMMAND TERMINATED          22   The target terminated the current I/O process. This also
                                 indicates that a contingent allegiance condition occurred.
QUEUE FULL (TASK SET FULL)  28   Shall be implemented if tagged command queuing is supported.
                                 Indicates that the target's command queue is full.
ACA ACTIVE                  30   An auto contingent allegiance condition exists.
TASK ABORTED                40   The task was aborted by another initiator (I_T nexus).

All other codes are reserved. GOOD is the status we're hoping for; CHECK CONDITION, BUSY,
RESERVATION CONFLICT, and QUEUE FULL are the most common non-success statuses one might see;
the remaining statuses are rarely, if ever, seen.

Note: See below from scsi.h. The status_byte() macro listed above from this same file does a shift by 1 which, as stated in the same file, is NOT the preferred method. The SAM_STAT_* values showing the whole byte of the scsi command status are the preferred method within the code. However, there is still code that refers to just the embedded 5-bit field as the scsi status, which can cause confusion. See [SCSI Status Code, Sense Buffer (sense key, asc, ascq) Quick Reference](/knowledge/node/20391) for more information on the scsi status byte format.

From scsi.h: (the referenced SAM status codes are listed above)

 /*
  * Status codes. These are deprecated as they are shifted 1 bit right
  * from those found in the SCSI standards. This causes confusion for
  * applications that are ported to several OSes. Prefer SAM Status codes
  * above.
  */

#define GOOD             0x00
#define CHECK_CONDITION  0x01
#define CONDITION_GOOD   0x02
#define BUSY             0x04


 


[B] Message byte, msg_byte()

From scsi.h:

  :
  * MESSAGE CODES
  */

#define COMMAND_COMPLETE                0x00
#define EXTENDED_MESSAGE                0x01
#define   EXTENDED_MODIFY_DATA_POINTER    0x00
#define   EXTENDED_SDTR                   0x01
#define   EXTENDED_EXTENDED_IDENTIFY      0x02  /* SCSI-I only */
#define   EXTENDED_WDTR                   0x03
#define SAVE_POINTERS                   0x02
#define RESTORE_POINTERS                0x03
#define DISCONNECT                      0x04
#define INITIATOR_ERROR                 0x05
#define ABORT                           0x06
#define MESSAGE_REJECT                  0x07
#define NOP                             0x08
#define MSG_PARITY_ERROR                0x09
#define LINKED_CMD_COMPLETE             0x0a
#define LINKED_FLG_CMD_COMPLETE         0x0b
#define BUS_DEVICE_RESET                0x0c
#define INITIATE_RECOVERY               0x0f  /* SCSI-II only */
#define RELEASE_RECOVERY                0x10  /* SCSI-II only */
#define SIMPLE_QUEUE_TAG                0x20
#define HEAD_OF_QUEUE_TAG               0x21
#define ORDERED_QUEUE_TAG               0x22


 


[C] Host byte, host_byte()

From scsi.h:

  /*
   * Host byte codes
   */
 ____RHEL____
 5    6   7           Name                   Value     Description
 x    x   x   #define DID_OK                  0x00  /* NO error                                */
 x    x   x   #define DID_NO_CONNECT          0x01  /* Couldn't connect before timeout period  */
 x    x   x   #define DID_BUS_BUSY            0x02  /* BUS stayed busy through time out period */
 x    x   x   #define DID_TIME_OUT            0x03  /* TIMED OUT for other reason              */
 x    x   x   #define DID_BAD_TARGET          0x04  /* BAD target.                             */
 x    x   x   #define DID_ABORT               0x05  /* Told to abort for some other reason     */
 x    x   x   #define DID_PARITY              0x06  /* Parity error                            */
 x    x   x   #define DID_ERROR               0x07  /* Internal error                          */
 x    x   x   #define DID_RESET               0x08  /* Reset by somebody.                      */
 x    x   x   #define DID_BAD_INTR            0x09  /* Got an interrupt we weren't expecting.  */
 x    x   x   #define DID_PASSTHROUGH         0x0a  /* Force command past mid-layer            */
 x    x   x   #define DID_SOFT_ERROR          0x0b  /* The low level driver just wish a retry  */
 x    x   x   #define DID_IMM_RETRY           0x0c  /* Retry without decrementing retry count  */
 x    x   x   #define DID_REQUEUE             0x0d  /* Requeue command (no immediate retry) also
                                                     * without decrementing the retry count    */
 +[1]  x   x   #define DID_TRANSPORT_DISRUPTED 0x0e  /* Transport error disrupted execution
                                                     * and the driver blocked the port to
                                                     * recover the link. Transport class will
                                                     * retry or fail IO */
 +[1]  x   x   #define DID_TRANSPORT_FAILFAST  0x0f /* Transport class fastfailed the io        */
 +[2]  x   x   #define DID_TARGET_FAILURE      0x10 /* Permanent target failure, do not retry on
                                                    * other paths                              */
 +[2]  x   x   #define DID_NEXUS_FAILURE       0x11 /* Permanent nexus failure, retry on other
                                                    * paths might yield different results      */
 -    +[3] x   #define DID_ALLOC_FAILURE       0x12 /* Space allocation on device failed        */
 -    +[3] x   #define DID_MEDIUM_FAILURE      0x13 /* Medium error                             */

Notes: [1]RHEL 5: Added in RHEL 5.4 and later. [2]RHEL 5: Added in RHEL 5.8 and later. [3]RHEL 6: Added in RHEL 6.6 and later.

 


[D] Driver byte, driver_byte()

From scsi.h:

#define DRIVER_OK           0x00    /* Driver status */

/*
 * These indicate the error that occurred, and what is available.
 */

#define DRIVER_BUSY     0x01
#define DRIVER_SOFT     0x02
#define DRIVER_MEDIA    0x03
#define DRIVER_ERROR    0x04
#define DRIVER_INVALID  0x05
#define DRIVER_TIMEOUT  0x06
#define DRIVER_HARD     0x07
#define DRIVER_SENSE    0x08  /* sense buffer available from target about the event */

#define SUGGEST_RETRY   0x10
#define SUGGEST_ABORT   0x20
#define SUGGEST_REMAP   0x30
#define SUGGEST_DIE     0x40
#define SUGGEST_SENSE   0x80
#define SUGGEST_IS_OK   0xff

#define DRIVER_MASK     0x0f
#define SUGGEST_MASK    0xf0


Return Code Information

00.00.00.18 RESERVATION CONFLICT

0x00.00.00.18
           18   status byte : SAM_STAT_RESERVATION_CONFLICT - device reserved to another HBA, 
                                                              command failed
        00         msg byte : <{likely} not valid, see other fields>
     00           host byte : <{likely} not valid, see other fields>
  00            driver byte : <{likely} not valid, see other fields>



===NOTES===================================================================================

    o 00.00.00.18 RESERVATION CONFLICT
        . this is a scsi status returned from the target device

Example:

-------------------------------------------------------------------------------------------

reservation conflict and Error Code: 0x00000018  in messages for the lun
   . two types of reservations
        . reserve/release (typically tapes, not disks, exclusive between device and one hba)
        . persistent reservations (survives storage power cycle) (typically cluster fencing
          method for disks) shared reservation across multiple initiators
   . see How can I view, create, and remove SCSI reservations and keys?,
     to see if lun is reserved by another host

If there are a lot of "reservation conflict" messages in dmesg, that normally means the
scsi device the system tried to access is reserved by another node and cannot be accessed
at that time. To resolve the problem, verify that your application's device access
configuration is correct, and contact your application vendor for further assistance.

Also see: Why did I get a lot of "reservation conflict" messages in dmesg when the system booted up?


----------------------------------------------------------------------------------------------

Notes: had a case on a vm guest with reservation conflicts.


    $ sg_persist --in -k -d /dev/sdc
    $ sg_persist --in -r -d /dev/sdc

showed a clean device without PR reservations of any kind at the guest.  An sg_tur probably
would have helped to detect whether reserve/release was being used on these devices.  But also,
with virtualization layers it's possible they were injecting the reservation conflict,
especially since the return code was very odd:

return code = 0x00110018

The 0x11 host byte is undefined in RHEL prior to RHEL 5.8 (this was 5.6)... it probably came
from the 3rd-party hypervisor, so the 3rd-party hypervisor vendor should be contacted.

 
 

 

00.01.00.00 DID_NO_CONNECT

0x00 01 00 00
           00   status byte : {likely} not valid, see other fields
        00         msg byte : {likely} not valid, see other fields
     01           host byte : DID_NO_CONNECT - couldn't connect before timeout period 
                                               {possibly device doesn't exist}
  00            driver byte : {likely} not valid, see other fields

=== NOTES ===================================================================================

        + often means device is no longer accessible,
        + sometimes issued after burning through all retries
        + other times issued immediately if the target device is no longer connected
          to the san (for example, storage port disconnected from san)

Example:
kernel: qla2xxx 0000:0d:00.1: LOOP DOWN detected (2 5 0)
kernel: sd 0:0:0:1: SCSI error: return code = 0x00010000

Summary:
IO could not be issued as there is no connection to the device.

Description:
The IO command is being rejected with an error status of DID_NO_CONNECT. Either access to the
device is temporarily unavailable (as in this example, a LOOP or LINK DOWN condition has
occurred and we don't know when the connection will return), or the device is no longer
available within the configuration. This might happen, for example, if the storage is
reconfigured to no longer export that lun to this host. This status is not immediately
returned, but is returned after the timeout period has expired. In other words, the io is
queued up and ready to go but has no place to be sent.

More/Next Steps:
Review the messages files to see if there is an explanation of why the device isn't available.
LOOP and LINK DOWN events are explicit; RSCN processing that removes a lun or nport is less so.
If the DID_NO_CONNECT is being reported across multiple luns on an HBA, then there is likely an
unexpected storage-related issue present. If the status is only being returned against one lun,
and there are multiple paths to this lun across multiple HBAs, then the lun was likely removed
from the configuration. This has inadvertently occurred on shared storage in the past.

Additional information that is important includes: is this a permanent condition or does the
device return at some point? Does a reboot bring the device back? Are multiple devices
involved? Multiple HBAs? Multiple systems (that share the storage)? Is this a one-off event or
burst, is it continuous, or is it happening sporadically over time?

Developing a storage diagram of what systems are attached to which nports on the storage
controller and which luns are being exported to where can be useful in understanding the
complexity of the storage configuration. 

In some cases, actions being performed within the storage controller can cause this type of
behavior. Were there any maintenance or configuration changes underway at the time of the
events? There are also ways of tuning the system to be more tolerant of storage controllers
going away suddenly, such that the kernel is able to wait longer periods of time and ride
through SAN or controller perturbations.

Additional steps that could be taken would be to

1. Have the customer engage their storage vendor if there is no explanation and you are
   unable to recover access to the device.
2. Turn on additional kernel messages, such as additional scsi or driver extended logging.
3. Look into changing multipath and driver parameters, and gather baseline loading
   information if needed (`iostat`, `vmstat`, `top`, etc.), for example:
   - lpfc_devloss_tmo (earlier lpfc_nodev_tmo): increase the timeout the lpfc driver waits
     for a device to return.
   - no_path_retry: increase the number of retries that multipath waits for the path to
     return.

Recommendations:
Request step 1 and get the vendor case number posted within the ticket. We'll need this if/when
engaging the vendor from our side of things -- if it comes to that.

Recommend step 2 be implemented at the customer's discretion. It's often fairly lightweight in
terms of system impact, depending on what flags are set.

Request step 3 only if the storage vendor doesn't find anything and is looking for additional
information/cooperation on the issue. Sometimes issues are related to loading. That shouldn't
be the case for DID_NO_CONNECT, but if other avenues have been exhausted then this is something
to try.


 
 

 

00.02.00.00 DID_BUS_BUSY

0x00 02 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     02           host byte : DID_BUS_BUSY - bus stayed busy throughout the timeout period
                                             {transport failure likely}
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

        + Parallel SCSI buses: there is an actual bus busy signal/line on the bus. 
          While the bus is busy, the adapter cannot arbitrate for the bus
          to send out the next command. If the bus stays busy too long, this error
          results. For example, some older tapes were guilty of holding onto the bus 
          during long multi-megabyte transfers so as to not lose streaming, and 
          so this behavior would result while backups were running. Typically you 
          shouldn't be seeing this type of behavior though.

        + usually a retriable event, expectation is "bus" will become unbusy at some point 
          where "bus" means transport to device and not the device itself 

        + *if* the bus busy is associated with RSCN message processing (look for information
          in messages around the same time), then the bus busy condition may be related to 
          RSCN processing. The `lpfc` and `qla2xxx` drivers have a configuration option to 
          change the processing of RSCN messages which can, in some cases, prevent timeouts 
          and associated bus busy reporting. The options would be added to `/etc/modprobe.conf`
          and a new `initrd` created plus reboot in order to put these new options into effect.

          Note that these options may not be available on older distributions so verify that  
          they are available within the kernel and that the problem seems related to RSCN
          processing. The driver post 4.4 release should contain this option plus the enhanced
          RSCN processing code.  See BZ 213921 for more information

          options lpfc lpfc_use_adisc=1
          options qla2xxx ql2xprocessrscn=1

        + Some common SAN issues that could contribute to this status are a link that is down
          or still being recovered, port status change processing (for example, the RSCN
          processing above), low-level FC buffer protocol issues (not enough buffer credits
          to exchange frames), etc.  A bus busy from recovering link or port processing is
          the result of said processing taking longer than usual; the secondary result is
          that commands trying to be sent are blocked/stalled too long, resulting in
          "bus busy" status.  That is, "I couldn't send this command out because the bus was
          unavailable for too long a period of time".


Example:
kernel: SCSI error : <2 0 0 1> return code = 0x20000
kernel: Buffer I/O error on device sdd, logical block 3

Summary:
A bus busy condition is being returned back from the hardware which prevents the command from
being processed.

Description:
In the majority of cases there is some underlying SAN/storage issue causing this either 
directly or indirectly.

More/Next Steps:
1. review messages file for other events
2. collect iostat/vmstat/blktrace from the system; compare/contrast io rates while the problem
   isn't being reported vs when it is.  This only works well if the collected data is from all
   boxes connected to the shared storage ports.
3. have san storage support look at switch and hba statistics to see if any counters are incrementing

Recommendations:
Gather iostat information to see if the problem is load induced or load related.
Engage the san/storage support group to review the san.

 

00.02.00.08 DID_BUS_BUSY + {SCSI} BUSY

0x00 02 00 08
           08   status byte : SAM_STAT_BUSY - device {returned} busy {status}
        00         msg byte : {likely} not valid, see other fields;
     02           host byte : DID_BUS_BUSY - bus stayed busy throughout the timeout period
                                             {transport failure likely}
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================


        + since the scsi command status returned from the target is BUSY, this is 
          truly a device-busy problem and not a transport issue.  Sometimes a device goes 
          busy when the storage controller is reset, or if the device itself is undergoing 
          an internal reset and has not finished yet, or a raid volume rebuild is in progress
          and access is either blocked or slow.  Usually retries will ride through 
          these busy times and eventually complete ok.  If not, then the target device/storage 
          controller should be reviewed as to why this issue is occurring.

        + exception: the lpfc driver sets this particular status when the nport exists but is 
          not in the expected NLP_STE_MAPPED_NODE state.  Still a storage-side issue, but 
          setting lpfc logging for ELS and FCP would gather more information on when the port 
          changed state (ELS) and what additional information the response packet/command 
          back from the adapter had (FCP).  

The SAM_STAT_BUSY is the scsi status code back from the target (disk, device).

SCSI Status
Hex     Description
08      BUSY               Indicates the target is busy. Returned whenever a target is 
                           unable to accept a command from an otherwise acceptable initiator.

Example:
Aug 27 08:36:49 hostname kernel: lpfc 0000:07:00.0: 0:1305 Link Down Event x6 received Data: x6 x20 x80110
Aug 27 08:36:49 hostname kernel: sd 1:0:0:400: SCSI error: return code = 0x00020008
Aug 27 08:36:49 hostname kernel: end_request: I/O error, dev sdk, sector 3267384
Aug 27 08:36:49 hostname kernel: device-mapper: multipath: Failing path 8:160.
Aug 27 08:36:49 hostname multipathd: 8:160: mark as failed

Summary:

Description:

More/Next Steps:
1. If lpfc, setting additional logging via echo 4115 > /sys/class/scsi_host/host/lpfc_log_verbose
   might be useful.
   4115 = 0x1013 ; LOG_ELS, LOG_DISCOVERY, LOG_LINK_EVENT, LOG_FCP_ERROR


Recommendations:

 

00.04.00.00 DID_BAD_TARGET

0x00 04 00 00
              00   status byte : {likely} not valid, see other fields;
           00         msg byte : {likely} not valid, see other fields;
        04           host byte : DID_BAD_TARGET - bad target
     00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

        + Something about the behavior of this target is being flagged; need to 
          look into the driver, messages, and other data to determine specifically 
          why it's being flagged as a bad target.

        + This is a software detected hardware problem (data provided back from device
          is out of range, has conflicting information, or fails sanity checks)

        + For different drivers it can mean different things. For example:
          - iscsi sets this status if 
              . it receives a scsi CHECK CONDITION status back from the device but 
                with an invalid sense buffer size (the device is saying check the sense 
                data and then not providing proper sense data to look at)
              . it receives a scsi UNDERRUN or OVERRUN condition but the residual 
                byte count does not match expectations
          - megaraid sets this if 
              . the lun number is too high (higher than supported by the driver)
          - libata sets this if 
              . the device is no longer present (as if it got caught in hot-plug device removal)

Example:
kernel: sd 0:0:1:0: SCSI error: return code = 0x00040000
kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: raid1: Disk failure on sda3, disabling device.


Summary:
Target device is not usable.

Description:
This may be transitory, but if permanent there are usually other events logged in messages
at/near the same timeframe. The target device is no longer available to send commands to.

More/Next Steps:

Recommendations:
Engage storage h/w support, likely storage side issue present.


 

00.07.00.00 DID_ERROR - driver internal error


0x00 07 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     07           host byte : DID_ERROR - internal error
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

  + DID_ERROR - driver detected an internal error condition within the response data
    returned from storage hardware.

  + DID_ERROR is assigned within the driver when the returned data from the HBA
    doesn't make sense (like SUCCESS scsi status on io command, but residual byte count
    returned indicates not all requested data was transferred).

  + DID_ERRORs are often akin to software detecting that some type of hardware error
    is present.

  + How and where a specific driver detects such anomalies is driver type and version
    dependent.

  + Most places within the driver that set DID_ERROR are also covered by extended event
    logging, so turning on the driver's additional logging will often provide additional
    information as to the specific cause.
      o also, for FC HBAs, monitor the FC port statistics, if available.  See
        ["What do the fc statistics under /sys/class/fc_host/hostN/statistics mean"](https://access.redhat.com/site/articles/543493)

  + generally with FC drivers/HBAs, DID_ERROR is assigned when a hardware or san-based
    issue is present within the storage subsystem such that the fibre channel response
    frames received back from the hba at command completion time contain invalid or
    conflicting information in some way.  Some common examples of why DID_ERROR is
    returned:
      o The FC response frame indicates the response length field is valid; by the
        FC specification this means the length must be 0, 4, or 8 bytes, but the
        length field is not any of these allowed sizes.
      o The scsi protocol data includes a sense buffer and indicates the whole
        of the included scsi data is N bytes, but the FC "wrapper" indicates that it
        is carrying only X bytes of encapsulated protocol data, where X < N.  For
        example, the scsi data might provide a sense buffer length of 24 bytes,
        but the fibre channel frame indicates it is carrying only 8 bytes of total scsi
        data -- essentially an impossible condition for the driver to reconcile.
      o Other cases are similar:
          - a data overrun where there shouldn't be,
          - a data underrun where the hba's count of frames and data differs
            from the information within the response frames,
          - an invalid or unexpected status returned by the lpfc firmware,
          - a queue full condition where the underrun and residual byte counts weren't
            updated or set by the storage controller, so they don't agree,
          - an underrun condition specified in the status from storage where the residual
            byte count does not match up with the transferred byte count within the HBA's
            firmware,
          - an underrun condition detected due to a dropped frame (storage returned success,
            but the host didn't receive all the data transmitted to it by storage),
          - an overrun condition detected, where the HBA's transferred byte count exceeds
            the count within the storage FC response frame,
          - firmware-detected invalid fields within the FC response frame from storage,
            such as incorrect io entry order, count, parameter, and/or unknown status
            type,
          - the driver received information from the HBA firmware that doesn't match a
            known valid response.

  + Upon completion of a scsi command, storage communicates the command status back
    to the host.  In a fibre channel environment this is done by sending back an
    FC response frame.  The frame consists of two parts: the FC "wrapper" containing
    protocol-neutral information, and a payload within the frame containing SCSI
    protocol-specific data.  The scsi data includes the scsi status, sense buffer
    (if available), etc.
    o A common cause of DID_ERRORs is when the information within the FC "wrapper"
      conflicts with the information within the scsi status/sense information.
    o Check the specific driver source code as to why a specific instance
      of DID_ERROR is being returned.
    o To determine where within the driver the specific instance of DID_ERROR is
      occurring, driver extended event logging will need to be enabled.

Example:
Oct 28 13:33:24 hostname kernel: sd 2:0:0:2: SCSI error: return code = 0x00070000
Oct 28 13:33:24 hostname kernel: end_request: I/O error, dev sdc, sector 1010494514
Oct 28 13:47:23 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:47:23 hostname kernel: end_request: I/O error, dev sda, sector 7831441538
Oct 28 13:47:56 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:47:56 hostname kernel: end_request: I/O error, dev sda, sector 9141689458
Oct 28 13:48:06 hostname kernel: sd 2:0:0:2: SCSI error: return code = 0x00070000
Oct 28 13:48:06 hostname kernel: end_request: I/O error, dev sdc, sector 6798352378
Oct 28 13:48:34 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:48:34 hostname kernel: end_request: I/O error, dev sda, sector 4050283778

Summary:
The driver detected an anomalous condition within the returned completion information from
storage.

Description:
The driver is detecting an error condition with the io completion information and
setting DID_ERROR to signify that the io completion is suspect. Typically a DID_ERROR
indicates a h/w (hba/cable/switch/storage) side issue.

More/Next Steps:
Review the messages. Is the problem only occurring on specific scsi devices?
(All 0:0:: would indicate just one scsi host, but if it's happening on 0:0::
and, say, 1:0:: then it's happening on multiple scsi hosts.) How frequently?
Ultimately this is a h/w side issue, but which devices or device sets are affected
influences where the problem could be located on the h/w side of things.
Review the system's hardware, switch error counters, etc. to see if there is any
indication of where the issue might lie. The most likely candidate is the hba itself.

Recommendations:

  1. engage storage vendor support
  2. check switch error counters
  3. monitor host side HBA error counters
  4. turn on extended driver logging, if available; most drivers that set DID_ERROR,
     like lpfc and qla2xxx, have extended logging in and around most (but not all!)
     places within the driver that set DID_ERROR.

The driver is reporting that it is receiving odd/unexpected/invalid information
from the hba. This generally indicates an issue within the SAN(external to the OS).

Review the system's hardware, switch error counters, etc. to see if there is
any indication of where the issue might lie. The most likely candidate is the hba
itself. Were you able to replace the HBA? Could also be a bad GBIC or cable.

If desired, driver extended logging can be enabled. It won't log additional information
in all cases, but in most it will. The data provided may not yield additional insight
into the problem, but if you wish to enable it anyway:

+ To turn on extended driver logging for the qla2xxx driver, see
  <a href="https://access.redhat.com/knowledge/articles/337813"> "[Troubleshooting] How do I turn on additional qla2xxx or qla4xxx driver extended logging and what logging is available?"</a>

+ To turn on extended driver logging for the lpfc driver, use flag value 64 (LOG_FCP) and see
  <a href="https://access.redhat.com/knowledge/articles/337853"> "[Troubleshooting] How do I turn on additional lpfc driver extended logging and what logging is available?"</a>

NOTE: this will generate a LOT of logging, most of it normal activity that is completely
unrelated to any problem, and the information logged isn't easy to decode in a
straightforward way. As a result, the recommendation is to not enable this logging unless
absolutely necessary or recommended within a support case.
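As a quick aid when reviewing such messages, the four fields of any return code can be split with shell arithmetic, mirroring the shifts in the scsi.h macros quoted earlier. A minimal sketch (the 0x00070000 value is taken from the example messages above):

```shell
#!/bin/sh
# Split a SCSI return code into driver/host/msg/status bytes,
# mirroring driver_byte()/host_byte()/msg_byte() from
# include/scsi/scsi.h (the raw low byte is shown for status).
rc=$(( 0x00070000 ))    # return code from the example messages above

printf 'driver_byte = 0x%02x\n' $(( (rc >> 24) & 0xff ))
printf 'host_byte   = 0x%02x\n' $(( (rc >> 16) & 0xff ))   # 0x07 = DID_ERROR
printf 'msg_byte    = 0x%02x\n' $(( (rc >>  8) & 0xff ))
printf 'status_byte = 0x%02x\n' $((  rc        & 0xff ))
```

For 0x00070000 this shows host_byte = 0x07 (DID_ERROR) and zeros in the other fields.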

 

00.07.00.28 SAM_STAT_TASK_SET_FULL + DID_ERROR - queue full condition


0x00.07.00.28
           28   status byte : SAM_STAT_TASK_SET_FULL - essentially a target queue full condition
        00         msg byte : <{likely} not valid, see other fields>
     07           host byte : DID_ERROR - internal error
  00            driver byte : <{likely} not valid, see other fields>

===NOTES===================================================================================

o 00.07.00.28 SAM_STAT_TASK_SET_FULL (storage device queue full condition) + 
      +------ DID_ERROR

Example:

kernel: sd 7:0:3:42: Unhandled error code
kernel: sd 7:0:3:42: SCSI error: return code = 0x00000028
kernel: Result: hostbyte=HOST_OK driverbyte=DRIVER_OK,SUGGEST_OK

Summary:
The key issue is the queue full condition. The DID_ERROR is likely secondary to the queue
full. The queue full condition is being returned by storage, indicating that it is not
able to handle any additional io requests at that time. If the queue fulls only happen
occasionally, they can be safely ignored. The kernel will retry the io. However, if they
happen frequently, or frequently enough that they impact the system, then they need to
be addressed.

Description:
DID_ERRORs are assigned by the driver upon detection of conflicting status/information
returned from storage (for example, successful io completion but a non-zero residual byte
count: the read or write completed "successfully" but didn't read or write all the
data -- to which the driver goes "huh?!" and sets DID_ERROR).

The queue full condition means too many io commands have arrived at the storage device,
exceeding the queue limit within storage. IO could end up being dropped, resulting in
timeouts, and in general it's a red flag from storage that the connected host or hosts
are overdriving it.

The lun queue_depth (see /sys/block/sd*/device/queue_depth) is on a per lun basis. If
a lot of luns are exported from storage to this host, or the total number of luns
exported to the set of hosts that share the storage is large, then this can lead to
queue full conditions when all luns become highly active. Essentially the lun
queue_depth is set too high for the number of luns exported from storage to the host
vs the activity level of the system.

Shared storage servicing multiple hosts can increase the likelihood of this type of
error status.

The retry for io returning queue full (or busy) is a delayed retry, to try and let
storage recover.

More/Next Steps:
Examine the current value of queue_depth within /sys/block/sd*/device/queue_depth.
Determine the number of luns from storage to this host (if non-shared storage) or the
total number of luns exported to all hosts. Reduce the queue_depth.
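The examination step can be scripted; a sketch that lists the queue_depth for every sd device. (The sysfs root is taken as a parameter only so the loop can be exercised against a test tree; the default /sys is the real location.)

```shell
#!/bin/sh
# Print the per-lun queue_depth for every sd device.
# $1 lets a different root be supplied for testing; /sys is the default.
root=${1:-/sys}
for f in "$root"/block/sd*/device/queue_depth; do
    [ -e "$f" ] || continue        # glob matched nothing: no sd devices
    printf '%s %s\n' "$f" "$(cat "$f")"
done
```

Compare the values shown against the number of luns and the activity level of the system when judging whether the depth is too high.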

Recommendations:

  1. Reduce the lun queue depth to avoid storage getting into a queue full condition.
  2. Examine and possibly reduce the lun queue depth on other hosts sharing the
    same storage depending on load.

For the DID_ERROR:

NOTE: DID_ERROR can be set due to storage returning a queue full condition but not
setting the rest of the response frame information correctly. This can result in the
driver detecting a mismatch in the response frame data, resulting in a DID_ERROR.

If the DID_ERROR needs further examination, either before or after addressing the
queue full condition within storage, then you can try the following steps.

  1. Turn on extended driver logging to try and ascertain which specific DID_ERROR is
    being triggered. {See additional information, DID_ERROR may be a consequence of
    queue full plus storage not fully setting up the response frame in fibre channel
    environments.}
  2. review messages file after queue full/did error is logged.

Also check the driver for updates within processing of storage queue full handling and
firmware detected underrun conditions. If the qla2xxx driver, see bugzilla 805280.
Other FC drivers might need the same storage workaround implementation.

NOTE: even with the driver workaround to prevent DID_ERRORs and immediate retries,
without lowering lun queue_depth the system can still encounter QUEUE FULL conditions.
Storage may not be able to recover once in a queue full condition depending on the
number of hosts sharing the storage. The only sure cure is to lower lun queue depth
to avoid over driving storage in the first place.

 

00.0D.00.00 DID_REQUEUE


0x00.0D.00.00
           00   status byte : <{likely} not valid, see other fields>
        00         msg byte : <{likely} not valid, see other fields>
     0D           host byte : DID_REQUEUE -  Requeue command (no immediate retry), also
                                             without decrementing the retry count
                                             {RHEL5/RHEL6 only}
  00            driver byte : <{likely} not valid, see other fields>

===NOTES===================================================================================

      o 00.0D.00.00 DID_REQUEUE
            + From the original upstream patch that added DID_REQUEUE:
              "We have a DID_IMM_RETRY to require a retry at once, but we could do with
               a DID_REQUEUE to instruct the mid-layer to treat this command in the
               same manner as QUEUE_FULL or BUSY (i.e. halt the submission until
               another command returns ... or the queue pressure builds if there are no
               outstanding commands)."
            + So, REQUEUE is just essentially a delayed retry... rather than immediately
              resubmitting the io, the io is requeued onto the request queue and has to
              drain down to the driver and out to storage only after some current
              outstanding io completes.

Aug 1 08:10:02 hostname kernel: sd 1:0:0:5: SCSI error: return code = 0x000d0000

We advise applying the following errata, fixed in 5.6 and later:
https://access.redhat.com/errata/RHSA-2011:0017.html

This is a known issue and further described in the following BZ when using lpfc driver.
Bug 627836 - retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath

The kbase article which explains the issue:

0x000d0000 (DID_REQUEUE) SCSI error with Emulex/LPFC driver on RHEL 5

Bug 516303 [Emulex 5.7 bug] lpfc: setting of DID_REQUEUE conditions
Bug 627836 retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath

 

00.0F.00.00 DID_TRANSPORT_FAILFAST


0x00 0F 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     0F           host byte : DID_TRANSPORT_FAILFAST - transport class fastfailed the io
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

    + There were transport issues resulting in an inability to communicate or
      send io commands to the target.  For example, within an FC/SAN environment,
      the remote port the io is to be sent to is in the BLOCKED rather than ONLINE
      state.

    + recommend check for link down or high error rates on fibre/parallel scsi.
      network links (iscsi) as these are common causes for this error being
      returned.  That is, this error is hardware status based.

{Transport failfast will only occur if the option is turned on. Normally transport issues
are DID_TRANSPORT_DISRUPTED or similar -- and as such are retryable io... but here the
retries are suppressed in the interest of getting the io back up the io stack quickly,
usually because there is another path to try, or lvm mirroring is in use and another
mirror can be rapidly tried. The failfast option is enabled by users; it is not on by
default.}

Example:
Dec 7 16:08:38 hostname kernel: sd 1:0:1:0: Unhandled error code
Dec 7 16:08:38 hostname kernel: sd 1:0:1:0: SCSI error: return code = 0x000f0000
Dec 7 16:08:38 hostname kernel: Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK,SUGGEST_OK
Dec 7 16:08:38 hostname kernel: device-mapper: multipath: Failing path 8:96.
Dec 7 16:08:38 hostname kernel: sd 1:0:1:0: Unhandled error code
Dec 7 16:08:38 hostname kernel: sd 1:0:1:0: SCSI error: return code = 0x000f0000
Dec 7 16:08:38 hostname kernel: Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK,SUGGEST_OK

Summary:
The FAILFAST option is enabled, so if/when an error is encountered the io is immediately
returned up the io stack rather than being retried (as it normally would be).

Description:

More/Next Steps:
Gather extended logging information from driver.
Does the problem follow the HBA if it is swapped with another one?
Turn on driver debug/verbose logging to see if more specific information is available.

Recommendations:
Engage storage vendor support, likely storage side issue present.


FAILFAST is set within the transport handlers within the kernel as seen here:

C symbol: DID_TRANSPORT_FAILFAST

File Function Line
0 libiscsi2.c <global> 1453 sc->result = DID_TRANSPORT_FAILFAST << 16;
1 scsi_transport_iscsi2.c iscsi2_session_chkready 351 err = DID_TRANSPORT_FAILFAST << 16;
2 scsi_transport_fc.h fc_remote_port_chkready 689 result = DID_TRANSPORT_FAILFAST << 16;

For example within the fc transport code:

/**
 * fc_remote_port_chkready - called to validate the remote port state
 *   prior to initiating io to the port.
 * @rport: remote port to be checked
 *
 * Returns a scsi result code that can be returned by the LLDD.
 **/
static inline int
fc_remote_port_chkready(struct fc_rport *rport)
{
	int result;

	switch (rport->port_state) {
	:
	case FC_PORTSTATE_BLOCKED:
		if (rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
			result = DID_TRANSPORT_FAILFAST << 16;
		else
			result = DID_IMM_RETRY << 16;
		break;

So, if the portstate is blocked and FAILFAST is off, the code does immediate retries but
if the option is enabled, then a FAILFAST status is returned and the io is not sent.
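Whether the fast fail timer is armed on each FC path can be read from the standard fc transport attribute in sysfs; a value of "off" means FC_RPORT_FAST_FAIL_TIMEDOUT is never set, so this status should not appear on that path. A sketch (the sysfs root is parameterized only so the loop can be tested; /sys is the real location):

```shell
#!/bin/sh
# Show fast_io_fail_tmo for each FC remote port; "off" means the
# transport will never mark the rport FC_RPORT_FAST_FAIL_TIMEDOUT.
root=${1:-/sys}
for f in "$root"/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do
    [ -e "$f" ] || continue        # no FC remote ports present
    printf '%s %s\n' "$f" "$(cat "$f")"
done
```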

 

00.11.00.18 RESERVATION CONFLICT + DID_NEXUS_FAILURE


0x00.11.00.18
           18   status byte : SAM_STAT_RESERVATION_CONFLICT - device reserved to another HBA,
                                                              command failed
        00         msg byte : <{likely} not valid, see other fields>
     11           host byte : {RHEL5/6} DID_NEXUS_FAILURE - permanent nexus failure, retry on
                                                            other paths may yield different
                                                            results
  00            driver byte : <{likely} not valid, see other fields>

===NOTES===================================================================================

o 00.11.00.00 DID_NEXUS_FAILURE             - added into scsi kernel code in 5.8
o 00.00.00.18 SAM_STAT_RESERVATION_CONFLICT - device reserved on another I_T nexus

  + the primary issue is the reservation conflict
  + a reservation conflict will not get better with retries, so the 
    DID_NEXUS_FAILURE is added to the io return code as a result of the
    primary problem: the reservation conflict
  + DID_NEXUS_FAILURE will prevent retrying this command on the current
    path.  Since reservations are typically I_T nexus specific, a different
    path with different I_T (initiator, aka hba/target aka storage port combo) 
    may succeed.  However, the command stands no chance of completing on the
    current path due to the reservation conflict.
  + recommend having customer check configuration to determine what has an
    outstanding reservation on the device.  For example, a failure within
    3rd party tape backup may have left an errant reservation on a tape drive
    (reservation conflicts often most associated with tape devices)

Example:

st2: Error 110018 (sugg. bt 0x0, driver bt 0x0, host bt 0x11).

Summary:

Red Hat Enterprise Linux does not use scsi reservations itself, so they are being applied
by an application or, if running under a hypervisor such as VMware, possibly by the
hypervisor and/or cluster management software.

A reservation conflict scsi status has been returned by the storage device. The kernel
has also marked the io with DID_NEXUS_FAILURE to prevent retrying the io on the current
path.

Description:

More/Next Steps:

  • customer needs to check the configuration and determine why the reservation conflict
    exists and how to correct it. Reservations are typically associated with 3rd party
    applications.
  • can run sg_turs to see what the status is; often the reservation conflict will show up
    as a return status:
      ls -1c /dev/sd*[!0-9] | sort | xargs -I {} sg_turs -vv {}               << disks only
      ls -1c /dev/sg* | sort | xargs -I {} sg_turs -vv {}                     << all scsi devices
  • can attempt to see who holds the reservation using the sg3_utils command sg_persist:
      ls -1c /dev/sd*[!0-9] | sort | xargs -I {} sg_persist --in -vv -k -d {} << disks only
      ls -1c /dev/sg* | sort | xargs -I {} sg_persist --in -vv -k -d {}       << all scsi devices

Recommendations:

  • customer side issue, needs to review system, applications, and storage to ascertain
    why reservations are present.

DID_NEXUS_FAILURE was added in RHEL5.8 and is present there and in later releases.

In 6.4, DID_NEXUS_FAILURE only shows up in a couple of places.

File Function Line
0 scsi.h 433 #define DID_NEXUS_FAILURE 0x11
1 scsi_error.c scsi_decide_disposition 1585 set_host_byte(scmd, DID_NEXUS_FAILURE);
2 scsi_lib.c __scsi_error_from_host_byte 687 case DID_NEXUS_FAILURE:
3 virtio_scsi.c virtscsi_complete_cmd 148 set_host_byte(sc, DID_NEXUS_FAILURE);

The primary code for the case of reservation conflict/nexus failure is here:

1 scsi_error.c scsi_decide_disposition 1585 set_host_byte(scmd, DID_NEXUS_FAILURE);


    case RESERVATION_CONFLICT:
            sdev_printk(KERN_INFO, scmd->device,
                        "reservation conflict\n");
            set_host_byte(scmd, DID_NEXUS_FAILURE);
            return SUCCESS; /* causes immediate i/o error */

In other words, if we get a reservation conflict, add DID_NEXUS_FAILURE before returning
the io up the io stack. The nexus failure causes an immediate io error and prevents
retries on the current device (path), as these will fail until the existing reservation
is removed.

 

06.00.00.00 DRIVER_TIMEOUT


0x06 00 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields;
  06            driver byte : DRIVER_TIMEOUT

=== NOTES ===================================================================================

    + Commands timing out.  Unless these are continuous, the likelihood is that retries
      succeeded.  If that is the case (sporadic timeouts logged), then this is not a
      major issue.

    + Storage didn't complete the io within the currently set timeout period.

    + If timeouts are followed by lun, target, bus, or adapter resets then storage
      issue is fairly serious and storage hardware vendor should be engaged.  Essentially
      these steps are taken when communication to storage is failing for some reason.

    + Different likely causes depending on how many and how often timeouts are being
      logged.  For example, groups or bursts of these across multiple devices 
      indicates a likely storage controller periodic overload -- questions to 
      answer: is storage shared across multiple hosts or just used by this one system 
      (this is not lun sharing, but sharing of the storage nports themselves... something 
      many sysadmins may not know the answer to without engaging their san storage folks).
      Collect iostat/vmstat and maybe blktrace data and review/compare/contrast 
      data before these are logged to during being logged.  Look at driver queue 
      depth -- is storage likely being overdriven?  Engage storage vendor to 
      look at in-controller statistics during reported timeout periods.  The key 
      here is that the timeouts are sporadic and come in bursts... if the bursts 
      are within hours to days or weeks between that implies a controller overdriven issue 
      to be investigated.  Reducing the lun queue depth within the driver can be 
      one way of determining if overdriven storage queues is responsible... but 
      this can impact both latency and throughput which is why a good baseline of 
      iostat/vmstat/blktrace data is important.  Also, grab top -- is there an 
      application that is always being run when the problem occurs vs not occurring? 

    + It's possible the device no longer exists on the SAN but for some reason an 
      RSCN or other notification isn't happening.  In this case the timeouts will 
      be constant against a given device or set of devices.  The timeout will occur 
      anytime the device is accessed.

    + It's possible that virtual nport mapping is in play and the switch has lost 
      the mapping.  For example, Cisco switches have the capability to map hba 
      nport 'ABC' to 'XYZ' so that the storage controller always sees commands 
      coming from 'XYZ' and not 'ABC'.  This allows replacement of an hba, 
      with a new nport identifier say of 'DEF'.  With virtual port 
      mapping you only need to update the map in the switch and not have to go 
      to all the storage controller ports and update them to accept commands from 
      'DEF'... the switch just remaps 'DEF' to 'XYZ' - done.  However, there have 
      been cases where the map is lost during maintenance cycles, map merges being 
      the most common place.  The result is the commands are sent, but without the 
      port mapping, the data/status/results are dropped by the switch resulting 
      in constant timeouts.

Example:
kernel: SCSI error : 0 0 0 4; return code = 0x6000000

Summary:
Command has timed out.

Description:
The IO command was sent to the target, but neither a successful nor unsuccessful status
has been returned within the expected timeout period.

More/Next Steps:
This is an error being returned by the driver, but it's not the driver's fault. In
99% of cases this is a storage related issue. The command was sent, but no response
was received back from storage within the allotted time (io timeout period).

The default device timeout period is controlled by udev rules,
50-udev.rules::
"
# sd:      0 TYPE_DISK, 7 TYPE_MOD, 14 TYPE_RBC
# sr:      4 TYPE_WORM, 5 TYPE_ROM
# st/osst: 1 TYPE_TAPE
# sg:      8 changer, [36] scanner
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="1", RUN+="/bin/sh -c 'echo 900 > /sys$$DEVPATH/timeout'"
"

...so 60s for disks, cdroms, dvd, and 900s for tapes. For any other devices the default
is 30s (as applied within the kernel at initial device setup). Other udev rules or changes
to the above can alter the default timeout value.
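The effective per-device values can be confirmed directly. A sketch that prints the current io timeout for every sd device (grep prefixes each value with its sysfs path; -H forces the prefix and is supported by grep on RHEL):

```shell
#!/bin/sh
# Print the current io timeout (in seconds) for every sd device.
# Output is one "path:value" line per device.
grep -H . /sys/block/sd*/device/timeout 2>/dev/null
```

A device showing 60 got the udev disk rule above; anything still at 30 took the kernel default.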

Command timeouts are not necessarily bad, especially if they only happen occasionally.
They become a problem when they occur more often, or systemically, such that they impact
users.

Different likely causes are dependent upon how many and how often this status is returned.
For example, groups or bursts of these across multiple devices indicates a likely storage
controller periodic overload or other storage controller issue (recovery, configuration,
loads from other attached systems) or possibly temporary san/switch congestion. Some
questions to answer include:

    * is the storage shared with other systems or just used by this system (this is not 
      lun sharing, but sharing of the storage nports themselves... something many customers 
      won't necessarily know the answer to without engaging their san storage support
      folks).
    * has anything been changed within the system or storage lately?
    * how long has the problem been going on?
    * what is its frequency?
    * what effect does this have on your system and its users? 

Attempting a dd command to the sd device(s) in question is one way to ascertain whether
the device is dead-dead (aka possibly deleted in storage, or virtual port mapping issues)
or just mostly or occasionally dead (congestion). If dead-dead, then no io commands will
complete.

Rescan storage or use sg_luns to determine if luns are still present/available.

Recommendations:

  1. check the io timeout value, is it set too short?

  2. is lun queue depth set too high for the storage configuration? reduce for testing.

  3. have the customer engage their storage vendor,

  4. gather baseline data from when the problem is and is not occurring, baseline loading
    information may need to be gathered (iostat, vmstat, top)

  5. turn on additional kernel messages such as extended scsi and/or driver logging, and/or

  6. use a test load script and gather data while trying to induce and reproduce the issue

  7. check hba port statistics in sysfs and have storage admin check similar counters within
    switch.

  1. io timeout value
    Check current timeout values, and if it seems appropriate, increase the timeout value
    to allow storage enough time. Typical io timeout values are 30 or 60 seconds. If
    increasing the timeout value beyond 120 seconds, make sure the task stall logic timer
    is also reset (its default is 120 and if io is allowed to timeout at, say, 150s, then
    false positives can be generated. Nominally setting task stall detect at 2x io timeout
    is a good place to start.) Timeout values in excess of 120 seconds are sometimes used
    within virtual machine environments to compensate for increased latency on shared
    platform hardware.

For example, if the current io timeout is set to 20s, then
setting it 60s may provide relief of temporary storage congestion issues:

echo 60 > /sys/block/sdX/device/timeout
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

  2. lun queue depth
    The default lun queue depth, typically 30-32 for fibre channel storage adapters, can be
    too high for the storage configuration if there are a lot of luns configured and/or the
    storage is shared with other hosts.

Some lun queue depths can be set on-line via /sys. Note in the following example the
queue_depth has read/write access allowing it to be set without reloading the driver.

ls -l /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth

-rw-r--r--. 1 root root 4096 Aug 13 09:55 /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth

cat /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth

32

echo 16 > /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth

cat /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth

16

However, the preferred method is to set the default within the driver via config options files
so the value is picked up at boot time. See
"What is the HBA queue depth, how to check the current queue depth value and how to change the value?"
for instructions on changing lun queue depth on lpfc and qla2xxx drivers.
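When a lun is reachable over several paths or many luns need the same setting, the on-line method can be looped over every sd device. A sketch (assumption: every sd device on the system should share the new depth; the sysfs root is a parameter only so the loop can be tested):

```shell
#!/bin/sh
# Apply one queue_depth value to every sd device.  The echo takes
# effect immediately but does not persist across reboots; set the
# driver default via its config file for a permanent change.
depth=${1:-16}
root=${2:-/sys}
for f in "$root"/block/sd*/device/queue_depth; do
    [ -w "$f" ] || continue     # skip missing or read-only entries
    echo "$depth" > "$f"
done
```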

  3. vendor support
    Ultimately the storage hardware vendor may need to be engaged to ascertain the root cause
    of io timeouts if they continue or are frequent. If a hardware vendor ticket is opened,
    post the ticket number from the vendor in any Red Hat case to allow us to engage with the
    vendor if need be.

  4. baseline
    If the problem is not continuous, the baseline data of iostat/vmstat/top will be needed
    if further analysis is desired or possible. See "[Troubleshooting] Gathering system baseline resource
    usage for IO performance issues"
    for a script that can be used to gather system
    resource information for review. Data should be gathered and submitted both from time
    periods when the problem is not occurring and when it is. Typically about 1 hour of
    data each is a reasonable amount to review.

  5. turn on extended logging
    This step can be used if the storage vendor doesn't find anything and is looking for additional
    information/cooperation on the issue. Typically extended logging for timeout issues won't
    reveal much more in terms of information other than the io timed out within storage.
    See one of the following for appropriate information/instructions:

  6. test load
    Induce a high io load similar to ones observed while the problem is happening from data
    collected in "4. baseline" above. Collect iostat/vmstat/top data for review. This is
    pretty much a last step attempt at inducing the problem to allow studying the issue
    and what triggers it plus if any steps above, like reducing lun queue depth, mitigates
    the issue. Most times it can be difficult to actually induce timeouts within storage
    via a simple load increase.

  7. check port statistics
    The hba port statistics, if supported by the driver/hba, are available in
    /sys/class/fc_host/host*/statistics/*. See "What do the fc statistics under /sys/class/fc_host/hostN/statistics mean" for details.

 

06.00.00.08 DRIVER_TIMEOUT + {SCSI} BUSY


0x06 00 00 08
           08   status byte : SAM_STAT_BUSY - device {returned} busy {status};
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields; 
  06            driver byte : DRIVER_TIMEOUT

=== NOTES ===================================================================================

o 06.00.00.00 DRIVER_TIMEOUT
o 00.00.00.08 SAM_STAT_BUSY - device {returned} busy {status}

  + commands are timing out due to scsi busy status being returned.
  + typically returned by storage target (scsi busy status), but in some
    cases it is "manufactured" by the driver because of hba-based issues.
    o for example, the mptscsih returns this both if the device returns
      this status OR the hba returns busy or insufficient resources (too busy
      with other stuff).  Either of these cases are very likely hardware
      based issues either in the storage device itself, or induced hba 
      issue due to storage device(s) or transport, or in the hba.
  + recommend checking hardware
  + recommend reviewing driver specifics as to if/where SAM_STAT_BUSY is assigned
    as status (typically not done, but can be driver specific).

Example:
kernel: sd 0:0:1:0: timing out command, waited 360s
kernel: sd 0:0:1:0: SCSI error: return code = 0x06000008
kernel: end_request: I/O error, dev sdb, sector 10564536
kernel: Aborting journal on device dm-4.
kernel: ext3_abort called.
kernel: EXT3-fs error (device dm-4): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only

Summary:
The SAM_STAT_BUSY is the scsi status code back from the target (disk, device). Typically
a device busy should be a transitory issue that corrects itself in time. If it doesn't,
then essentially this is a device timeout condition -- the device failed to respond within
a reasonable amount of time over some number of retries.

Description:
From the SCSI spec:

Hex  Description
08   BUSY. Indicates the target is busy. Returned whenever a target is
     unable to accept a command from an otherwise acceptable initiator.

So, in this case we're getting back constant device busy status from storage and cannot
get the command completed, eventually giving up with a timeout condition.

More/Next Steps:
The above is very common: a filesystem journal write couldn't complete, so the filesystem
has no choice but to remount itself read-only to protect filesystem integrity. The
timeout in this case was 360s, or 6 minutes! This is no short-term congestion issue but
some type of non-responsive hardware issue -- need to check hardware, which is a common
theme for any timeout issue.

Recommendations:
Engage storage h/w support to determine cause of device busy/io timeouts.

If the problem is logged rarely and the system continues, then these can mostly be
ignored. The issue is due to a temporary congestion issue which clears quickly.

If they are frequent to the point that io fails due to constant device busy status, or
the busy status is being logged frequently, then engage your storage hardware support group
to determine root cause of storage hardware returning device busy status. If the issue
is due to storage load, reducing the lun queue depth may provide some relief by lowering
the overall io load placed on storage by the host.

 

06.0D.00.00 DRIVER_TIMEOUT + DID_REQUEUE


0x06.0D.00.00
           00   status byte : <{likely} not valid, see other fields>
        00         msg byte : <{likely} not valid, see other fields>
     0D           host byte : DID_REQUEUE -  Requeue command (no immediate retry), also
                                             without decrementing the retry count
                                             {RHEL5/RHEL6 only}
  06            driver byte : DRIVER_TIMEOUT

===NOTES===================================================================================

o 06.0D.00.00 DRIVER_TIMEOUT + REQUEUE
    + Commands are timing out and being requeued for retry.  REQUEUE is slightly
      different than an immediate retry in that the requeue means the io can be
      delayed before it is retried.  Also, REQUEUE does not decrement the retry
      count (so this doesn't get counted against the total maximum retries).
    + typically transient in nature; if not happening a lot or constantly, then
      io is succeeding after being requeued.
    + see <a href="#06.00.00.00">0x06000000</a> and/or <a href="#00.0D.00.00">0x000D0000</a> for more
      information on timeouts and requeue, respectively.

0x060d0000 means driver timeout with requeue. If there are no subsequent errors for the same
device and they're running RHEL5.6 or later, then the i/o will have succeeded with a retry.

Example:
kernel: sd 3:0:0:15: timing out command, waited 30s
kernel: sd 3:0:0:15: SCSI error: return code = 0x060d0000 << return code.
kernel: Result: hostbyte=DID_REQUEUE driverbyte=DRIVER_TIMEOUT,SUGGEST_OK

Summary:
Command has timed out, being requeued rather than immediately retried.

Description:
The IO command was sent to the target, but neither a successful nor unsuccessful status
has been returned within the expected timeout period. If no subsequent events for the
same device are logged, then the io has succeeded upon being retried.

More/Next Steps:
This is an error being returned by the driver, but it's not the driver's fault. In
99% of cases this is a storage related issue. The command was sent, but no response
was received within the allotted time.

Additional steps that could be taken would be to

  1. check the io timeout value, is it set too short?
  2. have the customer engage their storage vendor,
  3. gather baseline data from when the problem is and is not occurring,
  4. turn on additional kernel messages such as extended scsi and/or driver logging, and/or
  5. baseline loading information may need to be gathered (iostat, vmstat, top)

Recommendations:

Check current timeout values and, if it seems appropriate, increase the timeout value
to allow storage enough time. Typical io timeout values are 30 or 60 seconds. If
increasing the timeout value beyond 120 seconds, make sure the task stall logic timer
is also adjusted (its default is 120s, and if io is allowed to time out at, say, 150s, then
false positives can be generated. Nominally, setting task stall detect at 2x the io timeout
is a good place to start.) For example, if the current io timeout is set to 20s, then
setting it to 60s may provide relief from temporary storage congestion issues:

echo 60 > /sys/block/sdX/device/timeout
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

Request step 2 and get the vendor case number posted within the ticket. We'll need this
when engaging the vendor from our side of things -- if it comes to that.

Request step 3 if the problem is not continuous; the baseline data of iostat/vmstat/top
will be needed to make further analysis possible.

Request step 4 only if the storage vendor doesn't find anything and is looking for
additional information/cooperation on the issue.

Offer step 5, as a means of collecting baseline data against a known load condition.

 

08.00.00.02 DRIVER_SENSE + {SCSI} CHECK_CONDITION


0x08 00 00 02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields;
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}

===NOTES===================================================================================

    + status indicates command was returned by the target (disk in this case),
      with a scsi CC (Check Condition) status byte.  The driver byte indicates
      that a sense buffer is available for this command and should be consulted
      for the sense key as well as asc/ascq information from the target that will
      provide more information on why the io wasn't completed
      successfully.  Often this sense buffer information is already decoded and
      output within the messages file.  For example an 04/02 asc/ascq combination
      means "Not Ready, manual intervention required" and can show up within
      the messages this way (already interpreted).  The sense key may also
      be interpreted and displayed, as in:
          "kernel: sdas: Current: sense key: Aborted Command"
      The important thing to note is that this information is coming from the
      storage target vs from the kernel or its driver.

    + sense buffer includes three key pieces of information:
        - sense key
        - additional sense code (ASC)
        - additional sense code qualifier (ASCQ)
      The codes within these three fields are defined by the scsi standard, although
      some value ranges are reserved for vendor-unique/specific
      codes.  This can make interpretation of the data difficult.

    + the '02' is the scsi status returned from the target device; within the sense
      buffer is additional information that should indicate what condition is
      being reported.
            # sense key: Aborted Command, if this is reported then the sense
                         key within the sense buffer is Bh
            # sense key: Unit Attention, if this is reported then the sense key
                         within the sense buffer is 6h and there will be additional
                         information within the sense buffer asc/ascq fields that
                         will be reported in the messages log file.
            # See https://access.redhat.com/articles/363594 for more information
            on sense keys and sense buffers in general.
            # If the sense key was aborted command, then this means that the
            target aborted the command based upon a request from the initiator.
            This is not necessarily an error condition that you need to be concerned
            about -- but you do need to ascertain why the kernel was requesting
            the command to be aborted.  Common reasons include

            * temporary loss of transport, for example link down/link up --
                 upon returning, all outstanding commands are aborted because the
                 kernel doesn't know if it missed a response while the transport was down

Example:
Jun 29 19:07:55 hostname kernel: sd 0:0:0:106: SCSI error: return code = 0x08000002
Jun 29 19:07:55 hostname kernel: sdf: Current: sense key: Aborted Command < sense key = Bh
Jun 29 19:07:55 hostname kernel: Add. Sense: Internal target failure
Jun 29 19:07:56 hostname kernel: end_request: I/O error, dev sdf, sector 41143440

May 30 17:16:30 hostname kernel: sd 1:0:0:14: SCSI error: return code = 0x08000002
May 30 17:16:30 hostname kernel: sdp: Current: sense key: Aborted Command < sense key = Bh
May 30 17:16:30 hostname kernel: <<vendor>> ASC=0xc0 ASCQ=0x0ASC=0xc0 ASCQ=0x0
May 30 17:16:30 hostname kernel: end_request: I/O error, dev sdp, sector 13250783

Dec 21 06:37:00 hostname kernel: sd 2:0:0:287: SCSI error: return code = 0x08000002
Dec 21 06:37:00 hostname kernel: sdfq: Current: sense key: Hardware Error < sense key = 4h
Dec 21 06:37:00 hostname kernel: Add. Sense: Internal target failure
Dec 21 06:37:00 hostname kernel:
Dec 21 06:37:00 hostname kernel: end_request: I/O error, dev sdfq, sector 81189199

Mar 21 16:41:14 hostname kernel: sd 4:0:2:29: SCSI error: return code = 0x08000002
Mar 21 16:41:14 hostname kernel: sdkv: Current: sense key: Not Ready < sense key = 2h
Mar 21 16:41:14 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required

Jun 11 17:11:26 hostname kernel: sd 3:0:1:0: SCSI error: return code = 0x08000002
Jun 11 17:11:26 hostname kernel: sdd: Current: sense key: Illegal Request < sense key = 5h
Jun 11 17:11:26 hostname kernel: <<vendor>> ASC=0x94 ASCQ=0x1ASC=0x94 ASCQ=0x1
Jun 11 17:11:26 hostname kernel: Buffer I/O error on device sdd, logical block 0

There are 16 possible scsi sense key values, of which 7 are reserved or unlikely to be
encountered. The 9 codes you might encounter are:

Sense Key
0h NO SENSE.
1h RECOVERED ERROR.
2h NOT READY.
3h MEDIUM ERROR.
4h HARDWARE ERROR.
5h ILLEGAL REQUEST.
6h UNIT ATTENTION.
7h DATA PROTECT.
Bh ABORTED COMMAND.

In each of the above cases there is a different problem present. For example in the hardware
error case, the disk had failed. In the not ready case, the io was being sent to a passive
path.

The ASC/ASCQ codes are decoded when they are scsi standard defined, otherwise if they are vendor
specific codes then they are output with "vendor" lines as shown above.

Summary:
Device has returned back a CHECK CONDITION (CC) SCSI status and a sense buffer.

Description:
The IO command has encountered some type of issue within the target device, resulting in
the command completing but not successfully. The status byte (0x02) is a CHECK CONDITION
scsi status returned from the target itself, so this is a device issue rather than a
transport issue. The driver byte (0x08) indicates that a sense
buffer from the target device IS available. The sense buffer will have a sense key,
additional sense code (ASC) and additional sense code qualifier (ASCQ) that more fully explain
the issue.

More/Next Steps:
This is an error being returned back from the target and usually indicates a target failure
of some type. Please refer to the sense key and ASC/ASCQ lines within messages for more
details. For example, if this is an aborted command that was requested because of a timeout,
then the root cause is the timeout which should be investigated further. If the abort was
unsolicited by the host, then storage should be engaged to review and address the issue.

Recommendations:

  1. have the customer engage their storage vendor,
  2. turn on additional kernel messages such as extended scsi or driver logging, and/or
  3. baseline loading information may need to be gathered (iostat, vmstat, top)

 

08.07.00.02 DRIVER_SENSE + DID_ERROR + {SCSI} CHECK_CONDITION


0x08.07.00.02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : {likely} not valid, see other fields
     07           host byte : DID_ERROR - internal error
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}

===NOTES===================================================================================

o 08.07.00.02 DRIVER_SENSE + DID_ERROR + {SCSI} CHECK_CONDITION 

Example:

kernel: sd 1:0:0:2: SCSI error: return code = 0x08070002
kernel: sdab: Current: sense key: Medium Error
kernel: Add. Sense: Unrecovered read error
kernel:
kernel: end_request: I/O error, dev sdab, sector 83235271
kernel: device-mapper: multipath: Failing path 65:176.
multipathd: dm-8: add map (uevent)
multipathd: dm-8: devmap already registered

...and another case...

kernel: sd 3:0:8:15: Unhandled sense code
kernel: sd 3:0:8:15: SCSI error: return code = 0x08070002
kernel: Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sdfu: Current: sense key: Aborted Command
kernel: Add. Sense: Data phase error

Summary:
IO failing with medium error, or other target type failure against device.

Description:
The IO command is being failed with a device status of medium error in the above example.
Medium errors are target-based errors and are not retryable. Namely, if the disk media is bad
then there is no chance that retrying down a different path will result in success -- if the
media is bad then the transport path is immaterial.

The odd thing is the DID_ERROR in this case. The DID_ERROR is set internally by the
driver upon detecting an anomaly within the target-provided status information.
For example, getting a scsi SUCCESS status, not having data underrun set BUT! having the
residual byte count be non-zero. In this case the driver questions the validity of all
the information, given that the various components of the status don't jibe with one another. The
status is essentially saying it was successful but didn't transfer all the data requested...
that is not the definition of success.

More/Next Steps:
Engage storage vendor. A media error appears to be present and needs to be addressed at the
hardware level. Typically the disk(s) will need to be physically replaced in this case.

Review the specific driver being used to see under what circumstances the DID_ERROR status
is set and returned to help better understand the circumstances of the reported error event.
The DID_ERROR is more of a curiosity than anything else. A review of when DID_ERROR is set
within the qla2xxx and lpfc drivers found no cases where DID_ERROR was set AND the scsi status,
sense key, or asc/ascq codes were modified. That is, the scsi CHECK CONDITION, medium error
sense key (3h) and unrecovered read error asc/ascq code (11h/00h) are all from the target
device. Ditto for the data phase error.


The DID_ERROR is the only troubling thing in this case; it's clearly an issue that the
device has a media error and thus the disk needs to be physically replaced. The DID_ERROR
is likely an artifact/secondary issue due to the primary issue (whatever the check condition
is for).

 

08.10.00.02 DRIVER_SENSE + TARGET_FAILURE + {SCSI} CHECK_CONDITION


0x08.10.00.02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : {likely} not valid, see other fields
     10           host byte : {RHEL5/6} DID_TARGET_FAILURE - permanent target failure, do not
                                                             retry other paths {set via sense
                                                             info review}
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}

===NOTES===================================================================================

o 08.10.00.02 DRIVER_SENSE + DID_TARGET_FAILURE + {SCSI} CHECK_CONDITION 

Example:

kernel: sd 6:0:2:0: SCSI error: return code = 0x08100002
kernel: Result: hostbyte=invalid driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sde: Current: sense key: Medium Error
kernel: Add. Sense: Record not found

Summary:
The IO is failing with sense information that flags the device as in a permanent error state.

Description:
One of the following sense key and asc/ascq combos is returned by the target, causing
the DID_TARGET_FAILURE host byte to be set. The hostbyte decode table within constants.c
is currently missing the DID_TARGET_FAILURE entry, which is why you might
see output in messages of "hostbyte=invalid" as above. A BZ has been opened against that issue.

A DID_TARGET_FAILURE means the current error is considered a permanent target failure and
no other retries on other paths will be attempted.

DID_TARGET_FAILURE is set under the following circumstances:

  1. Only certain scsi sense keys are processed by the scsi stack. If any of the
    following scsi sense keys are returned, then the device is considered dead
    and a hostbyte of DID_TARGET_FAILURE is set.

     . key 7h, DATA PROTECT.
     . key 8h, BLANK CHECK.
     . key Ah, COPY ABORTED.
     . key Dh, VOLUME OVERFLOW.
     . key Eh, MISCOMPARE.
    
     From the SCSI specification, these sense keys are described as follows.
    
     Sense Key
     7h           DATA PROTECT.  Indicates that a command that reads or writes the medium was
                  attempted on a block that is protected from this operation.  The read or
                  write operation is not performed.
    
     8h           BLANK CHECK. Indicates that a write-once device or a sequential access
                  device encountered blank medium or format-defined end-of-data indication
                  while reading or a write-once device encountered a non-blank medium while
                  writing.
    
     Ah           COPY ABORTED.  Indicates a COPY, COMPARE, or COPY AND VERIFY command was
                  aborted due to an error condition on the source device, destination
                  device, or both.
    
     Dh           VOLUME OVERFLOW.  Indicates that a buffered peripheral device has reached
                  the end-of-partition and data may remain in the buffer that has not been
                  written to the medium.  A RECOVER BUFFERED DATA command(s) may be issued
                  to read the unwritten data from the buffer.
    
     Eh           MISCOMPARE.  Indicates that the source data did not match the data read
                  from the medium.
    
  2. The sense key is MEDIUM ERROR (3h) plus any of the following additional sense code (asc) is
    set within the sense buffer:

     . 0x11/xx - Unrecovered read error
     . 0x13/xx - Address mark not found (AMNF) for data field
     . 0x14/xx - record not found {the specific code within the example above}
    
  3. Set if HARDWARE ERROR sense key (4h) and no retries are allowed for hardware
    errors per the scmd->device->retry_hwerror counter.

More/Next Steps:
Engage storage vendor. The root cause is one of: an odd sense key being returned from the
target (see #1 above), a medium error/asc combo (see #2), or a hardware error when no retries
on hardware errors are allowed for this device, which is typical (see #3).

Typically the cause of the odd sense, media error or hardware error from the device(s) will
need to be addressed.

Recommendations:
Engage storage h/w support.

 

References

See http://tldp.org/HOWTO/archived/SCSI-Programming-HOWTO/SCSI-Programming-HOWTO-21.html for additional background information.

 

Resources

http://osdir.com/ml/scsi/2003-01/msg00364.html
