SCSI error: return code = 0x000d0000 with Emulex LPFC module on RHEL 5

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5.5 or earlier
  • Emulex LPFC HBA with Red Hat supplied modules

Issue

  • /var/log/messages reports

    Oct 23 06:07:40 host kernel: sd 5:0:1:32: SCSI error: return code = 0x000d0000
    Oct 23 06:07:40 host kernel: end_request: I/O error, dev sdcv, sector 2556879
    Oct 23 06:07:40 host kernel: device-mapper: multipath: Failing path 70:48.
    
  • and shortly after

    Oct 23 06:07:49 host multipathd: sdcv: readsector0 checker reports path is up 
    Oct 23 06:07:49 host multipathd: 70:48: reinstated
    

Resolution

  • The I/O return code 0x000d0000 breaks down into four bytes as follows:
0x00.0D.00.00
           00   status byte : <{likely} not valid, see other fields>
        00         msg byte : <{likely} not valid, see other fields>
     0D           host byte : DID_REQUEUE - requeue the command (no immediate retry) without decrementing the retry count {RHEL5/RHEL6 only}
  00            driver byte : <{likely} not valid, see other fields>
  • From the original upstream patch that added DID_REQUEUE:
    • "We have a DID_IMM_RETRY to require a retry at once, but we could do with a DID_REQUEUE to instruct the mid-layer to treat this command in the same manner as QUEUE_FULL or BUSY (i.e. halt the submission until another command returns ... or the queue pressure builds if there are no outstanding commands)."
    • So, REQUEUE is essentially a delayed retry: rather than being resubmitted immediately, the I/O is placed back on the request queue and is sent down to the driver and out to storage only after some currently outstanding I/O completes.
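
The byte breakdown above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch, not part of the original solution: the function name decode_scsi_result is hypothetical, and the field order (driver, host, msg, status, from most to least significant byte) is taken from the breakdown shown above.

```bash
#!/bin/bash
# Split a 32-bit SCSI result code into its four byte fields.
# Field order (driver, host, msg, status) follows the breakdown in this article.
# decode_scsi_result is a hypothetical helper name.
decode_scsi_result() {
    local code=$(( $1 ))    # accepts hex input such as 0x000d0000
    printf 'driver byte: 0x%02x\n' $(( (code >> 24) & 0xff ))
    printf 'host byte:   0x%02x\n' $(( (code >> 16) & 0xff ))
    printf 'msg byte:    0x%02x\n' $(( (code >> 8) & 0xff ))
    printf 'status byte: 0x%02x\n' $((  code        & 0xff ))
}

decode_scsi_result 0x000d0000
```

For 0x000d0000 only the host byte is non-zero (0x0d), which is the DID_REQUEUE value described above.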

  • This was addressed in RHSA-2011-0017
  • You must upgrade to RHEL 5.6 or later to resolve this issue.
  • Bug 627836 was posted to change RHEL 5.6 so that it retries DID_REQUEUE rather than fast-failing it when used with dm-multipath. This resolves the symptoms in most of the support cases.
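
To see whether a host is already at the RHEL 5.6 level, the running kernel release can be compared with the 5.6 kernel build. The sketch below rests on an assumption that is not stated in this article: that RHEL 5.6 shipped kernel 2.6.18-238.el5. The function name kernel_at_least_5_6 is hypothetical; adjust the build number if your errata information differs.

```bash
#!/bin/bash
# Exit status: 0 = at or above the RHEL 5.6 build, 1 = older,
#              2 = not a RHEL 5 (2.6.18-based) kernel release string.
# ASSUMPTION (not from this article): RHEL 5.6 shipped kernel 2.6.18-238.el5.
kernel_at_least_5_6() {
    local release=$1 build
    case $release in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    build=${release#2.6.18-}     # e.g. "194.el5" or "238.1.1.el5"
    build=${build%%.*}           # keep only the leading build number
    case $build in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$build" -ge 238 ]
}

cat /etc/redhat-release 2>/dev/null || true
rc=0
kernel_at_least_5_6 "$(uname -r)" || rc=$?
case $rc in
    0) echo "kernel is at or above the assumed RHEL 5.6 build" ;;
    1) echo "kernel predates the assumed RHEL 5.6 build; upgrade recommended" ;;
    *) echo "not a RHEL 5 kernel; this check does not apply" ;;
esac
```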

Root Cause

  • There are multiple (at least two) root causes for this issue:

  • The LPFC driver/firmware failed to process an I/O and requested that it be requeued. The exact reason is unknown, but it is presumed that a resource limit of some sort was reached at the firmware or hardware level, usually under heavy I/O load. Retrying the I/O request always seems to succeed.

  • A fibre-channel protocol framing error occurred. These errors are usually intermittent, so a retry works around the issue. This is probably hardware related: the fibre-channel cable should be checked and tested. If this is the root cause, one or more of the Emulex HBA error counters will be non-zero. The counters can be examined with the command:

cat /sys/class/fc_host/host*/statistics/{error_frames,invalid_*}
  • In cases where the problem seems to be correlated with other system or application activity, capture the FC statistics over a 48-hour period and correlate them with syslog and system-wide performance data (especially I/O activity). A script like the following should be run in the background with its output redirected to a file:
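#!/bin/bash
# bash is required for the {a,b} brace expansion used below
hostname
while true
do
	# marker line: date, then uptime (which starts with the current time)
	echo '#' $(date +%F) $(uptime)
	for f in /sys/class/fc_host/host*/statistics/{error_frames,invalid_*}
	do
		[ -f "$f" ] && echo "$f" "$(cat "$f")"
	done
	sleep 5
done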
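
Once the capture has run, the samples need to be reduced to the moments when a counter actually changed, so that those times can be matched against syslog. The following is a rough sketch of one way to do that, assuming the capture file has exactly the layout produced by the script above ("#" marker lines carrying date and time, followed by "path value" lines). The function name fc_stat_changes and the file name fc_stats.log are hypothetical.

```bash
#!/bin/bash
# Print one line each time a counter value differs from the previous sample.
# Input: the capture file written by the collection script above.
# fc_stat_changes is a hypothetical helper name.
fc_stat_changes() {
    awk '
        # marker line: "# <date> <time> up ..." gives the sample timestamp
        $1 == "#" { stamp = $2 " " $3; next }
        # counter line: "<sysfs path> <value>"
        NF == 2 && $1 ~ /^\// {
            if (($1 in last) && last[$1] != $2)
                print stamp, $1, last[$1], "->", $2
            last[$1] = $2
        }
    ' "$1"
}

# Example: ./fc_stat_changes.sh fc_stats.log
if [ $# -gt 0 ]; then
    fc_stat_changes "$1"
fi
```

Each output line gives the timestamp of the sample in which the change was first seen, so the actual event happened within the preceding 5-second sampling interval.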

Diagnostic Steps

  • Check /var/log/messages
    Oct 23 06:07:40 somekernel kernel: sd 5:0:1:32: SCSI error: return code = 0x000d0000
    Oct 23 06:07:40 somekernel kernel: end_request: I/O error, dev sdcv, sector 2556879
    Oct 23 06:07:40 somekernel kernel: device-mapper: multipath: Failing path 70:48.
    ... one device-mapper-multipath path_checker interval later ...
    Oct 23 06:07:49 somekernel multipathd: sdcv: readsector0 checker reports path is up 
    Oct 23 06:07:49 somekernel multipathd: 70:48: reinstated
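
To gauge how often the error occurs and on which devices, the log can be summarised per SCSI address. This is an illustrative sketch that assumes the message format shown above ("sd H:C:T:L: SCSI error: return code = 0x000d0000"); the function name count_requeue_errors is hypothetical.

```bash
#!/bin/bash
# Count DID_REQUEUE (0x000d0000) errors per SCSI address (host:channel:id:lun).
# count_requeue_errors is a hypothetical helper name.
count_requeue_errors() {
    awk '/SCSI error: return code = 0x000d0000/ {
             for (i = 1; i < NF; i++)
                 if ($i == "sd") {            # the address follows the "sd" token
                     addr = $(i + 1)
                     sub(/:$/, "", addr)      # drop the trailing colon
                     count[addr]++
                     break
                 }
         }
         END { for (a in count) print count[a], a }' "$1"
}

# Only runs when a readable log file is given, e.g. /var/log/messages.
if [ -r "${1:-}" ]; then
    count_requeue_errors "$1"
fi
```

Each output line is a count followed by the SCSI address; the order of the lines is not sorted.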
