Limiting path failover time for SCSI devices

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5.10.z, kernel-2.6.18-371.6.1.el5 or newer
  • Red Hat Enterprise Linux 6.4.z, kernel-2.6.32-358.32.3.el6 or newer
  • Red Hat Enterprise Linux 7
  • SCSI devices in a multipath configuration
  • Storage faults that cause IO to time out on external storage

Issue

  • A storage fault causes the SCSI error handler (scsi_eh) threads to spend a long time attempting to recover before failing over to other available paths.

Resolution

Red Hat Enterprise Linux 5.10.z and 6.4.z include several new SCSI and device-driver parameters that can help limit the overall time spent in SCSI error handling. Setting these parameters to appropriate values may greatly reduce the time the kernel takes to fail IO on a defective path before re-issuing it on other available devices.

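  • eh_deadline

The maximum number of seconds the SCSI error handler may spend attempting recovery before it proceeds directly to the HBA reset.

It is initially set to 0, so the default behavior of the system remains unchanged.
When this timeout expires, the HBA port is reset, and commands are free to retry on another path. The HBA port reset takes approximately 10 sec.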

The following command sets eh_deadline to #VALUE seconds for just the Fibre Channel SCSI host adapters in the system (replace #VALUE with the desired number of seconds; to apply the setting at every boot, execute the command from /etc/rc.d/rc.local):

    #  for i in $(ls /sys/class/fc_host/); do echo #VALUE > /sys/class/scsi_host/$i/eh_deadline; done 

For all SCSI host adapters you can use [1]:

    #  for i in $(ls /sys/class/scsi_host/*/eh_deadline); do echo #VALUE > $i ; done

Alternatively, the values can be applied with a udev rule; see the solution "How to set eh_deadline and eh_timeout persistently, using a udev rule".


[1] Although all SCSI drivers list eh_deadline as an attribute, not all drivers currently support changing its value. For example, the hpsa and usb-storage drivers do, but the ata_piix, ahci and sata_sil24 drivers do not. For drivers that do not support changing eh_deadline, the write fails with the message "echo: write error: Invalid argument".
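
Because some drivers reject the write, it can be useful to see which hosts actually accepted the new value. The following is a minimal sketch, not a supported tool: the function name set_eh_deadline and the SYSFS_ROOT variable are assumptions introduced here (SYSFS_ROOT exists only so the function can be tried against a scratch directory instead of the real /sys).

```shell
#!/bin/sh
# Sketch: set eh_deadline on every SCSI host and report which hosts
# accepted the value. SYSFS_ROOT is a hypothetical override for dry runs;
# it defaults to the real /sys.
set_eh_deadline() {
    value=$1
    root=${SYSFS_ROOT:-/sys}
    for f in "$root"/class/scsi_host/*/eh_deadline; do
        [ -e "$f" ] || continue                 # glob matched nothing
        host=$(basename "$(dirname "$f")")      # e.g. host3
        # A driver without eh_deadline support makes the write fail
        # ("write error: Invalid argument"); report it and carry on.
        if echo "$value" 2>/dev/null > "$f"; then
            echo "$host: eh_deadline=$value"
        else
            echo "$host: not supported by driver"
        fi
    done
}

# Usage (as root), for example:
#   set_eh_deadline 200
```

Hosts reported as not supported keep their previous behavior, so failover time on those adapters is not bounded by eh_deadline.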

 

This timer should be set with a margin that allows sufficient time for the HBA port reset and for queued I/O to complete on an alternate path. For example, if an application timeout is configured for 300s (5m), it may be reasonable to set the eh_deadline value to 200s (3m20s). Please test these settings thoroughly before applying them on production systems.

(Note, we are assuming that there is a healthy alternate path, and that any failed paths have already been discovered and taken down, so the I/O will complete promptly on the next path selected.)
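
The sizing guidance above can be checked with simple arithmetic. The sketch below subtracts eh_deadline and the approximate 10 sec HBA port reset from the application timeout to show how much time is left for queued I/O to complete on the alternate path. The function name remaining_margin is an assumption made for illustration, and the model deliberately ignores any other delays.

```shell
#!/bin/sh
# remaining_margin APP_TIMEOUT EH_DEADLINE [HBA_RESET]
# Prints the seconds left for I/O to complete on an alternate path after
# error handling hits eh_deadline and the HBA port has been reset.
remaining_margin() {
    app_timeout=$1
    eh_deadline=$2
    hba_reset=${3:-10}      # approximate port reset time quoted above
    echo $(( app_timeout - eh_deadline - hba_reset ))
}

remaining_margin 300 200    # prints 90
```

A result close to zero or negative suggests the chosen eh_deadline leaves too little headroom for the application timeout.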

This parameter was introduced in the latest RHEL 6.4 kernel and is present in RHEL 6.5 and later kernels.

  • eh_timeout (TEST UNIT READY error handler timeout)

The number of seconds to wait for TUR operations issued by the error handling code to respond.

/sys/class/scsi_device/<h:c:t:l>/device/eh_timeout

The default is 10 sec. This timer may run a minimum of four times in the case where the target has gone silent. This could be safely reduced to 5 sec. This will allow adequate time to complete several task management operations (as above), with the associated TEST UNIT READY operations, before the eh_deadline expires.
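
A minimal sketch for inspecting and changing eh_timeout on every SCSI device through the sysfs path shown above. The function names and the SYSFS_ROOT variable are assumptions made for this example; SYSFS_ROOT only allows a dry run against a scratch directory and defaults to /sys.

```shell
#!/bin/sh
# Sketch: list or change eh_timeout for every SCSI device.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.

# Print one "h:c:t:l seconds" line per SCSI device.
show_eh_timeout() {
    for f in "${SYSFS_ROOT:-/sys}"/class/scsi_device/*/device/eh_timeout; do
        [ -r "$f" ] || continue
        dev=${f%/device/eh_timeout}     # strip the trailing path components
        echo "${dev##*/} $(cat "$f")"
    done
}

# Write the given number of seconds to every writable eh_timeout file.
set_eh_timeout() {
    for f in "${SYSFS_ROOT:-/sys}"/class/scsi_device/*/device/eh_timeout; do
        [ -w "$f" ] || continue
        echo "$1" > "$f"
    done
}

# Usage (as root), for example:
#   show_eh_timeout
#   set_eh_timeout 5
```

Values written this way do not survive a reboot and are not applied to devices discovered later.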

  • lpfc_task_mgmt_tmo (LPFC task management timeout (ABORT TASK, LUN RESET, TARGET RESET))

The number of seconds to wait for task management operations (ABORT TASK, LUN RESET, TARGET RESET) issued by the lpfc driver to respond.

    #  for i in $(ls /sys/class/scsi_host/*/lpfc_task_mgmt_tmo); do echo #VALUE > $i ; done

The default is 60 sec. This timer may run a minimum of four times, in the case where the target has gone silent. This timeout could be safely reduced to 20 sec. This will allow adequate time to attempt a reset, followed by a TEST UNIT READY, and to retry the process if the failure persists, before the eh_deadline expires.
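
To see why the reduced values matter, the "minimum of four times" figures can be turned into a rough lower bound. The sketch below assumes, purely for illustration, that the four task management timeouts and the four TEST UNIT READY timeouts simply add up; real error handling may take longer. The function name min_eh_time is hypothetical.

```shell
#!/bin/sh
# min_eh_time TASK_MGMT_TMO EH_TIMEOUT [RUNS]
# Rough lower bound, in seconds, on error handling time against a silent
# target, assuming each timer expires RUNS times and the waits add up.
min_eh_time() {
    runs=${3:-4}
    echo $(( runs * ($1 + $2) ))
}

min_eh_time 60 10    # defaults:       prints 280
min_eh_time 20 5     # reduced values: prints 100
```

Under this assumption the defaults (280 sec) would overrun an eh_deadline of 200 sec, whereas the reduced values (100 sec) leave room to complete the sequence before the deadline expires.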

NOTE: Before applying any of these settings on a production system, make sure the changes have been tried in a test environment and that no issues were observed during the tests.

Root Cause

When an IO timeout occurs, the Linux kernel SCSI error handler proceeds through a sequence of recovery methods to attempt to recover failing devices or transports while causing as little disruption as possible to other IO taking place on the system. The standard recovery levels are executed in order, escalating to the next level whenever a recovery attempt fails or a subsequent SCSI Test Unit Ready (TUR) command fails:

  • Abort timed out commands and attempt to bring device online
  • Issue SCSI device reset task management function for each failing device
  • Issue SCSI target reset task management function for each failing target
  • Issue SCSI bus reset for each failing bus (emulated as a series of port resets for Fibre Channel environments)
  • Issue SCSI host bus adapter (HBA) reset

Each level of escalation broadens the scope of the recovery attempt and so increases the number of other IO requests and devices that may be affected by the recovery action. By starting with simple command aborts and finally escalating to a full HBA reset the methods are tried in order of increasing cost / disruption.

In a situation where all operations on the external storage time out (for example due to a failed SAN fabric component that has ceased to pass any traffic or report any error condition) this logic can lead to very long delays in failing IO where there are large numbers of devices or targets (since each reset level is repeated for each outstanding command, device, target etc.).

By setting an overall limit on the time spent attempting these operations (and immediately proceeding to the HBA reset if this time expires) the features discussed in this solution provide more consistent and predictable system behavior when faults of this nature occur.

Diagnostic Steps

  • Storage fault present that causes IO and other operations to time out

  • Logs record escalating device, target, bus, and host resets:

    Nov 22 17:35:09 server kernel: lpfc 0000:0b:01.1: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
    Nov 22 17:36:19 server kernel: lpfc 0000:0b:01.1: 1:(0):0713 SCSI layer issued Device Reset (0, 4) return x2002
    Nov 22 17:37:29 server kernel: lpfc 0000:0b:01.1: 1:(0):0723 SCSI layer issued Target Reset (0, 4) return x2002
    Nov 22 17:38:49 server kernel: lpfc 0000:0b:01.1: 1:(0):0714 SCSI layer issued Bus Reset Data: x2002
    Nov 22 17:39:20 server kernel: lpfc 0000:0b:01.1: 1:3172 SCSI layer issued Host Reset Data: x2002
    Nov 22 17:39:30 server kernel: device-mapper: multipath: Failing path 8:80.
    Nov 22 17:39:30 server kernel: device-mapper: multipath: Failing path 8:32.

  • Overall time to fail IO exceeds configured application or cluster timeouts leading to host reboots or application failure
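
To compare the observed failover time with application or cluster timeouts, the elapsed time between the first reset message and the first failed path can be computed from the log. This is a minimal sketch assuming the syslog layout shown above (time of day in the third field) and that all messages fall on the same day; the function name failover_seconds is hypothetical.

```shell
#!/bin/sh
# failover_seconds LOGFILE...
# Prints the seconds between the first "SCSI layer issued" reset message
# and the first following multipath "Failing path" message.
failover_seconds() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        /SCSI layer issued/ && !seen       { start = secs($3); seen = 1 }
        /multipath: Failing path/ && seen  { print secs($3) - start; exit }
    ' "$@"
}

# Usage, for example:
#   failover_seconds /var/log/messages
```

For the log excerpt above this gives 261 seconds (17:35:09 to 17:39:30), and the IO timeout that triggered error handling occurred earlier still, so the total time to fail IO is longer than this figure.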
