Unresponsive storage device leads to excessive SCSI recovery and device-mapper-multipath failover times in RHEL
Environment
- Red Hat Enterprise Linux (RHEL) 6
- Red Hat Enterprise Linux (RHEL) 5
- device-mapper-multipath
Issue
- My multipath device is taking a long time to switch to another path when a storage failure occurs.
- How can I configure device-mapper-multipath and SCSI devices to fail over more quickly, so that there is minimal disruption to I/O during a path failure?
- A non-responsive SCSI target that reports no transport/link or other errors, but simply times out commands, will trigger SCSI error recovery logic, which may take a long time; this blocks dm-multipath from failing over to another path. Such excessive time may render an expensive high-availability dual-fabric configuration ineffective, and application timeouts may be triggered.
- How can I prevent applications (such as Red Hat High Availability Cluster, Oracle RAC, etc.) that place a timeout on disk I/O from timing out while waiting for a SCSI or multipath device to fail?
Resolution
Red Hat Enterprise Linux 6
Update to kernel-2.6.32-131.21.1.el6 (from RHSA-2011:1465) or later.
Status
Several tunable parameters exist to decrease the amount of time it takes the kernel layers to handle errors of this type. You can adjust SCSI device timeouts, transport timeouts, and queue depths so that SCSI error recovery logic completes in a shorter amount of time. The downside to this approach is that it may reduce performance under some circumstances.
1. Adjust the SCSI device timeout values for all path devices. For example:
# echo 20 > /sys/block/<device>/device/timeout
Where <device> is the storage device, such as sdc, sdd, etc. These settings are not persistent; add the commands to /etc/rc.local to apply them whenever the system boots.
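As a sketch, the per-device write above can be wrapped in a small helper that covers every sd* path device. This helper is not from the article; the SYSFS variable is parameterized here only so the logic can be exercised outside a real /sys.

```shell
#!/bin/sh
# Sketch: apply one SCSI command timeout to every sd* path device.
# SYSFS defaults to /sys; it is overridable purely for illustration.
SYSFS="${SYSFS:-/sys}"

set_scsi_timeouts() {
    secs="$1"
    for dev in "$SYSFS"/block/sd*; do
        # Skip anything without a writable timeout attribute
        [ -w "$dev/device/timeout" ] || continue
        echo "$secs" > "$dev/device/timeout"
    done
}
```

A call such as `set_scsi_timeouts 20` from /etc/rc.local would reapply the value on every boot.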
2. Set transport timeouts. For Fibre Channel, set dev_loss_tmo. There are two ways to do this:
- Limit it system-wide in /etc/modprobe.conf (RHEL 5) or /etc/modprobe.d/scsi.conf (RHEL 6) by adding this entry:
options scsi_transport_fc dev_loss_tmo=10
- Or change the value manually via sysfs for the desired rports:
# echo 10 > /sys/class/fc_remote_ports/rport-0:0-0/dev_loss_tmo
# echo 10 > /sys/class/fc_remote_ports/rport-1:0-0/dev_loss_tmo
Again, these are not persistent, so setting them via /etc/rc.local can apply them on every boot.
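The per-rport echoes can likewise be wrapped in a loop over all FC remote ports. Again this is a sketch, not part of the article; SYSFS is parameterized only for illustration.

```shell
#!/bin/sh
# Sketch: set dev_loss_tmo on every FC remote port.
SYSFS="${SYSFS:-/sys}"

set_dev_loss_tmo() {
    secs="$1"
    for rport in "$SYSFS"/class/fc_remote_ports/rport-*; do
        # Skip ports without a writable dev_loss_tmo attribute
        [ -w "$rport/dev_loss_tmo" ] || continue
        echo "$secs" > "$rport/dev_loss_tmo"
    done
}
```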
3. Limit the queue depth per LUN on each host. The queue depth setting depends on the driver. For the qla2xxx driver, change it persistently by putting the following line in /etc/modprobe.conf (RHEL 5) or /etc/modprobe.d/scsi.conf (RHEL 6):
options qla2xxx ql2xmaxqdepth=32
For the lpfc driver, set it as follows:
options lpfc lpfc_lun_queue_depth=32
After reloading the module, the value can be checked in the 7th column of:
# cat /proc/scsi/sg/devices
0 0 0 0 0 1 31 0 1
1 0 0 0 0 1 31 0 1
2 0 0 0 0 1 31 0 1
If you have changed the /etc/modprobe.conf (RHEL 5) or /etc/modprobe.d/* (RHEL 6) files in any of these steps, you should rebuild the initrd (RHEL 5) or initramfs (RHEL 6) and reboot.
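The rebuild command differs by release (mkinitrd on RHEL 5, dracut on RHEL 6). As a sketch, the hypothetical helper below just prints the appropriate command for a given release and kernel version, so it can be reviewed before being run as root:

```shell
#!/bin/sh
# Sketch: print the boot-image rebuild command for a RHEL major release.
boot_image_cmd() {
    release="$1"   # 5 or 6
    kver="$2"      # kernel version, e.g. "$(uname -r)"
    case "$release" in
        5) echo "mkinitrd -f /boot/initrd-$kver.img $kver" ;;
        6) echo "dracut -f /boot/initramfs-$kver.img $kver" ;;
    esac
}
```

For example, `boot_image_cmd 6 "$(uname -r)"` prints the dracut invocation for the running kernel.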
During error recovery, most SCSI error recovery stages send a TUR (Test Unit Ready) command for every failed command once a driver error handler reports success. When several failed commands pointed at the same device, the device was probed multiple times; and when the device did not respond to commands even after a recovery function returned success, the error handler had to wait for each of those test commands to time out. This significantly impeded the recovery process. With the release of kernel-2.6.32-220.el6 and later versions, the SCSI mid-layer error routines have been fixed to send one test command per device instead of one per failed command, reducing error recovery time considerably.
Root Cause
When a path is unresponsive, the SCSI layer and driver must go through a process of retrying any outstanding commands and initiating their error handlers, before they can throw an error back up to the multipath layer. This process is needed to ensure data integrity, as illustrated by the following example.
Consider the case where you have 2 paths in a path group. When all is well, multipath will round robin between these two paths, switching every rr_min_io I/O operations. Now, say that the path we're currently sending I/O down suddenly stops responding. We don't get a link error or an RSCN, but instead we just get no response to the commands that were issued. If those commands had a 30s SCSI command timeout, we must wait that long before taking further action. If we didn't wait that long, and decided to send I/O down the other path, then consider if the original path "woke up" and all of a sudden processed the I/O operations which had not yet timed out because 30s had not passed. Since we've already sent I/O down the alternate path and potentially moved on to other I/O operations, this double-processing of the same I/Os could cause corruption. (These considerations are quite similar to those that underlie the need for fencing in clustering software design - when a participant "appears" to have gone down, you need to make sure it is down before you can recover)
If the FC network doesn't tell us a device is unreachable, there are several additive components to the maximum time until a command can be retried down another path:
Tf = timeout on the first command to experience a timeout
Tl = longest command timeout running when Tf occurs
Nc = Number of commands which timed out
Nd = Number of devices with timed out commands
Xa = Adapter's time to abort a command
Xd = Adapter's time to reset a device
Xb = Adapter's time to reset its bus connection
Xh = Adapter's time for a full reset
Tf + Tl + Nc * (10 + Xa) + Nd * (10 + Xd) + (Xb + dev_loss_tmo) + (Xh + 10)
The larger the number of commands and the number of devices with bad commands, the larger the delay. The 10s terms in the equation come from the error handler sending TUR commands with 10-second timeouts to check whether the device has been restored to a working state.
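To get a feel for the magnitudes, the formula can be evaluated with illustrative, made-up values (Tf = Tl = 30s, Nc = 4 timed-out commands, Nd = 2 devices, Xa = 1s, Xd = 2s, Xb = 5s, Xh = 10s, dev_loss_tmo = 60s); these are not measured adapter times.

```shell
#!/bin/sh
# Worked example of the worst-case formula above.
# All inputs are illustrative values in seconds.
recovery_worst_case() {
    Tf=$1; Tl=$2; Nc=$3; Nd=$4; Xa=$5; Xd=$6; Xb=$7; Xh=$8; dlt=$9
    echo $(( Tf + Tl + Nc * (10 + Xa) + Nd * (10 + Xd) + (Xb + dlt) + (Xh + 10) ))
}

recovery_worst_case 30 30 4 2 1 2 5 10 60   # prints 213
```

Even this modest scenario yields over three and a half minutes of worst-case recovery time.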
The issue would have been worse without the fix for Bugzilla #553042 (cf. RHBA-2010:0255-1). However, setting a very low checker timeout does not shorten the timeouts of in-flight I/O commands, which also contribute to the total time the error handler takes to run.
Before the error handlers will run, all commands sent to a particular SCSI host must have either timed out or completed. Since new commands can be submitted right up until the moment a command times out, the worst-case delay is the timeout of the first command to time out plus the timeout of the last command to time out. With the I/O timeout set to 60s, that means a worst case of 120s before the error handler can even begin to run. A short polling_interval and checker_timeout in /etc/multipath.conf can reduce the first timeout period, but won't reduce the last timeout period.
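For reference, both settings live in the defaults section of /etc/multipath.conf; the values below are illustrative only, not recommendations from this article.

```
defaults {
    polling_interval   5
    checker_timeout    15
}
```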
In the logs, there may be occurrences of the error handler having to run due to timeouts. Normally, the commands are recovered without a problem. However, the occurrence of one of these non-fatal abort sequences shortly before a fatal one is one of the reasons multipath could not retry the command in time to avoid the voting disk timeout.
For more information on this topic and another example of a situation where these long timeouts might be seen, visit device-mapper-multipath on RHEL5 experiences excessive delay in detecting a lost path from a storage failure that produces no RSCN or loop/link error.
The Storage Administration Guides for RHEL 5 and RHEL 6 also contain useful information on administering storage devices and timeouts.