device-mapper-multipath on RHEL5 experiences excessive delay in detecting a lost path from a storage failure that produces no RSCN or loop/link error

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 5
  • device-mapper-multipath configured (automatically or through /etc/multipath.conf) to use the tur or readsector0 path checker or the mpath_prio_rdac priority callout.

Issue

  • When a path fails, it is taking over 5 minutes for multipath to switch to another path
  • With an RDAC-based storage array, mpath_prio_rdac priority callouts may take 300 seconds to fail when the storage is unresponsive, delaying path failover.
  • One of two (redundant) Fibre Switches failed and de-zoned the LUNs presented from that switch
  • Although there was a remaining active path, the application timed out waiting for I/O from the voting disks while multipath waited on the SCSI layer to fail the path
  • Any failure on the fabric that does not produce a Register State Change Notification (RSCN) or a loop/link error will take at least 300 seconds to timeout at the scsi layer, causing the mpath map to be unresponsive for that long.
  • Servers may reboot under load

Resolution

  • Update to device-mapper-multipath-0.4.7-34.el5 or later. This release changed the default path checker SCSI command timeout from 300 to 60 seconds, and added a configurable option checker_timeout to control it. It can also be controlled via /sys/block/sdX/device/timeout, and multipath will use this setting as its default checker timeout.
  • One workaround exists:
    • Use the directio path checker (as long as the storage array supports it), as it does not use to the DEF_TIMEOUT as the other path checkers do.
  • Note: Certain Cisco Fiber Switches may require a software upgrade to resolve issues with storage becoming unresponsive (it was reported that a fix may be in NX-OS 4.2(3a) - Contact Cisco Support).
  • If the device-mapper map utilizes the mpath_prio_rdac priority callout, update to device-mapper-multipath-0.4.7-48.el5 or later, which changes the timeout on this callout from 300 seconds to 60 seconds.

Root Cause

The initial issues with long path failovers and timeouts were addressed by Red Hat Engineering via Bugzilla #553042. The issues with mpath_prio_rdac timeouts were addressed via Bugzilla #73072.

Normally when a fiber-channel connected storage device's state changes, the host will be notified via a link event or RSCN (Register State Change Notification), resulting in a rapid change in state in multipath for that particular path. However, in some situations it is possible for a storage device to become completely unresponsive as a result of issues on the host HBA, fabric, switch, storage controller, or array. When this occurs, the multipath map cannot switch to using an alternate path until I/O issued to the device either returns successfully or the SCSI layer times out the I/O and fails recovery. This process involves waking up the SCSI error handler, aborting timed out commands, resetting devices, and requeing commands, which can all take quite a bit of time and depends on the configured timeout settings in the HBA driver, SCSI layer, and device-mapper-multipath.

The above mentioned releases of device-mapper-multipath include a change to the tur and readsector0 path checkers that lower the default command timeout (shortening the maximum path failover time), and offer a new setting called checker_timeout to control this setting. If not specified, this setting will default to the value in /sys/block/sdX/device/timeout for each individual path.

The later release of device-mapper-multipath mentioned above also drops the mpath_prio_rdac SCSI command timeout from 300 seconds to 60 seconds, dropping the amount of time it takes for that command to timeout, which could also speed up path failovers in some situations where that prio_callout is used. See this article for more information on this specific problem.

Diagnostic Steps

In the kernel logs (default location /var/log/messages) the first sign of a problem will be a SCSI Abort message:

Jan  4 20:08:54 myhost.example.com kernel: qla2xxx 0000:25:00.0: scsi(7:0:3): Abort command issued -- 1 4ff31d4 2002.

Depending on how many device-alias entries in the fibre switch are deleted/lost, several other similar messages may also be seen for other SCSI devices. Eventually the device reset is issued by the driver due to it's elevation mechanism:

Jan  4 20:09:14 myhost.example.com kernel: qla2xxx 0000:25:00.0: scsi(7:0:3): DEVICE RESET ISSUED.
Jan  4 20:09:39 myhost.example.com kernel: qla2xxx 0000:25:00.0: qla2xxx_eh_device_reset: device reset failed
Jan  4 20:09:39 myhost.example.com kernel: qla2xxx 0000:25:00.0: scsi(7:0:3): LOOP RESET ISSUED.
Jan  4 20:09:40 myhost.example.com kernel: qla2xxx 0000:25:00.0: qla2xxx_eh_bus_reset: reset succeded

And finally the adapter reset:

Jan  4 20:10:09 myhost.example.com kernel: qla2xxx 0000:25:00.0: scsi(7:0:3): ADAPTER RESET ISSUED.
Jan  4 20:10:09 myhost.example.com kernel: qla2xxx 0000:25:00.0: Performing ISP error recovery - ha= ffff81104c3104f8.
Jan  4 20:10:10 myhost.example.com kernel: qla2xxx 0000:25:00.0: LOOP UP detected (4 Gbps).
Jan  4 20:10:10 myhost.example.com kernel: qla2xxx 0000:25:00.0: qla2xxx_eh_host_reset: reset succeded

Eventually, the scsi midlayer becomes aware of an issue, and also the device mapper multipath:

Jan  4 20:10:20 myhost.example.com kernel: sd 7:0:0:3: scsi: Device offlined - not ready after error recovery
Jan  4 20:10:20 myhost.example.com kernel: sd 7:0:0:3: scsi: Device offlined - not ready after error recovery
Jan  4 20:10:20 myhost.example.com kernel: sd 7:0:0:3: rejecting I/O to offline device
Jan  4 20:10:20 myhost.example.com kernel: device-mapper: multipath: Failing path 8:16.

The period of time that passes from the first sign of trouble, until device mapper multipath's path checker detects any problem in this example is 1 minute 26 seconds.

Another example that illustrates the variable period it can take:

Jan  4 20:12:55 myhost.example.com kernel: qla2xxx 0000:24:00.0: scsi(5:0:3): Abort command issued -- 1 669e398 2002.
Jan  4 20:13:05 myhost.example.com kernel: qla2xxx 0000:24:00.0: scsi(5:0:3): Abort command issued -- 1 669e398 2002.
...
Jan  4 20:13:55 myhost.example.com kernel: qla2xxx 0000:24:00.0: scsi(5:0:3): DEVICE RESET ISSUED.
Jan  4 20:14:20 myhost.example.com kernel: qla2xxx 0000:24:00.0: scsi(5:0:3): LOOP RESET ISSUED.
Jan  4 20:14:21 myhost.example.com kernel: qla2xxx 0000:24:00.0: qla2xxx_eh_bus_reset: reset succeded
...
Jan  4 20:15:05 myhost.example.com kernel: qla2xxx 0000:24:00.0: scsi(5:0:3): ADAPTER RESET ISSUED.
Jan  4 20:15:05 myhost.example.com kernel: qla2xxx 0000:24:00.0: Performing ISP error recovery - ha= ffff81182fc004f8.
Jan  4 20:15:06 myhost.example.com kernel: qla2xxx 0000:24:00.0: LOOP UP detected (4 Gbps).
Jan  4 20:15:06 myhost.example.com kernel: qla2xxx 0000:24:00.0: qla2xxx_eh_host_reset: reset succeded
...
Jan  4 20:15:16 myhost.example.com kernel: sd 5:0:0:3: scsi: Device offlined - not ready after error recovery
Jan  4 20:15:16 myhost.example.com kernel: sd 5:0:0:3: scsi: Device offlined - not ready after error recovery
...
Jan  4 20:15:16 myhost.example.com kernel: sd 5:0:0:3: rejecting I/O to offline device
Jan  4 20:15:16 myhost.example.com kernel: device-mapper: multipath: Failing path 129:192.
SBR
Components

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.