fence_scsi_check.pl watchdog script does a soft reboot instead of hard and hangs during shutdown in a RHEL 6 or 7 Resilient Storage cluster with device-mapper-multipath

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
  • Using SCSI Persistent Reservation Fencing (fence_scsi)
  • Using the fence_scsi_check.pl watchdog script for fence_scsi to reboot a node when fenced
    • RHEL 7:
      • Using a fence-agents-scsi release prior to 4.0.11-27.el7_2.5, OR
      • Using fence-agents-scsi-4.0.11-27.el7_2.5 or later AND /usr/share/cluster/fence_scsi_check is linked or copied to /etc/watchdog.d/fence_scsi_check (as opposed to /etc/watchdog.d/fence_scsi_check_hardreboot)
    • RHEL 6:
      • Using a fence-agents release prior to 3.1.5-48.el6, OR
      • Using fence-agents-3.1.5-48.el6 or later AND /usr/share/cluster/fence_scsi_check.pl is linked or copied to /etc/watchdog.d (as opposed to /usr/share/cluster/fence_scsi_check_hardreboot.pl being linked or copied)
  • device-mapper-multipath
    • The settings for the device in question enable queueing (even if only temporary) when all paths have failed
      • Can be enabled via no_path_retry set to "queue" or a value greater than 0 in /etc/multipath.conf, or in the built-in device settings in multipathd (see /usr/share/doc/device-mapper-multipath-$vers/multipath.conf.defaults)
      • Can be enabled via features "1 queue_if_no_path" in /etc/multipath.conf or built-in device settings in multipathd if no_path_retry is not set.
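The conditions above can be summarized as: queueing is on if no_path_retry is "queue" or greater than 0, and otherwise falls back to the features line. As a rough illustration only (this is not Red Hat tooling, and real multipath.conf parsing is more involved), a short Python sketch of that decision logic against a simplified configuration snippet:

```python
import re

def queueing_enabled(conf_text):
    """Return True if a simplified multipath.conf snippet enables
    queueing when all paths are down: no_path_retry set to "queue"
    or a value > 0, or features "1 queue_if_no_path" when
    no_path_retry is not set at all."""
    npr = re.search(r'\bno_path_retry\s+(\S+)', conf_text)
    if npr:
        value = npr.group(1).strip('"')
        if value == "queue":
            return True
        if value.isdigit() and int(value) > 0:
            return True
        return False  # "fail" or 0 disables queueing
    # no_path_retry absent: the features line decides
    return "queue_if_no_path" in conf_text

sample = '''
devices {
  device {
      vendor "EXAMPLE"
      features "1 queue_if_no_path"
  }
}
'''
print(queueing_enabled(sample))  # True: features enable queueing
```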

Issue

  • After manually fencing a node that is actively running a resource group, the SCSI watchdog begins to initiate a reboot but fails to completely reboot the machine.
  • When watchdog reboots a node, it gets stuck shutting down, and backtraces show processes waiting on device-mapper or the filesystem.

Resolution

RHEL 7

Update to fence-agents-scsi-4.0.11-27.el7_2.5 or later, and instead of linking /usr/share/cluster/fence_scsi_check as /etc/watchdog.d/fence_scsi_check, create the link as /etc/watchdog.d/fence_scsi_check_hardreboot:

# rm /etc/watchdog.d/fence_scsi_check
# ln -s /usr/share/cluster/fence_scsi_check /etc/watchdog.d/fence_scsi_check_hardreboot

RHEL 6

  • Update to fence-agents-3.1.5-48.el6 or later, and switch from using /usr/share/cluster/fence_scsi_check.pl to /usr/share/cluster/fence_scsi_check_hardreboot.pl. This would require removing /etc/watchdog.d/fence_scsi_check.pl (which is usually a symlink back to the formerly listed script), and creating a new symlink:
# rm /etc/watchdog.d/fence_scsi_check.pl
# ln -s /usr/share/cluster/fence_scsi_check_hardreboot.pl /etc/watchdog.d/

All Releases

  • Workaround: Configure the device to fail immediately on the loss of all paths. Example snippet from /etc/multipath.conf (the exact configuration can vary per environment):
devices {
  device {
      vendor "EXAMPLE"
      no_path_retry fail
      features "0"
  }
}

Alternatively, no_path_retry can be kept at a value greater than 0, as long as it is small enough that queued I/O fails quickly once retries are exhausted.
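For example, a small retry window before the map fails outstanding I/O (the vendor string and value here are illustrative only; tune per environment and storage array documentation):

```
devices {
  device {
      vendor "EXAMPLE"
      no_path_retry 5
  }
}
```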

Root Cause

A resolution to this problem in RHEL 6 was released in an updated fence-agents package by Red Hat in Bugzilla #1050022, and in a RHEL 7 Update 2 asynchronous erratum via Bugzilla #1292071. Red Hat is further pursuing a release in a minor release of RHEL 7 via Bugzilla #1265426.

When a node is fenced by fence_scsi, subsequent write attempts to those devices controlled by the agent should return a SCSI reservation conflict error, propagating on all paths and resulting in that multipath map having no more remaining active paths. When no paths are remaining and no_path_retry is set to "queue" or a value greater than 0, then that I/O will continue to be queued until the retries are exhausted.

All the while, if the watchdog daemon is configured to reboot the host when its reservation is revoked, such as with fence_scsi_check.pl, then watchdog will start that process potentially before multipathd has flipped the map to stop queueing. As part of watchdog's shutdown process, it will kill any running processes, including multipathd. With multipathd dead, the map will never stop queueing blocked I/O, and watchdog will continue to wait for pending I/O to flush to disk, and thus the shutdown process may never complete.

The new fence-agents package in RHEL 6 mentioned above includes a separate fence_scsi_check_hardreboot.pl script which triggers a reboot that does not go through the normal graceful shutdown routine in watchdog but instead simply hard reboots the system. In the new package for RHEL 7, the /usr/share/cluster/fence_scsi_check script (which is simply a copy of /usr/sbin/fence_scsi) contains code that will cause the system to hard-reboot when fenced if the script executed by watchdog is named fence_scsi_check_hardreboot. In other words, the same script can be used to trigger the standard watchdog shutdown facilities or a hard-reboot, depending on whether it is named /etc/watchdog.d/fence_scsi_check or /etc/watchdog.d/fence_scsi_check_hardreboot.
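The name-based dispatch described above can be illustrated with a small, self-contained shell sketch. The script body, paths, and messages here are hypothetical stand-ins, not the actual fence_scsi_check logic; the point is only that one file behaves differently depending on the name it is invoked under:

```shell
# Create a sandbox with one script and a _hardreboot symlink to it,
# mirroring the /usr/share/cluster -> /etc/watchdog.d arrangement.
tmpdir=$(mktemp -d)

cat > "$tmpdir/fence_scsi_check" <<'EOF'
#!/bin/sh
# Dispatch on the invoked name ($0), as the RHEL 7 script does.
case "$(basename "$0")" in
    *hardreboot*) echo "hard reboot" ;;
    *)            echo "graceful shutdown" ;;
esac
EOF
chmod +x "$tmpdir/fence_scsi_check"
ln -s "$tmpdir/fence_scsi_check" "$tmpdir/fence_scsi_check_hardreboot"

# Same file, two names, two behaviors.
out_soft=$("$tmpdir/fence_scsi_check")
out_hard=$("$tmpdir/fence_scsi_check_hardreboot")
echo "$out_soft"   # graceful shutdown
echo "$out_hard"   # hard reboot

rm -rf "$tmpdir"
```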

Diagnostic Steps

  • Configure kdump, reproduce the hang on shutdown when triggered by watchdog, and then force a core dump with Sysrq-C to capture a core. Review the core: multipathd is dead, and the logs show processes blocked waiting on I/O to a device that never completed its retries while queueing.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.