Is there a watchdog script for fence_scsi to reboot a RHEL High Availability or Resilient Storage cluster node?


Environment

  • Red Hat Enterprise Linux (RHEL) 6 Update 2 or later with the High Availability or Resilient Storage Add-On
  • Red Hat Enterprise Linux (RHEL) 7, 8, or 9 with the High Availability or Resilient Storage Add-On

Issue

  • Is there a watchdog script for Red Hat Enterprise Linux Server (with the High Availability or Resilient Storage Add-Ons) to reboot a cluster node?

Resolution

Configuration of the Watchdog Service for fence_scsi in RHEL 6

This configuration assumes that you are using the fence_scsi agent and it is correctly configured in the /etc/cluster/cluster.conf file.

  1. Install the watchdog package.
# yum -y install watchdog
  2. Link the fence_scsi_check.pl script to the /etc/watchdog.d/ directory.
# ln -s /usr/share/cluster/fence_scsi_check.pl /etc/watchdog.d/

Starting with fence-agents-3.1.5-48.el6, an alternate script is offered that issues a "hard" reboot on the fenced node rather than relying on watchdog's reboot procedure, which can become blocked when GFS2 or device-mapper-multipath is in use. To use this version of the watchdog script, link it into the watchdog.d directory:

# ln -s /usr/share/cluster/fence_scsi_check_hardreboot.pl /etc/watchdog.d/

NOTE: Only one of the above scripts should be linked in the watchdog.d directory.

  3. Enable and start the watchdog service.
# chkconfig watchdog on
# service watchdog start
  4. Restart the cman service. Once it starts, unfencing should have completed successfully, so the node will be registered with the appropriate devices. The local cluster node's key is stored in the fence_scsi.key file, and a list of devices that were successfully registered is stored in the fence_scsi.dev file. Note that if either of these files is empty or does not exist, the fence_scsi_check watchdog script will exit immediately and no reboot will be triggered.
# service cman start
  5. The fence_scsi_check.pl or fence_scsi_check_hardreboot.pl watchdog script should trigger a reboot when a cluster node has been successfully fenced via the fence_scsi agent. To test this, use the fence_node utility; the cluster node that was fenced should reboot itself.
# fence_node <nodename>
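Before testing, it can help to confirm that the state files the watchdog script depends on are populated. The sketch below is illustrative only: the `report_state_file` helper is not part of fence-agents, and on a machine without a running cluster both files will report missing.

```shell
# Hedged sketch: inspect the fence_scsi state files the watchdog script
# depends on. report_state_file is a hypothetical helper for this
# example; on a host without a running cluster both files are absent.
report_state_file() {
    if [ -s "$1" ]; then
        echo "populated"
    else
        echo "missing-or-empty"
    fi
}

echo "key file: $(report_state_file /var/run/cluster/fence_scsi.key)"
echo "dev file: $(report_state_file /var/run/cluster/fence_scsi.dev)"
```

If either file reports missing or empty, the watchdog check is inert and a fencing test will not trigger a reboot.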

Configuration of the Watchdog Service for fence_scsi in RHEL 7, 8 or 9

This configuration assumes that you are using the fence_scsi agent and it is correctly configured as a stonith device in your Pacemaker cluster.

  1. Install the watchdog package.
# yum -y install watchdog
  2. Link the fence_scsi_check script to the /etc/watchdog.d/ directory. In RHEL 7 and later, the single script either gracefully shuts down the node (generally not recommended, as graceful shutdowns can hang and prevent appropriate failover) or abruptly reboots it (generally recommended), depending on the name of the symlinked file. Below are examples for both:
# ln -s /usr/share/cluster/fence_scsi_check /etc/watchdog.d/fence_scsi_check_hardreboot
# ln -s /usr/share/cluster/fence_scsi_check /etc/watchdog.d/
  3. Enable and start the watchdog service.
# systemctl enable watchdog
# systemctl start watchdog
  4. Test fencing and ensure the node reboots and unfences appropriately.
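The name of the symlink is what selects the behavior. The sketch below is a simplified illustration of that naming convention, not the internal logic of the real fence_scsi_check script, which may differ:

```shell
# Hedged sketch: illustrate how a check script could select its action
# from the name it is invoked under. This mirrors the symlink naming
# convention above; the real fence_scsi_check internals are not shown.
select_reboot_mode() {
    case "$(basename "$1")" in
        *hardreboot*) echo "hard-reboot" ;;  # abrupt reset (recommended)
        *)            echo "graceful" ;;     # clean shutdown (can hang)
    esac
}

select_reboot_mode /etc/watchdog.d/fence_scsi_check_hardreboot  # hard-reboot
select_reboot_mode /etc/watchdog.d/fence_scsi_check             # graceful
```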

Caveat


*With the watchdog daemon and check script, a node that is being scsi-fenced actually reboots, because it is observed by a watchdog device even when the target node is hung and can no longer execute the check script. There is, however, no mechanism that verifies the target node had enough time to notice that its key was removed (or that the watchdog device has meanwhile taken the node down). It is therefore unsafe to recover generic resources that must be protected from split-brain but do not involve the SCSI device the reservation is held on. In this approach, the watchdog device is used to reboot a node as reliably as possible, but it is not responsible for preventing split-brain, as it is with [SBD](/articles/2800691). The requirements on the watchdog device are therefore less stringent here than in an [SBD](/articles/2800691) scenario.*

Root Cause

With the release of RHEL 6.2, fence_scsi was integrated with watchdog so that a software watchdog timer can reboot a cluster node when it is fenced by fence_scsi.

The package watchdog is a general timer service available in RHEL that can be used to periodically monitor system resources. Fence agents have now been integrated with watchdog such that the watchdog service can reboot a cluster node after it has been fenced using fence_scsi. This eliminates the need for manual intervention to reboot the cluster node after it has been fenced using fence_scsi.

The purpose of the watchdog script, fence_scsi_check, is for a node to reboot itself when it has been fenced via the fence_scsi agent. Use of this script is optional and disabled by default.

The fence_scsi_check watchdog script works by tracking the devices a node registered with and the key used for those registrations. A list of devices that were successfully registered is stored in the /var/run/cluster/fence_scsi.dev file, and the local cluster node's key is stored in the /var/run/cluster/fence_scsi.key file. The watchdog script periodically checks that the local node's registration key is present on at least one of the devices. If the script determines that the key is not registered with any of the devices, the watchdog fires and the cluster node is rebooted.
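The periodic check described above can be sketched as follows. This is a hedged illustration, assuming the RHEL 6 state-file paths from this article and the sg_persist utility from sg3_utils; the real fence_scsi_check implementation may differ in detail. On a host with no cluster state the function reports "no-state" and takes no action, matching the documented behavior when the files are empty or absent.

```shell
# Hedged sketch of the watchdog check loop. Assumes the state-file
# paths documented above and sg_persist (sg3_utils); not the actual
# fence_scsi_check source.
check_scsi_registration() {
    key_file=/var/run/cluster/fence_scsi.key
    dev_file=/var/run/cluster/fence_scsi.dev

    # Missing or empty state files: exit without acting.
    if [ ! -s "$key_file" ] || [ ! -s "$dev_file" ]; then
        echo "no-state"
        return 0
    fi

    key=$(cat "$key_file")
    while read -r dev; do
        # sg_persist reports the SCSI-3 persistent-reservation keys
        # currently registered on the device.
        if sg_persist --in --read-keys -d "$dev" 2>/dev/null | grep -qi "$key"; then
            echo "registered"
            return 0
        fi
    done < "$dev_file"

    # Key gone from every device: the node has been fenced and the
    # watchdog would trigger a reboot at this point.
    echo "fenced"
    return 1
}

check_scsi_registration
```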

NOTE: Using a hardware watchdog timer as a power fencing device is different from the watchdog script mechanism described here. Using a watchdog-timer device (WDT) can be accomplished in pacemaker clusters with sbd.

For more information on fence_scsi:

Similar functionality is offered for fence_mpath and can be found here:

