Is there a watchdog script for fence_mpath to reboot a RHEL High Availability or Resilient Storage cluster node?
Environment
- Red Hat Enterprise Linux 7, 8, 9, 10 (with the High Availability Add-on)
Issue
- Is there a watchdog script for Red Hat Enterprise Linux Server (with the High Availability or Resilient Storage Add-Ons) to reboot a cluster node that has been fenced by fence_mpath?
Resolution
Configuration of the Watchdog Service for fence_mpath in RHEL 7
This configuration assumes that you are using the fence_mpath agent and it is correctly configured as a stonith device in your Pacemaker cluster.
NOTE: Ensure the system is up to date in order to avoid a known bug that prevents the watchdog script from working when SELinux is in Enforcing mode.
- Install fence-agents-4.2.1-11.el7 or later, which has the watchdog feature for fence_mpath.
    # yum -y install fence-agents-4.2.1-11.el7
- Install the watchdog package.
    # yum -y install watchdog
- For a hard reboot, create a wrapper executable in the /etc/watchdog.d directory to call /usr/share/cluster/fence_mpath_check_hardreboot. (Note: a previous version of this solution contained an instruction to create a symlink to /usr/share/cluster/fence_mpath_check_hardreboot. This can result in AVC denials from SELinux, due to execution under the fenced_t context instead of the watchdog_unconfined_t context.)
    # cat > /etc/watchdog.d/fence_mpath_check_hardreboot << EOF
    #!/bin/sh
    exec /usr/share/cluster/fence_mpath_check_hardreboot "\${@:-}"
    EOF
    # chmod +x /etc/watchdog.d/fence_mpath_check_hardreboot
- Enable and start the watchdog service.
    # systemctl enable watchdog
    # systemctl start watchdog
- Test fencing and ensure that the node reboots and that it is unfenced appropriately the next time the cluster starts there.
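Before testing fencing, it can help to confirm that the wrapper is a regular executable file and not a symlink. The following is a minimal sketch only; check_watchdog_wrapper is a hypothetical helper name and is not shipped with fence-agents or watchdog.

```shell
#!/bin/sh
# Hypothetical helper (not part of fence-agents or watchdog): sanity-check a
# wrapper placed in /etc/watchdog.d before testing fencing.
check_watchdog_wrapper() {
    wrapper=$1
    # Check for a symlink first, because -f and -x follow symlinks.
    if [ -L "$wrapper" ]; then
        echo "FAIL: $wrapper is a symlink; use a wrapper script instead"
        return 1
    fi
    if [ ! -f "$wrapper" ]; then
        echo "FAIL: $wrapper does not exist"
        return 1
    fi
    if [ ! -x "$wrapper" ]; then
        echo "FAIL: $wrapper is not executable (run chmod +x)"
        return 1
    fi
    echo "OK: $wrapper"
}
```

On a cluster node this would be called as check_watchdog_wrapper /etc/watchdog.d/fence_mpath_check_hardreboot, followed by systemctl is-active watchdog to confirm that the service is running.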
Root Cause
With the release of RHEL 7.6, fence_mpath was integrated with watchdog so that a software watchdog timer can reboot a cluster node when it is fenced by fence_mpath. Earlier versions of fence_mpath do not have the watchdog feature.
The package watchdog is a general timer service available in RHEL that can be used to periodically monitor system resources. Fence agents have now been integrated with watchdog such that the watchdog service can reboot a cluster node after it has been fenced using fence_mpath. This eliminates the need for manual intervention to reboot the cluster node after it has been fenced using fence_mpath.
The purpose of the watchdog script, fence_mpath_check_hardreboot, is for a node to reboot itself when it has been fenced via the fence_mpath agent. Use of this script is optional and disabled by default.
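As background on why the wrapper forwards its arguments: according to the watchdog(8) man page, executables in the test directory are called with "test" as the first argument, and after a non-zero test result they are called again with "repair" plus the failed exit code. The following toy sketch illustrates that calling convention only. It does NOT reproduce the logic of fence_mpath_check_hardreboot; watchdog_check and KEY_MARKER are hypothetical names, with a marker file standing in for "this node's key is still registered".

```shell
#!/bin/sh
# Toy illustration only; NOT the logic of fence_mpath_check_hardreboot.
# KEY_MARKER is a hypothetical stand-in for "this node's key is still registered".
KEY_MARKER=${KEY_MARKER:-/tmp/example_key_registered}

watchdog_check() {
    case "${1:-test}" in
        test)
            # Exit status 0 tells watchdog the check passed.
            [ -e "$KEY_MARKER" ] && return 0
            return 1
            ;;
        repair)
            # $2 carries the exit code of the failed test; this toy cannot
            # repair anything, so it hands the same failure code back.
            return "${2:-1}"
            ;;
        *)
            return 1
            ;;
    esac
}
```

A standalone script built on this sketch would end with watchdog_check "$@" so that its exit status reaches the watchdog daemon.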
Similar functionality is offered for fence_scsi and is covered in a separate solution.
Additional Resources
- The watchdog rebooted a node after the fence_scsi/fence_mpath binary failed with a return code of 1 in a Red Hat High Availability cluster
- The watchdog failed to perform a hardreboot with error "child process did not return in time"