Administrative Procedures for RHEL High Availability Clusters - Validating a Watchdog Timer Device (WDT) to Use with sbd
Contents
- Overview
- Find and understand available watchdog timer device(s)
- Confirm watchdog device halts system under expected conditions
- Review Red Hat Support policies and guidance for sbd
Overview
Applicable Environments
- Red Hat Enterprise Linux (RHEL) 6, 7, 8 with the High Availability Add-On
- pacemaker
- sbd fencing being implemented or considered
Recommended Prior Reading
Useful Guides and References
Introduction
This guide is intended to assist in determining whether a system has a watchdog timer (WDT) device that is suitable for usage with sbd.
sbd's ability to serve as a reliable fence mechanism depends on its integration with such a watchdog timer. Because of this, every server on which sbd will be deployed should be confirmed to have a watchdog timer device that is capable of halting that system when it is to be fenced.
For RHEL 8 see the following: How do I list and test available watchdog devices to use with SBD in a RHEL 8 Pacemaker cluster?
Find and understand available watchdog timer device(s)
Determine if a device is found and ready for use
Begin by checking whether any watchdog devices exist on the servers where sbd will be used.
# # Look for watchdog devices in /dev
# ls -l /dev/watchdog*
crw-------. 1 root root 10, 130 Feb 22 11:43 /dev/watchdog
crw-------. 1 root root 253, 0 Feb 22 11:43 /dev/watchdog0
If a watchdog timer device is available on this system and the kernel has detected it and loaded the necessary driver, there should be a /dev/watchdog device, and there may be one or more individual /dev/watchdogX devices as well. The /dev/watchdog node is the device that should be used for testing and in the sbd configuration, except in special circumstances such as when multiple watchdog devices are present.
If this device is found in /dev, then testing of its functionality can proceed. If no such device exists, then the administrator will need to inspect whether such a hardware device is available on the system, whether the kernel has support for that device, and whether the correct kernel module is loaded. This falls outside the scope of this article, so please search for other content in the Red Hat Customer Portal or contact Red Hat Support for assistance if needed.
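The check above can also be scripted when several servers need to be surveyed. The following is a minimal sketch, not an official tool: find_watchdog_nodes is a hypothetical helper name, and it assumes that any character device named watchdog* in the given directory is a watchdog node.

```shell
# Hypothetical helper: list character-device nodes named watchdog* under a
# directory (default /dev). Prints one path per line; returns 1 if none exist.
find_watchdog_nodes() {
    fw_dir=${1:-/dev}
    fw_found=1
    for fw_node in "$fw_dir"/watchdog*; do
        # -c is true only for character devices, which is what a watchdog node is
        if [ -c "$fw_node" ]; then
            echo "$fw_node"
            fw_found=0
        fi
    done
    return "$fw_found"
}

# This only inspects the device nodes; it never opens them, so no timer starts.
if find_watchdog_nodes /dev; then
    echo "Watchdog device node(s) found; testing can proceed."
else
    echo "No watchdog device node found in /dev."
fi
```

Because the helper only tests for the existence and type of the nodes, it is safe to run on a production system.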
Inspect watchdog device and configuration for more information
In some cases there may be a need to understand the type of device that is in use and to explore any settings that may be associated with it. There isn't necessarily a straightforward way to associate the individual /dev/watchdogX devices back to their hardware devices or drivers directly, but dmesg can usually reveal at least which drivers are loaded - which in most cases is sufficient since there is typically only one watchdog device available to the system.
To list the available watchdog devices:
# wdctl
Check dmesg for watchdog messages:
# # Most watchdog drivers contain the word "watchdog" in their description
# dmesg | grep -i watchdog
[ 9.336858] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
This finding tells us that the iTCO_wdt module was loaded for a watchdog device - iTCO is an Intel watchdog timer.
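To pull just the driver name out of such messages, the grep can be followed by a small filter. This is a sketch under assumptions: watchdog_drivers_from_dmesg is a hypothetical helper, and it assumes each relevant line has the form "[timestamp] driver: message" as in the example above. Lines that do not follow that form, such as messages prefixed with "NMI watchdog:", are skipped.

```shell
# Hypothetical helper: read dmesg output on stdin and print the unique driver
# prefixes of lines that mention "watchdog".
# Assumes the "[timestamp] driver: message" layout shown in the example above.
watchdog_drivers_from_dmesg() {
    grep -i watchdog |
        sed -n 's/^\[[^]]*] *\([A-Za-z0-9_][A-Za-z0-9_]*\):.*/\1/p' |
        sort -u
}

# Sample line copied from the dmesg output above
printf '%s\n' '[    9.336858] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11' |
    watchdog_drivers_from_dmesg
# prints: iTCO_wdt
```

On a live system the same filter would be fed from dmesg directly, for example with dmesg | watchdog_drivers_from_dmesg.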
If dmesg does not report anything that includes the string "watchdog", it is possible to iterate through the known-watchdog drivers available in the loaded kernel to see if any of them are active on the system:
# # Loop through known watchdog modules and see if lsmod lists any of them as loaded on this system
# for module in /lib/modules/$(uname -r)/kernel/drivers/watchdog/*; do name=$(basename "$module"); lsmod | grep "${name%%.ko*}"; done
iTCO_vendor_support 13718 1 iTCO_wdt
iTCO_wdt 13480 0
iTCO_vendor_support 13718 1 iTCO_wdt
Again, on this example server, iTCO_wdt is found to be loaded.
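Note that iTCO_vendor_support appears twice in the output above because grep matches the module name anywhere on the line, including the "used by" column. A stricter variant compares only the first column. The sketch below is one possible way to do that: loaded_watchdog_modules is a hypothetical helper, and reading /proc/modules by default is an assumption that the module name is the first field there, as it is in lsmod output.

```shell
# Hypothetical helper: print the lines for loaded modules whose names exactly
# match a module file under the watchdog driver directory.
#   $1: directory of watchdog modules (*.ko or compressed, e.g. *.ko.xz)
#   $2: file with one loaded module per line, name in the first column
#       (default /proc/modules, assumed to use that layout)
loaded_watchdog_modules() {
    lw_dir=$1
    lw_list=${2:-/proc/modules}
    for lw_path in "$lw_dir"/*.ko*; do
        [ -e "$lw_path" ] || continue
        lw_name=$(basename "$lw_path")
        lw_name=${lw_name%%.ko*}
        # Loaded module names use underscores even if the file name has hyphens
        lw_name=$(printf '%s\n' "$lw_name" | sed 's/-/_/g')
        # Exact match on the first column avoids hits in the "used by" column
        awk -v m="$lw_name" '$1 == m' "$lw_list"
    done
}

# Example invocation:
#   loaded_watchdog_modules "/lib/modules/$(uname -r)/kernel/drivers/watchdog"
```

With the example server above, this variant would be expected to list iTCO_vendor_support and iTCO_wdt once each.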
A watchdog device may also exist without appearing in any of the aforementioned locations, in which case the administrator or vendor will need to inspect further whether it is functional and ready for use, and what its properties are.
Once the watchdog driver is known, it can be further inspected with modinfo:
# # Obtain information about the watchdog driver discovered in earlier steps
# modinfo iTCO_wdt
filename: /lib/modules/3.10.0-229.el7.x86_64/kernel/drivers/watchdog/iTCO_wdt.ko
alias: platform:iTCO_wdt
alias: char-major-10-130
license: GPL
version: 1.11
description: Intel TCO WatchDog Timer Driver
author: Wim Van Sebroeck <wim@iguanabe>
rhelversion: 7.1
srcversion: 1BA54686E238A2655AF3C1A
depends: iTCO_vendor_support
intree: Y
vermagic: 3.10.0-229.el7.x86_64 SMP mod_unload modversions
signer: Red Hat Enterprise Linux kernel signing key
sig_key: A3:CB:8C:C3:19:50:4A:B5:2C:FB:76:BA:F8:D8:A2:A7:39:68:9C:56
sig_hashalgo: sha256
parm: heartbeat:Watchdog timeout in seconds. 5..76 (TCO v1) or 3..614 (TCO v2), default=30) (int)
parm: nowayout:Watchdog cannot be stopped once started (default=0) (bool)
parm: turn_SMI_watchdog_clear_off:Turn off SMI clearing watchdog (depends on TCO-version)(default=1) (int)
The parm lines at the end list the module's configurable parameters, so if any specific behavior is desired from this device, it can be set in module configuration files. Keep in mind that sbd controls the heartbeat or timeout setting once it is configured, so that setting does not need to be tuned through these module parameters. However, it is useful to note its default value now (30 seconds in the above example output) to set expectations for the following test.
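The default can be read out of the modinfo output with a short filter rather than by eye. This is a sketch only: parm_default is a hypothetical helper, and it assumes the parameter description contains a "default=N" token as the iTCO_wdt lines above do. Other drivers may name the timeout parameter differently or omit the default from the description, in which case nothing is printed.

```shell
# Hypothetical helper: print the default value documented for a module
# parameter. Reads modinfo output on stdin; $1 is the parameter name
# (default: heartbeat, the name iTCO_wdt uses in the output above).
parm_default() {
    pd_name=${1:-heartbeat}
    sed -n "s/^parm:[[:space:]]*${pd_name}:.*default=\([0-9][0-9]*\).*/\1/p"
}

# Sample line copied from the modinfo output above
printf '%s\n' 'parm:           heartbeat:Watchdog timeout in seconds. 5..76 (TCO v1) or 3..614 (TCO v2), default=30) (int)' |
    parm_default heartbeat
# prints: 30
```

On a live system the equivalent would be modinfo iTCO_wdt | parm_default heartbeat, substituting the driver and parameter names found in the earlier steps.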
Confirm watchdog device halts system under expected conditions
Goal
Determine whether the discovered watchdog device can be used within a RHEL High Availability cluster in conjunction with sbd. To qualify, it must be confirmed to be able to halt the system when it has not been updated for the configured timeout period.
Test details
The functionality of the watchdog can be demonstrated by simply echoing a value to /dev/watchdog to initiate the countdown and monitoring the result.
Successful result
If the watchdog timer is functional and appropriate for use with sbd, this test will result in the system halting within the timeout period following the initial update of the /dev/watchdog device. This halt should be a hard shutdown or reboot; the server should not go through any graceful shutdown where services are stopped. On the system console, the screen should go blank and/or show the typical hardware initialization screens seen when the server boots up. Any ssh or terminal session that was open to the system should become unresponsive.
Procedure
- Begin by accessing the system console for the server that will be tested, either locally at the server, through an IP KVM or remote serial console, or through the system management interface. This will give visibility of what the system is doing even if an ssh session is disrupted.
- In a shell, either through the system console, a terminal, or a remote session, execute an "echo" and redirect it to the selected watchdog device. This can be as simple as:
# echo "test" > /dev/watchdog
- Or alternatively with a countdown to make it more exciting and informative along the way (change the timeout=30 assignment to match the expected timeout or heartbeat for the device, and change /dev/watchdog if needed):
# timeout=30; echo; echo; echo "Opening watchdog device to start countdown from $timeout"; echo start > /dev/watchdog; echo "Counting down. The system should halt near 0 if the watchdog is functional."; while true; do if [ $timeout -le 0 ]; then echo "Timeout expired. System should be halting."; else echo "$timeout..."; fi; timeout=$((timeout-1)); sleep 1; done
Opening watchdog device to start countdown from 30
Counting down. The system should halt near 0 if the watchdog is functional.
30...
29...
28...
27...
26...
- Watch to see if the system resets in the expected amount of time. If it doesn't, then the watchdog device may need further inspection to see if it is appropriate. If the system did reset in the proper amount of time, it should be usable with sbd.
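The countdown in the procedure loops forever if the watchdog never fires. A variant that gives up after a grace period and reports the failure can be sketched as below. watchdog_countdown and its grace-period argument are hypothetical additions for illustration, not part of sbd or of Red Hat's tooling. The block only defines the function; nothing is written to a watchdog device until the function is called.

```shell
# Hypothetical helper: start a watchdog countdown and report if it never fires.
#   $1: watchdog device to open, $2: expected timeout in seconds,
#   $3: extra seconds to keep waiting past zero before giving up (default 10)
watchdog_countdown() {
    wc_dev=$1
    wc_left=$2
    wc_grace=${3:-10}
    echo "Opening $wc_dev to start countdown from $wc_left"
    # Writing to the device starts the timer; nothing refreshes it afterwards
    echo start > "$wc_dev" || return 1
    while [ "$wc_left" -gt "-$wc_grace" ]; do
        if [ "$wc_left" -le 0 ]; then
            echo "Timeout expired. System should be halting."
        else
            echo "$wc_left..."
        fi
        wc_left=$((wc_left - 1))
        sleep 1
    done
    # Reaching this point means the system was never reset
    echo "Still running ${wc_grace}s after the timeout: the watchdog did not reset the system."
    return 1
}
```

On the server under test it would be invoked as root as watchdog_countdown /dev/watchdog 30. As with the commands in the procedure, expect the system to be reset without a clean shutdown if the watchdog works, so run it only on a system where that is acceptable.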
Review Red Hat Support policies and guidance for sbd
With these results in hand, consult Red Hat's other content on sbd to determine whether the device is acceptable for use, to configure the cluster, and to proceed to further testing and use.