The `watchdog` service fails to perform a hard reboot with the error `child process did not return in time`
Environment
- Red Hat Enterprise Linux 8, 9, 10
- High Availability Pacemaker cluster
- Fence agent used with watchdog script: `fence_scsi` or `fence_mpath`
Issue
- After configuring the watchdog script, the following error is reported when attempting to fence the node:

```
Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'
```

- Watchdog is unable to reboot the node after fencing using `fence_scsi`/`fence_mpath`
Resolution
In the file `/etc/watchdog.conf`, set `retry-timeout` and `test-timeout` to a value greater than `retry * retry-sleep * <number of shared devices>` (where `retry` and `retry-sleep` come from `/etc/sysconfig/stonith`).
Then restart the watchdog service:
```
# systemctl restart watchdog
```
Example: If the cluster setup has 10 shared storage devices and the following entries are present:

```
# cat /etc/sysconfig/stonith
retry=3
retry-sleep=2
verbose=yes # optional
```
Then the values for `retry-timeout` and `test-timeout` are derived as follows:
- Base timeout = 3 (`retry`) * 2 (`retry-sleep`) * 10 (number of shared devices) = 60
- Configure a value greater than the base timeout (say, 70 seconds)
The configured values will then be as below (after updating the file, restart the `watchdog` service):

```
# cat /etc/watchdog.conf
...
retry-timeout = 70
test-timeout = 70
```
- Note: If more shared devices are added, re-tune these values based on the new calculation. The parameters in `/etc/watchdog.conf` can also be set to a value larger than the calculated one to avoid frequent recalculation and configuration updates; in the example above, they could be set to 240 seconds as well. This does not mean the cluster will wait for the full 240-second window; it is the maximum window for which `watchdog` will wait before reporting a timeout event.
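The sizing rule above can be sketched as a small Python helper. This is purely illustrative and not part of the `watchdog` or `fence-agents` packages; the function name and the `headroom` parameter are hypothetical:

```python
# Hypothetical helper illustrating the sizing rule above; it is NOT part
# of any Red Hat tooling. "headroom" is illustrative slack added so the
# check binary can finish before watchdog times it out.
def suggested_watchdog_timeout(retry, retry_sleep, num_devices, headroom=10):
    """Return a retry-timeout/test-timeout value above the base timeout."""
    base = retry * retry_sleep * num_devices  # retry * retry-sleep * devices
    return base + headroom

# With the example values from /etc/sysconfig/stonith and 10 shared devices:
print(suggested_watchdog_timeout(retry=3, retry_sleep=2, num_devices=10))  # 70
```

Running this with the example values reproduces the 70-second figure used in the sample `/etc/watchdog.conf` above.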
Root Cause
As a part of watchdog script configuration, the following settings were implemented:
```
# cat /etc/sysconfig/stonith
retry=3
retry-sleep=2
verbose=yes # optional
```
From the source code of the `fence_scsi` agent:

```
# cat /usr/sbin/fence_scsi
...
def scsi_check(hardreboot=False):
    if len(sys.argv) >= 3 and sys.argv[1] == "repair":
        return int(sys.argv[2])
    options = {}
    options["--sg_turs-path"] = "/usr/bin/sg_turs"
    options["--sg_persist-path"] = "/usr/bin/sg_persist"
    options["--power-timeout"] = "5"
    options["retry"] = "0"
    options["retry-sleep"] = "1"
    options = scsi_check_get_options(options)
    if "verbose" in options and options["verbose"] == "yes":
        logging.getLogger().setLevel(logging.DEBUG)
    devs = dev_read(fail=False)
    if not devs:
        logging.error("No devices found")
        return 0
    key = get_key(fail=False)
    if not key:
        logging.error("Key not found")
        return 0
    for dev in devs:                                          <<<===---- [1]
        for n in range(int(options["retry"]) + 1):
            if n > 0:
                logging.debug("retry: " + str(n) + " of " + options["retry"])
            if key in get_registration_keys(options, dev, fail=False):
                logging.debug("key " + key + " registered with device " + dev)
                return 0                                      <<<===---- [2]
            else:
                logging.debug("key " + key + " not registered with device " + dev)
            if n < int(options["retry"]):                     <<<===---- [3]
                time.sleep(float(options["retry-sleep"]))
    logging.debug("key " + key + " not registered with any devices")  <<<===---- [4]
    if hardreboot == True:                                    <<<===---- [5]
        libc = ctypes.cdll['libc.so.6']
        libc.reboot(0x1234567)
    return 2
```
- [1] The fence agent validates the SCSI reservation key against all the devices reported by the `dev_read` function, which are kept in the `devs` variable.
- [2] When the SCSI reservation key is found on a device, the function completes immediately with `return 0`.
- [3] If the loop count is less than `retry`, the function waits for the `retry-sleep` window (as configured in `/etc/sysconfig/stonith`).
The default value for watchdog timeout is 60s:
```
# cat /etc/watchdog.conf
...
#retry-timeout = 60
#test-timeout = 60
```
So when validation of the SCSI reservation keys on all the shared devices does not complete within the 60-second window, the loop begins again from the first device. This is because the `watchdog` service terminates the script at 60 seconds, before a complete cycle through all the devices finishes. As a result, the script never reaches [4] and [5], and hence a hard reboot of the cluster node does not happen when the node is fenced.
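The timing can be verified with a short sketch that mirrors the loop structure of `scsi_check` and sums only the `time.sleep()` calls. The function name is illustrative; the execution time of the `sg_persist`/`mpathpersist` commands themselves is not counted, so real runs take even longer:

```python
# Illustrative sketch (not part of fence_scsi): worst-case seconds spent
# sleeping in scsi_check's device loop when the key is found on no device.
def worst_case_sleep(retry, retry_sleep, num_devices):
    total = 0.0
    for _ in range(num_devices):       # [1] loop over all shared devices
        for n in range(retry + 1):     # retry + 1 attempts per device
            if n < retry:              # [3] sleep between retry attempts
                total += retry_sleep
    return total

print(worst_case_sleep(retry=3, retry_sleep=2, num_devices=10))  # 60.0
```

With `retry=3` and `retry-sleep=2`, sleeping alone already consumes the full default 60-second `test-timeout` for 10 devices, so the command execution time pushes the run past the limit.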
Diagnostic Steps
- Configure the cluster to use an I/O fence agent (either `fence_scsi` or `fence_mpath`) along with the respective watchdog script to perform the hard reboot.
- Configure the parameters as detailed in the following KCS, which is necessary for the `fence_scsi_check_hardreboot` watchdog script:
- While testing the fencing of the node, the following events get reported:
- The key status before fencing:

```
# cat /var/log/watchdog/test-bin.stderr
...
DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
DEBUG:root:0 PR generation=0x7af, 8 registered reservation keys follow:
    0x1
    0x2
    0x1
    0x2
    0x1
    0x2
    0x1
    0x2
```

- As soon as the node is fenced, the key corresponding to the fenced node gets revoked:

```
DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
DEBUG:root:0 PR generation=0x7b0, 4 registered reservation keys follow:
    0x2
    0x2
    0x2
    0x2
```

- It can be noticed that the looping starts over all the shared devices:

```
DEBUG:root:retry: 1 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01   <<===--- First device
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
DEBUG:root:retry: 2 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02   <<===--- Second device
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 1 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 2 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared10
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared10   <<<===--- 10th device is attempted the 3rd time
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared11   <<<===--- 11th device in the list is verified
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01   <<===--- then it loops back to the first device
```
- With the values in `/etc/sysconfig/stonith`, the retry attempts for all the devices do not complete within the default window of 60 seconds.
- As soon as it reaches the 11th device, the 60-second window is exceeded and `watchdog` logs the following failure event:
```
Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'
```
- If the cluster is managing a GFS2 filesystem, it is likely that the fenced node will enter a hung state.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.