The `watchdog` failed to perform a hardreboot with error `child process did not return in time`

Solution Verified - Updated 12 Sept 2025

Environment

Red Hat Enterprise Linux 8, 9, 10
High Availability Pacemaker cluster
Fence agent used with watchdog script: fence_scsi or fence_mpath

Issue

After configuring the watchdog script and when attempting to fence the node, the following error is reported:

Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'

Watchdog unable to reboot the node after fencing using fence_scsi/fence_mpath

Resolution

In the file /etc/watchdog.conf, update the value for retry-timeout and test-timeout to be greater than the value of : retry * retry-sleep * Number of shared devices.

Then restart the watchdog service:

# systemctl restart watchdog

Example: If the Cluster setup has 10 shared storage devices and the following entries are present:

# cat /etc/sysconfig/stonith 
retry=3
retry-sleep=2
verbose=yes    # optional

Then the value for retry-timeout and test-timeout will be as:

Base timeout = 3 (retry) * 2 (retry-sleep) * 10 (Number of shared devices) = 60
Configure the value greater than the base timeout value (say 70 seconds)

The configured values will be as below (Post update of the file, restart watchdog service):

# cat /etc/watchdog.conf
...
retry-timeout        = 70
test-timeout        = 70

Note: If the shared devices are being added, ensure that the value is tuned depending on the new calculation value. The value to these parameters in /etc/watchdog.conf file can be set to a larger value than calculated so to avoid frequent calculations/update in the configuration. Considering the same above example, the values can be set to 240sec as well. This value does not mean cluster will wait for the complete 240sec window, rather this is the maximum window for which the watchdog will wait before reporting a timed out event.

Root Cause

As a part of watchdog script configuration, the following settings were implemented:

The watchdog rebooted a node after the fence_scsi/fence_mpath binary failed with a return code of 1 in a Red Hat High Availability cluster
```
# cat /etc/sysconfig/stonith 
retry=3
retry-sleep=2
verbose=yes    # optional
```

From the source code of fence_scsi agent:

# cat /usr/sbin/fence_scsi
...
def scsi_check(hardreboot=False):
        if len(sys.argv) >= 3 and sys.argv[1] == "repair":
                return int(sys.argv[2])
        options = {}
        options["--sg_turs-path"] = "/usr/bin/sg_turs"
        options["--sg_persist-path"] = "/usr/bin/sg_persist"
        options["--power-timeout"] = "5"
        options["retry"] = "0"
        options["retry-sleep"] = "1"
        options = scsi_check_get_options(options)
        if "verbose" in options and options["verbose"] == "yes":
                logging.getLogger().setLevel(logging.DEBUG)
        devs = dev_read(fail=False)
        if not devs:
                logging.error("No devices found")
                return 0
        key = get_key(fail=False)
        if not key:
                logging.error("Key not found")
                return 0
        for dev in devs:  <<<===---- [1]
                for n in range(int(options["retry"]) + 1):
                        if n > 0:
                                logging.debug("retry: " + str(n) + " of " + options["retry"]) 
                        if key in get_registration_keys(options, dev, fail=False):
                                logging.debug("key " + key + " registered with device " + dev)
                                return 0   <<<==----- [2]
                        else:
                                logging.debug("key " + key + " not registered with device " + dev) 

                        if n < int(options["retry"]):   <<==--- [3]
                                time.sleep(float(options["retry-sleep"]))  

        logging.debug("key " + key + " registered with any devices")   <<<===---- [4]

        if hardreboot == True:    <<<===---- [5]
                libc = ctypes.cdll['libc.so.6']
                libc.reboot(0x1234567)
        return 2

[1] The fence agent validates the SCSI reservation key with all the devices reported by dev_read function with value being kept in devs variable.
[2] When the SCSI reservation key is found on the first device itself, the function completes and exits with return 0
[3] If the loop count is less than the retry, then the function waits for retry-sleep window (as configured in /etc/sysconfig/stonith).

The default value for watchdog timeout is 60s:

# cat /etc/watchdog.conf
...
#retry-timeout        = 60
#test-timeout        = 60

So when the valiation of the SCSI reservation keys on all the shared devices does not complete within 60s window, the loop begins from the first device again. This is primarily because the watchdog service exits the script at 60s and before the complete cycling through all devices happens. Due to this, the script does not reach the execution of [4] and [5]. Hence a hard reboot of the cluster node does not happen upon node getting fenced.

Diagnostic Steps

Configure the cluster to use IO fence agent (either fence_scsi or fence_mpath) along with the respective watchdog script to perform the hard reboot:
- Is there a watchdog script for fence_scsi to reboot a RHEL High Availability or Resilient Storage cluster node?
- Is there a watchdog script for fence_mpath to reboot a RHEL High Availability or Resilient Storage cluster node?
Configure the parameters as detailed in following KCS which is necessary for fence_scsi_check_hardreboot watchdog script:
- The watchdog rebooted a node after the fence_scsi/fence_mpath binary failed with a return code of 1 in a Red Hat High Availability cluster

While testing the fence of the node, the following event gets reported:

The key status before fencing:

# cat /var/log/watchdog/test-bin.stderr
...
DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01

DEBUG:root:0   PR generation=0x7af,     8 registered reservation keys follow:
    0x1
    0x2
    0x1
    0x2
    0x1
    0x2
    0x1
    0x2

As soon as the node is fenced, the key corresponding to the fenced node gets revoked:

DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01

DEBUG:root:0   PR generation=0x7b0,     4 registered reservation keys follow:
    0x2
    0x2
    0x2
    0x2

It can be noticed that the looping starts for all the shared devices:

DEBUG:root:retry: 1 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01  <<===--- First device
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
DEBUG:root:retry: 2 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02  <<===--- Second device
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 1 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 2 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared10
DEBUG:root:retry: 3 of 3
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared10  <<<===--- 10th device is attempted the 3rd time
...
DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared11  <<<==---- 11th device in the list is verified
INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01  <<==--- then it loops back to first device

With the values in /etc/sysconfig/stonith, the retry attempt for all the devices does not complete within the default window of 60s.
As soon it reaches the 11th device, the 60sec window is exceeded and watchdog logs the following failure event:

Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'

If the cluster is managing a GFS2 filesystem, then it is likely that the fenced node will go in hung state.

SBR

Clusterha

Product(s)

Red Hat Enterprise Linux for x86_64

Components

Category

Troubleshoot

Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.