The `watchdog` failed to perform a hardreboot with error `child process did not return in time`

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 8, 9, 10
  • High Availability Pacemaker cluster
  • Fence agent used with watchdog script: fence_scsi or fence_mpath

Issue

  • After configuring the watchdog script and when attempting to fence the node, the following error is reported:

    Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
    Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
    Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
    Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'
    
  • Watchdog unable to reboot the node after fencing using fence_scsi/fence_mpath

Resolution

In the file /etc/watchdog.conf, update the value for retry-timeout and test-timeout to be greater than the value of : retry * retry-sleep * Number of shared devices.

Then restart the watchdog service:

# systemctl restart watchdog

Example: If the Cluster setup has 10 shared storage devices and the following entries are present:

# cat /etc/sysconfig/stonith 
retry=3
retry-sleep=2
verbose=yes    # optional

Then the value for retry-timeout and test-timeout will be as:

  • Base timeout = 3 (retry) * 2 (retry-sleep) * 10 (Number of shared devices) = 60
  • Configure the value greater than the base timeout value (say 70 seconds)

The configured values will be as below (Post update of the file, restart watchdog service):

# cat /etc/watchdog.conf
...
retry-timeout        = 70
test-timeout        = 70
  • Note: If the shared devices are being added, ensure that the value is tuned depending on the new calculation value. The value to these parameters in /etc/watchdog.conf file can be set to a larger value than calculated so to avoid frequent calculations/update in the configuration. Considering the same above example, the values can be set to 240sec as well. This value does not mean cluster will wait for the complete 240sec window, rather this is the maximum window for which the watchdog will wait before reporting a timed out event.

Root Cause

As a part of watchdog script configuration, the following settings were implemented:

From the source code of fence_scsi agent:

# cat /usr/sbin/fence_scsi
...
def scsi_check(hardreboot=False):
        if len(sys.argv) >= 3 and sys.argv[1] == "repair":
                return int(sys.argv[2])
        options = {}
        options["--sg_turs-path"] = "/usr/bin/sg_turs"
        options["--sg_persist-path"] = "/usr/bin/sg_persist"
        options["--power-timeout"] = "5"
        options["retry"] = "0"
        options["retry-sleep"] = "1"
        options = scsi_check_get_options(options)
        if "verbose" in options and options["verbose"] == "yes":
                logging.getLogger().setLevel(logging.DEBUG)
        devs = dev_read(fail=False)
        if not devs:
                logging.error("No devices found")
                return 0
        key = get_key(fail=False)
        if not key:
                logging.error("Key not found")
                return 0
        for dev in devs:  <<<===---- [1]
                for n in range(int(options["retry"]) + 1):
                        if n > 0:
                                logging.debug("retry: " + str(n) + " of " + options["retry"]) 
                        if key in get_registration_keys(options, dev, fail=False):
                                logging.debug("key " + key + " registered with device " + dev)
                                return 0   <<<==----- [2]
                        else:
                                logging.debug("key " + key + " not registered with device " + dev) 

                        if n < int(options["retry"]):   <<==--- [3]
                                time.sleep(float(options["retry-sleep"]))  

        logging.debug("key " + key + " registered with any devices")   <<<===---- [4]

        if hardreboot == True:    <<<===---- [5]
                libc = ctypes.cdll['libc.so.6']
                libc.reboot(0x1234567)
        return 2
  • [1] The fence agent validates the SCSI reservation key with all the devices reported by dev_read function with value being kept in devs variable.
  • [2] When the SCSI reservation key is found on the first device itself, the function completes and exits with return 0
  • [3] If the loop count is less than the retry, then the function waits for retry-sleep window (as configured in /etc/sysconfig/stonith).

The default value for watchdog timeout is 60s:

# cat /etc/watchdog.conf
...
#retry-timeout        = 60
#test-timeout        = 60

So when the valiation of the SCSI reservation keys on all the shared devices does not complete within 60s window, the loop begins from the first device again. This is primarily because the watchdog service exits the script at 60s and before the complete cycling through all devices happens. Due to this, the script does not reach the execution of [4] and [5]. Hence a hard reboot of the cluster node does not happen upon node getting fenced.

Diagnostic Steps

  1. Configure the cluster to use IO fence agent (either fence_scsi or fence_mpath) along with the respective watchdog script to perform the hard reboot:

  2. Configure the parameters as detailed in following KCS which is necessary for fence_scsi_check_hardreboot watchdog script:

  3. While testing the fence of the node, the following event gets reported:

    • The key status before fencing:
    # cat /var/log/watchdog/test-bin.stderr
    ...
    DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
    
    DEBUG:root:0   PR generation=0x7af,     8 registered reservation keys follow:
        0x1
        0x2
        0x1
        0x2
        0x1
        0x2
        0x1
        0x2
    
    • As soon as the node is fenced, the key corresponding to the fenced node gets revoked:
    DEBUG:root:key 1 registered with device /dev/mapper/mpath-shared01
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
    
    DEBUG:root:0   PR generation=0x7b0,     4 registered reservation keys follow:
        0x2
        0x2
        0x2
        0x2
    
    • It can be noticed that the looping starts for all the shared devices:
    DEBUG:root:retry: 1 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01  <<===--- First device
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
    DEBUG:root:retry: 2 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
    DEBUG:root:retry: 3 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared01
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02  <<===--- Second device
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
    DEBUG:root:retry: 1 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
    DEBUG:root:retry: 2 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared02
    DEBUG:root:retry: 3 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared02
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared10
    DEBUG:root:retry: 3 of 3
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared10  <<<===--- 10th device is attempted the 3rd time
    ...
    DEBUG:root:key 1 not registered with device /dev/mapper/mpath-shared11  <<<==---- 11th device in the list is verified
    INFO:root:Executing: /usr/sbin/mpathpersist -i -k -d /dev/mapper/mpath-shared01  <<==--- then it loops back to first device
    
    • With the values in /etc/sysconfig/stonith, the retry attempt for all the devices does not complete within the default window of 60s.

    • As soon it reaches the 11th device, the 60sec window is exceeded and watchdog logs the following failure event:

    Jul 16 02:46:42 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit
    Jul 16 02:47:43 node01 watchdog[5208]: test-binary /etc/watchdog.d/fence_scsi_check_hardreboot exceeded time limit 60
    Jul 16 02:47:43 node01 watchdog[5208]: Retry timed-out at 61 seconds for /etc/watchdog.d/fence_scsi_check_hardreboot
    Jul 16 02:47:43 node01 watchdog[5208]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 247 = 'child process did not return in time'
    
  4. If the cluster is managing a GFS2 filesystem, then it is likely that the fenced node will go in hung state.

SBR
Components
Category

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.