The watchdog rebooted a node after the fence_scsi/fence_mpath binary failed with a return code of 1 in a Red Hat High Availability cluster

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 6, 7, 8, 9 (with the High Availability Add-on)

Issue

  • A cluster node with the fence_scsi watchdog script configured was gracefully rebooted. /var/log/messages shows errors like the following:
Mar 28 10:28:08 node1 watchdog[22429]: test binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1
Mar 28 10:28:08 node1 watchdog[22429]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1
Mar 28 10:28:08 node1 watchdog[22429]: shutting down the system because of error 1

Resolution

Red Hat Enterprise Linux 6


There are no plans to fix this issue for RHEL 6.

Red Hat Enterprise Linux 7


Upgrade to [`fence-agents-all-4.2.1-24.el7`](/errata/RHSA-2019:2037) or later, then apply the [workaround](#workaround). This erratum and later releases add the `retry` and `retry-sleep` options for `/etc/sysconfig/stonith`.

There are no plans to support these options for fence_mpath on RHEL 7.

Red Hat Enterprise Linux 8


See [workaround](#workaround) as RHEL 8 includes the options `retry` and `retry-sleep` for `/etc/sysconfig/stonith`.

Support for the `retry` and `retry-sleep` options in `/etc/sysconfig/stonith` with fence_mpath was added in erratum RHBA-2021:4148. Use fence-agents-4.2.1-75.el8 or later.

Red Hat Enterprise Linux 9 or later


See [workaround](#workaround) as RHEL 9 includes the options `retry` and `retry-sleep` for `/etc/sysconfig/stonith`.

Workaround


Add the following `retry` and `retry-sleep` options to `/etc/sysconfig/stonith` so that the watchdog script is more fault-tolerant. Create the file if it does not already exist.
# cat /etc/sysconfig/stonith 
retry=3
retry-sleep=2
verbose=yes    # optional
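
The effect of `retry` and `retry-sleep` can be pictured as a small retry loop. The following Python sketch is illustrative only and is not the fence-agents implementation. The function name `check_with_retry` is hypothetical, and the sketch assumes that `retry` counts additional attempts after the first failure and that `retry-sleep` is the pause in seconds between attempts.

```python
import time


def check_with_retry(check, retry=3, retry_sleep=2, sleep=time.sleep):
    """Call check() until it succeeds or the retries are used up.

    ASSUMPTION: 'retry' is the number of extra attempts after the first
    failure; the real watchdog script may count attempts differently.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        if check():
            return True, attempts
        if attempts > retry:
            # First attempt plus 'retry' extra attempts have all failed.
            return False, attempts
        sleep(retry_sleep)


if __name__ == "__main__":
    # Simulated storage check: two slow responses, then a good one.
    outcomes = iter([False, False, True])
    print(check_with_retry(lambda: next(outcomes), retry=3, retry_sleep=0))
```

Under these assumptions, `retry=3` gives the storage check up to four attempts, so a single slow response from storage would not by itself cause the test to fail.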

Root Cause

The fence_scsi_check or fence_scsi_check_hardreboot watchdog script tests whether the node can access the shared storage managed by the cluster and by a fence_scsi stonith device. A node could lose access to the shared LUNs due to fencing or due to some other issue at the storage level. If the test fails, the node reboots itself.

A return code of 1 is the generic error code. In this scenario it indicates that a storage check command did not complete within 5 seconds.

The scsi_check() function in the watchdog script sets the --power-timeout option to 5.

def scsi_check(hardreboot=False):
    if len(sys.argv) >= 3 and sys.argv[1] == "repair":
        return int(sys.argv[2])
    options = {}
    options["--sg_turs-path"] = "/usr/bin/sg_turs"
    options["--sg_persist-path"] = "/usr/bin/sg_persist"
    options["--power-timeout"] = "5"

The watchdog script uses the fencing library's run_command() function, through the script's run_cmd() helper, to execute external commands (sg_persist and sg_turs) in the get_registration_keys() function. The run_command() function uses the value of --power-timeout as its timeout. Since the fence_scsi_check script sets --power-timeout=5, each external command is allowed only 5 seconds to complete. If there is latency in accessing storage for any reason, 5 seconds may not be sufficient.

If one of these external commands times out after 5 seconds, run_command() fails. get_registration_keys() then exits with the generic error code of 1 (EC_GENERIC_ERROR in fencing.py). When the watchdog daemon receives this return code of 1 from the fence_scsi_check script, it initiates a graceful reboot.

# # # fencing.py # # #
def run_command(options, command, timeout=None, env=None, log_command=None):
    if timeout is None and "--power-timeout" in options:
        timeout = options["--power-timeout"]

    logging.info("Executing: %s\n", log_command or command)

    try:
        process = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env,
                # decodes newlines and in python3 also converts bytes to str
                universal_newlines=(sys.version_info[0] > 2))
    except OSError:
        fail_usage("Unable to run %s\n" % command)

    thread = threading.Thread(target=process.wait)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        process.kill()
        fail(EC_TIMED_OUT, stop=(int(options.get("retry", 0)) < 1))
# # # fence_scsi_check # # #
def get_registration_keys(options, dev):
        reset_dev(options,dev)
        keys = []
        cmd = options["--sg_persist-path"] + " -n -i -k -d " + dev
        out = run_cmd(options, cmd)
        if out["err"]:
                fail_usage("Cannot get registration keys")
        for line in out["out"].split("\n"):
                match = re.search(r"\s+0x(\S+)\s*", line)
                if match:
                        keys.append(match.group(1))
        return keys
# # # fencing.py # # #
EC_GENERIC_ERROR = 1
...
def fail_usage(message="", stop=True):
    if len(message) > 0:
        logging.error("%s\n", message)
    if stop:
        logging.error("Please use '-h' for usage\n")
        sys.exit(EC_GENERIC_ERROR)
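
The timeout path in run_command() can be reproduced in isolation. The sketch below is a simplified stand-in and not the library code: `run_with_timeout` is a hypothetical name, and a short Python child process takes the place of sg_persist or sg_turs so that the example runs without shared storage.

```python
import subprocess
import sys
import threading


def run_with_timeout(argv, timeout):
    """Run argv and wait at most 'timeout' seconds for it to finish.

    Uses the same thread-join pattern as the run_command() listing.
    Returns (timed_out, returncode); returncode is None on timeout.
    """
    process = subprocess.Popen(argv, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    waiter = threading.Thread(target=process.wait)
    waiter.start()
    waiter.join(timeout)
    if waiter.is_alive():
        # The command is still running after the timeout: kill it.
        process.kill()
        waiter.join()
        process.communicate()  # reap the child and close the pipes
        return True, None
    process.communicate()
    return False, process.returncode


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    fast = [sys.executable, "-c", "pass"]
    print(run_with_timeout(slow, 0.5))  # stands in for a slow storage query
    print(run_with_timeout(fast, 10))
```

In the watchdog script the timeout is the 5 seconds set through --power-timeout, so any storage query slower than that takes the timed-out branch.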

Diagnostic Steps

  1. Find messages like the following in /var/log/messages.

     Mar 28 10:28:08 node1 watchdog[22429]: test binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1
     Mar 28 10:28:08 node1 watchdog[22429]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1
     Mar 28 10:28:08 node1 watchdog[22429]: shutting down the system because of error 1
    
  2. Review the logs in /var/log/watchdog/, looking for a "Connection timed out" message.
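
Step 1 can be scripted when several log files need checking. This Python sketch assumes the log format shown in the sample messages. The function name and regular expression are illustrative, and matching fence_mpath_check script names is an assumption based on the fence_scsi naming.

```python
import re

# Matches watchdog lines that report a test or repair binary result.
# ASSUMPTION: fence_mpath scripts follow the same naming as fence_scsi.
PATTERN = re.compile(
    r"watchdog\[\d+\]: (test|repair) binary "
    r"(\S*fence_(?:scsi|mpath)_check\S*) returned (\d+)"
)


def find_watchdog_failures(lines):
    """Return (phase, script_path, return_code) for non-zero results."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group(3) != "0":
            hits.append((match.group(1), match.group(2), int(match.group(3))))
    return hits


SAMPLE = [
    "Mar 28 10:28:08 node1 watchdog[22429]: test binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1",
    "Mar 28 10:28:08 node1 watchdog[22429]: repair binary /etc/watchdog.d/fence_scsi_check_hardreboot returned 1",
    "Mar 28 10:28:08 node1 watchdog[22429]: shutting down the system because of error 1",
]

if __name__ == "__main__":
    for hit in find_watchdog_failures(SAMPLE):
        print(hit)
```

To scan a real log, pass an open file object for /var/log/messages to `find_watchdog_failures()` in place of `SAMPLE`.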

