"Connected to corosync but quorum using qdevice is distrusted for SBD as qdevice-sync_timeout (30s) > watchdog-timeout (5s)" message occurs.


Environment

  • Red Hat Enterprise Linux 8 with High-Availability Add-on
  • qdevice
  • sbd

Issue

  • The following warning message appears in the system log:

    Sep 20 10:35:09 node01 sbd[2343]:   cluster:  warning: set_servant_health: Connected to corosync but quorum using qdevice is distrusted for SBD as qdevice-sync_timeout (30s) > watchdog-timeout (5s).
    
  • Cluster nodes using qdevice are shut down when the sbd disk is disconnected, even though the cluster partition is quorate and the pacemaker service is running normally.

    Sep 20 10:47:03 node01 sbd[2331]: warning: inquisitor_child: Servant /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0 is outdated (age: 4)
    Sep 20 10:47:03 node01 sbd[2341]: /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0:    error: header_get: Unable to read header from device 4
    Sep 20 10:47:03 node01 sbd[2341]: /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0:    error: servant_md: No longer found a valid header on /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0
    Sep 20 10:47:09 node01 sbd[2331]: warning: inquisitor_child: Latency: No liveness for 4s exceeds watchdog warning timeout of 3s (healthy servants: 0)
    Sep 20 10:47:09 node01 sbd[2331]: warning: inquisitor_child: Latency: No liveness for 4s exceeds watchdog warning timeout of 3s (healthy servants: 0)
    

    However, according to the following excerpt from the sbd(8) man page, sbd should not self-fence in this case:

           If the Pacemaker integration is activated, "sbd" will not self-fence if device majority is lost, if:
    
           1.  The partition the node is in is still quorate according to the CIB;
    
           2.  it is still quorate according to Corosync's node count;
    
           3.  the node itself is considered online and healthy by Pacemaker.
    

Resolution

Set qdevice-sync_timeout to a value less than SBD_WATCHDOG_TIMEOUT. Note the differing units: sync_timeout in corosync.conf is specified in milliseconds, while SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd is specified in seconds.
For example:

$ grep SBD_WATCHDOG_TIMEOUT /etc/sysconfig/sbd 
SBD_WATCHDOG_TIMEOUT=5

$ cat /etc/corosync/corosync.conf
...
quorum {
    provider: corosync_votequorum

    device {
        model: net
        votes: 1
        sync_timeout: 3000                           <---

        net {
            algorithm: ffsplit
            host: qdevice
        }
    }
}
...
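If the cluster is managed with pcs, the same change can be made without editing corosync.conf by hand. This is a sketch only; it assumes a qdevice of model net is already configured, and the exact option syntax should be verified against `pcs quorum device update --help` on your release:

```shell
# Sketch (verify syntax on your release): set the generic qdevice option
# sync_timeout to 3000 ms, below the 5 s SBD_WATCHDOG_TIMEOUT in the example.
pcs quorum device update sync_timeout=3000

# Check the resulting quorum device configuration.
pcs quorum config
```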

Root Cause

If the system is configured to use a quorum device and the value of SBD_WATCHDOG_TIMEOUT is less than the value of qdevice-sync_timeout, a quorum state update could be delayed for longer than the watchdog timeout, which could result in a split-brain situation. To prevent this, sbd distrusts quorum information obtained via qdevice (as the warning message above states) and may self-fence when the sbd disk is lost, even though the partition is quorate. Please also see Design Guidance for RHEL High Availability Clusters - sbd Considerations.
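The constraint can be checked with simple arithmetic, keeping the differing units in mind (sync_timeout is in milliseconds, SBD_WATCHDOG_TIMEOUT in seconds). A minimal sketch using the example values from this article:

```shell
# Example values from this article; on a live system they would be read
# from /etc/sysconfig/sbd and /etc/corosync/corosync.conf.
SBD_WATCHDOG_TIMEOUT=5        # seconds
QDEVICE_SYNC_TIMEOUT=3000     # milliseconds

if [ "$QDEVICE_SYNC_TIMEOUT" -lt $((SBD_WATCHDOG_TIMEOUT * 1000)) ]; then
    echo "OK: sync_timeout (${QDEVICE_SYNC_TIMEOUT}ms) is below the watchdog timeout ($((SBD_WATCHDOG_TIMEOUT * 1000))ms)"
else
    echo "WARNING: quorum via qdevice will be distrusted by sbd"
fi
```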


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.