"Connected to corosync but quorum using qdevice is distrusted for SBD as qdevice-sync_timeout (30s) > watchdog-timeout (5s)" message occurs.
Environment
- Red Hat Enterprise Linux 8 with High-Availability Add-on
- qdevice
- sbd
Issue
The following message occurs:
Sep 20 10:35:09 node01 sbd[2343]: cluster: warning: set_servant_health: Connected to corosync but quorum using qdevice is distrusted for SBD as qdevice-sync_timeout (30s) > watchdog-timeout (5s)
Cluster nodes using qdevice are shut down when the sbd disk is disconnected, even though the cluster partition is quorate and the pacemaker service is running normally.
Sep 20 10:47:03 node01 sbd[2331]: warning: inquisitor_child: Servant /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0 is outdated (age: 4)
Sep 20 10:47:03 node01 sbd[2341]: /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0: error: header_get: Unable to read header from device 4
Sep 20 10:47:03 node01 sbd[2341]: /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0: error: servant_md: No longer found a valid header on /dev/disk/by-path/ip-10.10.10.13:3260-iscsi-iqn.2022-09.com.test:lun0
Sep 20 10:47:09 node01 sbd[2331]: warning: inquisitor_child: Latency: No liveness for 4s exceeds watchdog warning timeout of 3s (healthy servants: 0)
However, according to the sbd man page, sbd will not self-fence in this case:
If the Pacemaker integration is activated, "sbd" will not self-fence if device majority is lost, if:
1. The partition the node is in is still quorate according to the CIB;
2. it is still quorate according to Corosync's node count;
3. the node itself is considered online and healthy by Pacemaker.
Resolution
Set qdevice-sync_timeout to a value less than SBD_WATCHDOG_TIMEOUT. Note that SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd is specified in seconds, while sync_timeout in /etc/corosync/corosync.conf is specified in milliseconds.
For example, with SBD_WATCHDOG_TIMEOUT=5 (5 seconds), set sync_timeout to 3000 (3 seconds):
$ grep SBD_WATCHDOG_TIMEOUT /etc/sysconfig/sbd
SBD_WATCHDOG_TIMEOUT=5
$ cat /etc/corosync/corosync.conf
...
quorum {
    provider: corosync_votequorum

    device {
        model: net
        votes: 1
        sync_timeout: 3000    <-- 3000 ms (3s), less than SBD_WATCHDOG_TIMEOUT (5s)

        net {
            algorithm: ffsplit
            host: qdevice
        }
    }
}
...
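On a cluster managed with pcs, the qdevice option can also be changed without editing /etc/corosync/corosync.conf by hand. The invocation below is an assumption based on the generic `pcs quorum device update` syntax (sync_timeout is a generic qdevice option, given in milliseconds); verify it against `pcs quorum device update --help` on your release before using it:

```shell
# Assumed pcs invocation: set the qdevice sync_timeout to 3000ms (3s),
# below the SBD_WATCHDOG_TIMEOUT of 5s used in this article.
pcs quorum device update sync_timeout=3000
```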
Root Cause
If you have configured your system to run a quorum device and the value of SBD_WATCHDOG_TIMEOUT is less than the value of qdevice-sync_timeout, a quorum state update could be delayed longer than the watchdog timeout allows, which could result in a split-brain situation. To avoid that, sbd distrusts the quorum information from the quorum device and could self-fence when the sbd device majority is lost. Please also see Design Guidance for RHEL High Availability Clusters - sbd Considerations.
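The timeout relationship described above can be expressed as a short shell sketch. The function name and output format are illustrative, not part of sbd or corosync; it only compares the two values with their differing units (seconds for SBD_WATCHDOG_TIMEOUT, milliseconds for sync_timeout):

```shell
#!/bin/sh
# Illustrative check: compare SBD_WATCHDOG_TIMEOUT (seconds) against the
# qdevice sync_timeout (milliseconds). Returns 0 when the configuration
# is safe for SBD (sync_timeout < watchdog-timeout), 1 otherwise.
check_qdevice_sbd_timeouts() {
    watchdog_s=$1   # SBD_WATCHDOG_TIMEOUT from /etc/sysconfig/sbd, in seconds
    sync_ms=$2      # sync_timeout from the device {} block, in milliseconds
    if [ "$sync_ms" -lt "$((watchdog_s * 1000))" ]; then
        echo "OK: sync_timeout ${sync_ms}ms < watchdog-timeout ${watchdog_s}s"
        return 0
    else
        echo "WARNING: sync_timeout ${sync_ms}ms >= watchdog-timeout ${watchdog_s}s"
        return 1
    fi
}

# watchdog-timeout 5s vs. the default qdevice sync_timeout of 30s (30000ms)
# -- the unsafe combination reported in the Issue section:
check_qdevice_sbd_timeouts 5 30000 || true   # prints WARNING
# The fixed configuration from the Resolution section:
check_qdevice_sbd_timeouts 5 3000            # prints OK
```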
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.