Members of a cluster are showing different "Member status" in clustat output on RHEL 5 or RHEL 6
Environment
- Red Hat Cluster Suite 4+ and Red Hat GFS 6.1
- Red Hat Enterprise Linux Server 5 (with the High Availability and Resilient Storage Add Ons)
- Red Hat Enterprise Linux Server 6 (with the High Availability and Resilient Storage Add Ons)
- qdiskd (quorum disk on each cluster node)
Issue
- Cluster appears to be hung after a cluster node was evicted from the cluster.
- GFS1/GFS2 filesystems are blocked after a cluster node was evicted from the cluster.
- Members of the cluster are showing a different "Member Status" after a cluster node was evicted:
$ clustat
Cluster Status for GFS2 @ Tue Jul 30 11:07:04 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node1 1 Offline
node2 2 Online, Local
node3 3 Online
node4 4 Online
/dev/emcpowerh1 0 Online, Quorum Disk
$ clustat
Cluster Status for GFS2 @ Tue Jul 30 11:06:38 2013
Member Status: Inquorate
Member Name ID Status
------ ---- ---- ------
node1 1 Offline
node2 2 Online
node3 3 Online, Local
node4 4 Online
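The discrepancy can be spotted mechanically by collecting the clustat output from every node and comparing the Member Status line. A minimal sketch; the two samples below are embedded excerpts of the output above standing in for files captured on node2 and node3, and `status_of` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: extract and compare the "Member Status" line from saved clustat
# output. Any mismatch across nodes indicates the split seen in this issue.
status_of() {
    # $1: captured clustat output from one node
    printf '%s\n' "$1" | awk -F': ' '/^Member Status:/ {print $2}'
}

node2_out="Cluster Status for GFS2 @ Tue Jul 30 11:07:04 2013
Member Status: Quorate"
node3_out="Cluster Status for GFS2 @ Tue Jul 30 11:06:38 2013
Member Status: Inquorate"

s2=$(status_of "$node2_out")
s3=$(status_of "$node3_out")
if [ "$s2" != "$s3" ]; then
    echo "MISMATCH: node2=$s2 node3=$s3"
fi
```

In practice each sample would come from running clustat on the node in question (or over ssh) rather than from an embedded string.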
Resolution
If the cluster is configured for last man standing, then qdiskd must be running and presenting a quorum disk to each cluster node. A node in a cluster configured for last man standing should never participate in the cluster without presenting the quorum disk.
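Each node can be checked for a running qdiskd process. A minimal sketch, assuming the process list is passed in as text (the output of `ps -e -o comm=` captured on the node under test) so the check itself runs anywhere; `check_qdiskd` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: warn when qdiskd is missing from a node's process list.
check_qdiskd() {
    # $1: captured output of `ps -e -o comm=` from the node under test
    if printf '%s\n' "$1" | grep -qx 'qdiskd'; then
        echo "qdiskd running"
    else
        echo "WARNING: qdiskd not running"
    fi
}

# Check the local node:
check_qdiskd "$(ps -e -o comm=)"
```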
With the release of the errata RHBA-2014-1211, which contains the package cman-2.0.115-124.el5 for the affected channels RHEL Desktop Workstation (v.5 client) and RHEL (v.5 server), a warning is now logged if one or more of the cluster nodes does not have a quorum disk loaded when one is required.
Root Cause
This cluster was configured for last man standing, which requires that qdiskd is running on each cluster node. One of the cluster nodes did not have qdiskd running, so it was not presenting a quorum disk that would have given it enough votes to survive the loss of a cluster node. When a cluster node was evicted, the node not running qdiskd lost quorum. The other surviving cluster nodes retained quorum since qdiskd was running on them.
The lack of quorum on one of the cluster nodes prevented that node from continuing the groupd (cpg) negotiations, which means the cluster will remain in the cpg transition state FAIL_ALL_STOPPED until quorum is regained on all cluster nodes in the cluster.
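The vote arithmetic behind this can be sketched for a 4-node cluster. The vote counts below are assumptions for illustration (a common last-man-standing layout gives the quorum disk N-1 votes), not values taken from this cluster's configuration:

```shell
#!/bin/sh
# Worked example (assumed values): 4 nodes with 1 vote each plus a
# quorum disk carrying N-1 = 3 votes.
nodes=4; node_votes=1
qdisk_votes=$((nodes - 1))                       # 3
expected=$((nodes * node_votes + qdisk_votes))   # 7
quorum=$((expected / 2 + 1))                     # 4

# Each node's view after one node is evicted:
with_qdisk=$(( (nodes - 1) * node_votes + qdisk_votes ))  # 6 >= 4 -> quorate
without_qdisk=$(( (nodes - 1) * node_votes ))             # 3 <  4 -> inquorate
echo "quorum=$quorum with_qdisk=$with_qdisk without_qdisk=$without_qdisk"
```

A node presenting the quorum disk still sees 6 of the required 4 votes and stays quorate; the node without qdiskd sees only 3 and goes inquorate, producing the split clustat output shown in the Issue section.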
All the surviving cluster nodes will show the following:
# group_tool -v
type level name id state node id local_done
fence 0 default 00010002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm 1 rgmanager 00030002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm 1 myGFS2vol1 00040001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
gfs 2 myGFS2vol1 00030001 FAIL_ALL_STOPPED 1 100030003 -1
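Stuck groups can be flagged mechanically from saved `group_tool -v` output. A minimal sketch; the sample below is an abbreviated excerpt standing in for a captured file:

```shell
#!/bin/sh
# Sketch: list groups stuck in FAIL_ALL_STOPPED from saved `group_tool -v`
# output (embedded here as a sample; in practice read the captured file).
sample="fence 0 default 00010002 FAIL_ALL_STOPPED 1 100030003 -1
dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 1 100030003 -1
dlm 1 rgmanager 00030002 none"

# Print the type and name of every group whose state column is stuck:
printf '%s\n' "$sample" | awk '$5 == "FAIL_ALL_STOPPED" {print $1, $3}'
```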
Clustered services will not be relocated (rgmanager will block), clvmd will not respond (causing lvm commands to hang), and GFS1/GFS2 will start to block because no new locks are handed out. It is also possible that the cluster node that was evicted will not be fenced.
The cluster will stay in this state until:
- The evicted cluster node has been fenced or rebooted and then rejoins the cluster.
- The qdiskd service is started on any cluster node where the daemon was not running.
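On RHEL 5/6 the daemon is started and enabled at boot with the usual sysvinit commands. A guarded sketch so the script is a no-op on a host without the cluster packages; `ensure_qdiskd` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: start qdiskd now and enable it at boot (RHEL 5/6 sysvinit).
ensure_qdiskd() {
    if [ -x /etc/init.d/qdiskd ]; then
        service qdiskd start && chkconfig qdiskd on
    else
        echo "qdiskd init script not present on this host"
    fi
}
ensure_qdiskd
```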
Diagnostic Steps
- Capture the clustat output from each cluster node and compare the Member Status value when the event occurs.
- Review the ps output to see if qdiskd is running on any cluster node not showing a quorum disk in the clustat output.
- Review the /var/log/messages file to see if qdiskd failed to start.
- Capture the group_tool -v output from all the cluster nodes.
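The steps above can be wrapped in a small collection script run on each node; the output directory and file names are illustrative, and tools that are not installed are recorded and skipped so the script is safe to run on any host:

```shell
#!/bin/sh
# Sketch: gather the diagnostic data listed above into one directory.
outdir=${1:-${TMPDIR:-/tmp}/cluster-diag}
mkdir -p "$outdir"

for cmd in 'clustat' 'group_tool -v' 'cman_tool status'; do
    tool=${cmd%% *}
    if command -v "$tool" >/dev/null 2>&1; then
        $cmd > "$outdir/$tool.out" 2>&1
    else
        echo "$tool" >> "$outdir/skipped.txt"   # record missing tools
    fi
done

# Is qdiskd running on this node?
ps -e -o comm= | grep -x qdiskd > "$outdir/qdiskd.ps" || :
# Recent syslog entries (qdiskd start failures would appear here):
tail -n 200 /var/log/messages > "$outdir/messages.tail" 2>/dev/null || :
echo "collected into $outdir"
```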
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.