Members of a cluster are showing different "Member status" in clustat output on RHEL 5 or RHEL 6

Environment

  • Red Hat Cluster Suite 4+ and Red Hat GFS 6.1
  • Red Hat Enterprise Linux Server 5 (with the High Availability and Resilient Storage Add Ons)
  • Red Hat Enterprise Linux Server 6 (with the High Availability and Resilient Storage Add Ons)
  • qdiskd (quorum disk on each cluster node)

Issue

  • Cluster appears to be hung after a cluster node was evicted from the cluster.
  • GFS1/GFS2 filesystems are blocked after a cluster node was evicted from the cluster.
  • Members of a cluster are showing different "Member status" after a cluster node was evicted:
$ clustat 
Cluster Status for GFS2 @ Tue Jul 30 11:07:04 2013
Member Status: Quorate

 Member Name      ID   Status
 ------ ----      ---- ------
 node1            1    Offline
 node2            2    Online, Local
 node3            3    Online
 node4            4    Online
 /dev/emcpowerh1  0    Online, Quorum Disk


$ clustat
Cluster Status for GFS2 @ Tue Jul 30 11:06:38 2013
Member Status: Inquorate

 Member Name      ID   Status
 ------ ----      ---- ------
 node1            1    Offline
 node2            2    Online
 node3            3    Online, Local
 node4            4    Online

Resolution

If the cluster is configured for last man standing, then qdiskd must be running on, and presenting a quorum disk to, each cluster node. A node in such a cluster should never join the cluster without presenting its quorum disk.
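For reference, a last-man-standing configuration typically registers the quorum device in /etc/cluster/cluster.conf with a vote count equal to the number of nodes minus one. The fragment below is an illustrative sketch only — the interval, tko, label, and heuristic values are assumptions, not taken from this cluster:

```
<!-- Illustrative only: a quorumd entry carrying nodes-1 votes (here 3,
     for a 4-node cluster) so that a single surviving node plus the
     quorum disk remains quorate. -->
<quorumd interval="1" tko="10" votes="3" label="myqdisk">
    <heuristic program="ping -c1 -t1 192.168.0.1" score="1" interval="2" tko="3"/>
</quorumd>
```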

With the release of the errata RHBA-2014-1211, which contains the package cman-2.0.115-124.el5 for the affected channels RHEL Desktop Workstation (v.5 client) and RHEL (v.5 server), a warning is now logged if one or more of the cluster nodes does not have a quorum disk loaded when one is required.

Root Cause

This cluster was configured for last man standing, which requires that qdiskd is running on each cluster node. One of the cluster nodes did not have qdiskd running, so it was not presenting a quorum disk that would have given it enough votes to survive the loss of a cluster node. When the cluster node was evicted, the node not running qdiskd lost quorum. The other surviving cluster nodes retained quorum since qdiskd was running on them.
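The vote arithmetic makes this concrete. Assuming 1 vote per node and a quorum disk carrying nodes-1 votes (the usual last-man-standing sizing; the numbers below are an illustration for this 4-node cluster, not values read from its configuration):

```shell
# Illustrative quorum arithmetic for a 4-node last-man-standing cluster
# (assumed: 1 vote per node, quorum disk carries nodes-1 votes).
nodes=4
qdisk_votes=$((nodes - 1))          # e.g. 3 votes on the quorum disk
expected=$((nodes + qdisk_votes))   # 7 expected votes
quorum=$((expected / 2 + 1))        # quorum = 4
echo "a lone node plus the qdisk holds $((1 + qdisk_votes)) of $quorum votes needed"
```

A lone survivor presenting the quorum disk holds 1 + 3 = 4 votes and stays quorate; without the quorum disk it holds only 1 vote and goes inquorate — exactly the split seen in the two clustat captures above.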

The lack of quorum on one of the cluster nodes prevented that node from continuing the groupd (cpg) negotiations. As a result, the cluster remains in the cpg transition state FAIL_ALL_STOPPED until quorum is regained on all cluster nodes in the cluster.

All the surviving cluster nodes will show the following:

# group_tool -v
type             level name        id       state node id local_done
fence            0     default     00010002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     clvmd       00020002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     rgmanager   00030002 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     myGFS2vol1  00040001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
gfs              2     myGFS2vol1  00030001 FAIL_ALL_STOPPED 1 100030003 -1

Clustered services will not be relocated (rgmanager will block), clvmd will not respond, causing lvm commands to hang, and GFS1/GFS2 will start to block because no new locks are handed out. It is also possible that the cluster node that was evicted will not be fenced.
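To spot this condition quickly, the group_tool output can be scanned for groups stuck in FAIL_ALL_STOPPED. The helper below is a hypothetical sketch — the function name and the awk field position are assumptions based on the column layout shown above:

```shell
# Hypothetical helper: print the name (field 3) of every group whose
# cpg state column (field 5 in "group_tool -v" output) reads
# FAIL_ALL_STOPPED; exit non-zero when no group is stuck.
stuck_groups() {
    awk '$5 == "FAIL_ALL_STOPPED" { print $3; found = 1 }
         END { exit !found }'
}

# Usage on a live node (illustrative): group_tool -v | stuck_groups
```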

The cluster will stay in this state until:

  • The evicted cluster node has been fenced or rebooted and then rejoins the cluster.
  • The qdiskd service is started on any cluster node where the daemon was not running.

Diagnostic Steps

  • Capture the clustat output and compare the Member Status value when the event occurs.
  • Review the ps output to see if qdiskd is running on any cluster node not showing a quorum disk in clustat output.
  • Review the /var/log/messages file to see if qdiskd failed to start.
  • Capture the group_tool -v output from all the cluster nodes.
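When comparing clustat captures from multiple nodes, it helps to reduce each one to its "Member Status" value. A minimal sketch, assuming the captures are saved to files (the function name is hypothetical; it simply extracts the line shown in the Issue section above):

```shell
# Hypothetical helper: pull the overall quorum state out of saved
# clustat output so captures from different nodes can be diffed.
member_status() {
    awk -F': ' '/^Member Status:/ { print $2 }'
}

# Example (illustrative): member_status < clustat-node3.txt
# prints either Quorate or Inquorate for that node's capture.
```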

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.