All nodes in a RHEL 5 or 6 cluster were powered off or fenced when using a quorum disk and a "last man standing" configuration

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • Cluster utilizing a quorum device (<quorumd> in /etc/cluster/cluster.conf)

Issue

  • During routine operation, we encountered a situation where several cluster nodes were fenced, and ultimately all of the nodes were shut down by the clustering software.
  • Two nodes in a 3-node cluster can race to fence each other while the third node is down.
  • In a last-man standing quorum-disk configuration, one node was fenced by two different nodes simultaneously, resulting in that node never powering back on.
  Mar 12 05:49:34 node1 fenced[11075]: fencing node "node3.example.com"
  Mar 12 05:49:39 node2 fenced[11044]: fencing node "node3.example.com"

Resolution

  • Consider removing the last-man-standing configuration and using a standard quorum model.

  • If keeping the last-man-standing configuration, ensure the heuristics are designed so that, in the event of a loss of communication between nodes, only one node can score high enough to retain the votes from the quorum device. A common approach is to ping the gateway on the cluster interconnect as the only heuristic. See this solution for further guidance on how and when to implement a heuristic that resolves membership splits following network connectivity problems.
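As a sketch of such a heuristic, a single ping against the interconnect gateway could be configured in the `<quorumd>` section of `/etc/cluster/cluster.conf`. The device label, gateway IP address, and timing values below are illustrative assumptions, not values from this environment:

```xml
<!-- Sketch only: label, gateway IP, and interval/tko values are assumptions -->
<quorumd label="qdisk" votes="2" interval="1" tko="10">
    <!-- A node keeps a passing score (and thus the quorum-disk votes)
         only while it can still reach the interconnect gateway -->
    <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2" tko="3"/>
</quorumd>
```

With a single heuristic, only nodes that can still reach the gateway retain the quorum device's votes, so a node cut off from the interconnect cannot remain quorate on its own.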

Root Cause

  • Normally in three-node clusters, problems such as fence races and split brain are not a concern because of the odd vote count. If all nodes lose contact with each other, each holds only 1 of 3 votes, so none has quorum: none can fence the others, and there is no risk of multiple nodes accessing shared resources because all are inquorate. For fencing or takeover of services to occur, two nodes must be in contact with each other, so there is no chance of multiple partitions of the cluster operating independently.

  • When a quorum device is configured in a 3-node cluster in a "standard" way (1 vote for each node, 1 vote for the quorum device, giving a total expected of 4 and a quorum of 3), the same dynamic holds. If all nodes lose contact, each node can have at most 2 votes (its own plus the quorum device's), still not enough for quorum. If any 2 nodes are in contact with each other they have quorum, and since they hold a majority the remaining node is unable to operate independently.

  • However, in a 3-node cluster in a "last man standing" configuration (1 vote for each node, 2 votes for the quorum device, giving a total expected of 5 and a quorum of 3), a single node can maintain quorum by itself with just the quorum device. If all the nodes lose communication with each other, and the heuristics are NOT designed so that only ONE node at a time can pass them during such a split, then all 3 nodes may score high enough to keep the quorum device's votes, and thus all 3 nodes maintain quorum. Quorate, the three nodes can all attempt to fence each other simultaneously, potentially powering each other off. Odd situations are also possible: if one node is already down and the others lose contact with each other, both surviving nodes may fence the missing node simultaneously, leaving it in an unexpected state (such as powered off).
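The last-man-standing vote layout described above corresponds roughly to a `cluster.conf` fragment like the following. The node names are taken from the log excerpt in the Issue section; the quorum-disk label is an assumed placeholder:

```xml
<!-- 3 nodes x 1 vote + quorum disk's 2 votes = 5 expected votes, quorum = 3 -->
<cman expected_votes="5"/>
<clusternodes>
    <clusternode name="node1.example.com" nodeid="1" votes="1"/>
    <clusternode name="node2.example.com" nodeid="2" votes="1"/>
    <clusternode name="node3.example.com" nodeid="3" votes="1"/>
</clusternodes>
<!-- "qdisk" label is an assumption for this sketch -->
<quorumd label="qdisk" votes="2"/>
```

Note the arithmetic: a single node (1 vote) plus the quorum disk (2 votes) reaches the quorum threshold of 3, so any node whose heuristics pass can remain quorate entirely on its own, which is what makes the simultaneous-fencing scenario possible.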


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.