Exploring Concepts of RHEL High Availability Clusters - Quorum


Overview

Applicable Environments

  • Red Hat Enterprise Linux (RHEL) 6, 7, 8, or 9 with the High Availability Add-On

Concept Overview

What is quorum?

Quorum is the condition of whether a node has authority to carry out the functions of a cluster - to manage resources, to participate in recovery operations, to make or execute fencing/STONITH decisions, and more.


Why is quorum necessary?

Nodes may lose contact with each other and need to be able to make decisions independently, without knowing what state a missing member may be in. The High Availability components running on each node of a cluster must have some policy or mechanism by which they can determine whether this system is a member with authority, or whether it has lost its authority. A cluster made up of systems that cannot communicate with each other would be at risk of corrupting shared resources or carrying out conflicting actions.

Quorum is the policy that each node evaluates to determine if it is safe to continue carrying out the functions of the cluster.


What is a membership partition?

When members of a cluster lose contact with each other, the groups they form are referred to as partitions. This grouping of members is useful for conveying how quorum decisions would be made.

For example: if a four node cluster has a network disruption that leaves nodeA and nodeB in contact with each other, nodeC and nodeD in contact with each other - but the two groups cannot communicate between themselves - each of those sets could be referred to as a partition.

In that example, nodeA and nodeB would both see the same membership state, and thus should reach the same conclusion about their quorum situation. The same goes for nodeC and nodeD - each is processing the same conditions, and thus should reach the same decision (ignoring the possibility that some additional quorum-influencing mechanism differentiates the conditions amongst those members).


What does it mean for a cluster to be in a "split-brain" state?

The term "split-brain" generally describes a problematic condition that high availability cluster solutions must avoid or address when members of such a cluster lose contact with each other. When cluster members are unable to communicate, then they may be at risk of performing conflicting activities because they cannot coordinate with each other.

If the membership of a high availability cluster were to split into distinct sets of nodes that could not communicate with each other, and each of those "partitions" considered itself qualified to continue managing shared resources - they would be in a split-brain state. A possible result of that split-brain state could be multiple systems mounting a non-cluster-aware filesystem, thereby corrupting it.

The RHEL High Availability cluster software is designed to avoid split-brain conditions through two mechanisms:

  • The quorum policy should result in only one partition of a cluster membership having authority to carry out tasks in the cluster.
  • The cluster implementation should require that fencing/STONITH of missing members be completed before cluster activity can resume - thereby ensuring only one partition has access to shared resources.

What conditions can be configured to influence quorum decisions?

  • Majority wins: Agreement amongst a majority of nodes about which members are responsive. This is the typical policy of a cluster unless otherwise configured.

  • Two-node cluster: The majority-wins policy doesn't make much sense in a cluster with two nodes. A single member would not have a majority, meaning a failure of a single node would leave the cluster unable to function - not a very useful design for a system that is meant to function through member-failures. Two-node clusters can have their quorum-policy relaxed to allow a single member to maintain authority in the cluster.

    • NOTE: This special two-node policy introduces a risk of a split-brain condition, since both members of a two-node cluster technically could consider themselves quorate. An additional quorum-influencing condition - a.k.a. a quorum arbitration method - should be built into the cluster's policy to ensure membership-splits result in only one member considering itself an authority in the cluster.
  • Tie-breakers: Even-sized clusters - those with 2, 4, 6, or more nodes - are susceptible to having their membership split into equal-sized partitions, meaning neither one has a majority. In a simple majority-wins system, all members of the cluster would be without cluster-authority, and thus the entire cluster would stop serving its duties. A tie-breaker is some additional method by which the members of a partition can decide whether they should consider themselves quorate. Such tie-breaker methods include:

    • Testing connectivity to some external host - example: quorum arbitration with corosync-qdevice and a corosync-qnetd server in RHEL 7
    • Following a policy that one member is the deciding "vote" - example: corosync's auto_tie_breaker in RHEL 7
    • Running some user-defined script to test an arbitrary condition - example: qdiskd's heuristics in RHEL 6.
  • Last-man standing: Such a policy dictates that a partition with fewer members than a majority should be able to maintain authority within the cluster.

    • Example: corosync's last_man_standing policy in RHEL 7 recalculates the total number of votes required for quorum after membership changes. This can lower the quorum requirement in a cascading fashion, requiring fewer and fewer members to be present to continue operating. Ultimately this can allow a single node to carry out the functions of an entire cluster by itself - which could be useful to avoid a complete loss of service.
    • Example: qdiskd in RHEL 6 can be configured to allow a single node to survive with quorum if it still has access to a storage device over which nodes can communicate with each other.

What are votes?

Votes are the numerical "weight" of each node used in the process of calculating quorum.

Quorum decisions are - at the most basic level - determined on the basis of how many nodes are in agreement about their membership list. In practice, however, the quorum policies of RHEL High Availability clusters are calculated from the number of "votes" assigned to each member, plus any other quorum-influencing factors that are in effect. This allows for more complex systems that may weight some nodes more heavily in quorum decisions.

Typical High Availability clusters have one vote assigned per node. This is the configuration Red Hat recommends for almost all scenarios - except for rare deployments based around complex use cases. An uneven distribution of votes across cluster nodes allows the cluster to consider some nodes more or less important than others - which could be useful if the servers have unequal processing power, for example.

When applying any quorum policy - majority wins, two-node, tie-breakers, etc - the calculations are really derived from how many votes are counted from active members and the votes from any quorum-influencing methods that are in effect.
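As a rough sketch - assuming one vote per node and ignoring any arbitration features - a majority-wins calculation over votes can be illustrated as follows (hypothetical code, not the actual votequorum implementation):

```python
# Hypothetical sketch of a majority-wins quorum calculation over
# per-node votes; the real votequorum logic is more involved.
def votes_needed_for_quorum(expected_votes):
    # A strict majority: more than half of the expected votes.
    return expected_votes // 2 + 1

def partition_is_quorate(partition_votes, expected_votes):
    # partition_votes: the vote values of the members this partition can see.
    return sum(partition_votes) >= votes_needed_for_quorum(expected_votes)

# Four one-vote nodes: a 2-node partition is not quorate,
# but a 3-node partition is.
print(partition_is_quorate([1, 1], 4))      # False
print(partition_is_quorate([1, 1, 1], 4))   # True
```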

See the Quorum Example Scenarios section below for examples of how votes are associated with members.


What is quorum arbitration?

Quorum arbitration refers to some method by which the quorum policy is influenced beyond a simple majority-wins policy. Having a cluster decide the fate of its members and its functionality based on a simple majority alone could leave the cluster either completely unable to serve its duties, or distributing those duties in a suboptimal way.

If a cluster has an even number of members - 2, 4, 6, etc - then it can split into equal partitions, neither of which has a majority. If the cluster doesn't have some additional factor to decide which partition has authority, then all nodes may be left non-functional.

As noted earlier - tie-breakers may influence the quorum policy. These are quorum arbitration methods.

The method to influence the quorum policy could be as simple as "a partition which contains member X shall have authority, if two partitions are otherwise equal" (corosync's auto_tie_breaker); or the cluster could use some system to test which partition is healthier and thus better suited to continue functioning (corosync-qdevice).


How does pacemaker_remote on a "remote node" tie in with quorum?

pacemaker_remote is a component that can run on a system and enable the "remote node" to participate in resource-management activities of a cluster without having any authority itself.

A remote-node can be tasked with running workloads and providing high availability of resources, but does not factor into quorum decisions. That member doesn't participate in membership protocols, and isn't counted in quorum calculations.

In relation to quorum, remote-nodes can be ignored.


Components Relevant to Quorum

corosync

In both RHEL 6 and 7, corosync is the main engine of the cluster: it facilitates communication between nodes for the purposes of establishing a membership and monitoring which nodes are responsive. This corosync engine runs on each full member of a cluster and attempts to communicate with corosync on every other active member and stay in contact with it. The members participate in a communication protocol to establish an initial membership list - a.k.a. a "ring" - and then pass ongoing heartbeat tokens between them to maintain awareness of which ones are still responsive.

The membership lists established by corosync become the basis for quorum decisions. If corosync establishes that some set of members are in contact, then those members' votes would be counted in subsequent quorum calculations.


corosync's votequorum service in RHEL 7

corosync comes bundled with an internal service in RHEL 7 called votequorum. This is essentially an internal component that implements the core quorum policies of a RHEL 7 cluster.

votequorum offers settings to dictate how quorum is calculated - two_node, auto_tie_breaker, last_man_standing, and similar configuration settings influence quorum calculations.

votequorum's core quorum policy is a majority-wins system. In the most basic configuration, the number of votes required for a majority is calculated from the total number of nodes seen (either in the configuration or through member communication). In more advanced configurations, some additional arbitration method may be registered to offer additional votes.
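As an illustration, these settings live in the quorum section of corosync.conf. The snippet below is a hypothetical sketch - see the votequorum(5) man page for the authoritative option list:

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    # Alternatives for larger even-sized clusters:
    # auto_tie_breaker: 1
    # last_man_standing: 1
    # last_man_standing_window: 10000
}
```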


cman in RHEL 6

corosync provides the membership and communication engine in RHEL 6 clusters, as it does with RHEL 7. But in RHEL 6, cman is essentially the middleman between corosync and other High Availability components.

cman consumes membership information produced by corosync and applies its quorum policies and settings to determine which nodes have authority. Other components like pacemaker or rgmanager may then connect to cman and consume its membership and/or quorum information.


corosync-qdevice and corosync-qnetd in RHEL 7

qdevice and qnetd are optional components of a RHEL 7 cluster that can coordinate to serve as a quorum-arbitration method.

qnetd is a server application that runs on one or more systems that are not members of a High Availability cluster.

qdevice is the client application which can be run on all nodes of a High Availability cluster, contacting the qnetd server and using that connection to make further decisions influencing the local system's quorum situation.

qdevice and qnetd coordinate to implement a chosen algorithm in deciding which nodes should be considered as having authority granted by quorum. These decisions are fundamentally based on the idea that connectivity to an external server - the qnetd server - can serve as a useful qualifier of whether a node is healthy. If some nodes are out of contact with the external server and some nodes are in contact, those nodes in contact would seem to be more capable of serving the duties of the cluster. If all nodes are equally connected to the external server but aren't in contact with each other, then the qdevice algorithms can make a definitive decision about which members should have quorum.


cman qdiskd in RHEL 6

cman provides a storage-device based quorum-arbitration method in RHEL 6 called qdiskd. This daemon is run from each node of the cluster, utilizing a block-storage device to communicate with other members and assess which are responsive. It can also execute arbitrary user-defined "heuristics" - shell commands or scripts - that function as a scoring system allowing nodes to distinguish their statuses from each other in membership-split scenarios.

The qdiskd storage-device-communication and heuristics mechanisms provide additional influence on the quorum status of each node. Access to the quorum device can add votes to a given node's quorum calculation - setting it apart from other nodes without access. Heuristics can cause a node to reboot itself - leaving the nodes that succeeded in their heuristics as the authoritative members.
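As a hypothetical illustration, qdiskd is configured in the RHEL 6 cluster.conf file. The label, timings, and heuristic command below are made-up example values - see the qdisk(5) man page for the actual attribute meanings:

```xml
<!-- Hypothetical cluster.conf excerpt: qdiskd with one heuristic -->
<quorumd interval="1" tko="10" votes="1" label="myqdisk">
    <heuristic program="ping -c1 -w1 192.168.0.1" score="1" interval="2" tko="3"/>
</quorumd>
```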


Quorum Example Scenarios

These scenarios involve one-vote-per-node unless otherwise stated:

Two nodes, two_node enabled: Two nodes are in the membership - Total expected votes: 2; votes needed for quorum: 1

Under the normal majority-wins system, you would typically need two votes - meaning such a cluster could not withstand a single node loss. For this cluster to be useful, this special two_node mode alters the quorum policy so that a single node can maintain quorum with only its own vote.

If nodeA and nodeB are members together - each would see itself with two total votes. When nodeB suddenly crashes and stops responding to nodeA, nodeA is left with 1 vote - enough for quorum.

If nodeA and nodeB lose contact with each other but are both still alive, then both can make that same decision - resulting in both having quorum.
This is the reason that two_node clusters require a further quorum arbitration method, to ensure that only one partition ends up quorate.
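A toy sketch of this relaxed policy (hypothetical code, not the real implementation) shows why both halves of a split can each believe they are quorate:

```python
# Hypothetical sketch: with two_node enabled, a single surviving vote
# is treated as sufficient, instead of the usual strict majority.
def quorate(my_partition_votes, expected_votes, two_node=False):
    if two_node and expected_votes == 2:
        return my_partition_votes >= 1
    return my_partition_votes > expected_votes // 2

# nodeB crashes: nodeA keeps quorum with just its own vote.
print(quorate(1, 2, two_node=True))   # True
# On a network split, BOTH nodes evaluate this same condition and both
# conclude they are quorate - hence the need for fencing/arbitration.
```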


Three nodes, majority wins: Three nodes in the membership - Total expected votes: 3; votes needed for quorum: 2

If nodeA, nodeB, and nodeC are communicating with each other, each sees itself as having three votes.

If nodeC crashes, nodeA and nodeB each see themselves with two votes, maintaining quorum. They have authority, so they fence nodeC and resume operations.

If a network disruption prevents any node from being in contact with the others, then each of nodeA, nodeB, and nodeC counts a total of only one vote - its own. No single node would consider itself quorate, so none could proceed with cluster activities. No node has quorum, so no node fences the other missing members, and no node continues with resource management.


Four nodes, majority wins: Four nodes in the membership - Total expected votes: 4; votes needed for quorum: 3

If nodeA, nodeB, nodeC, and nodeD are in contact - they all see themselves as having four votes.

If nodeD crashes while the other three stay in contact - those three each see three votes, maintaining quorum. They have authority to fence the missing member and resume activity in the cluster.

If nodeD had its network connection disrupted - it would see itself as having one vote, while the partition of nodeA+nodeB+nodeC would consider themselves as having three votes. nodeD would have no authority, so it would not carry out fencing; the partition of three members would have authority, so they would fence nodeD and resume activity.

If a split occurs down the middle of this four-node membership - resulting in partitions of nodeA+nodeB vs nodeC+nodeD - each member of each partition would consider itself having two votes. Two votes is not enough for quorum, so both sides would lose the ability to operate.

NOTE: Such an even-sized cluster might be useful if there's low risk of having such 50/50 splits - as it would still offer protection from the failure of some number of nodes (one in the case of a four-node), providing high availability for services through the remaining nodes. But if the cluster is at risk of having half or more of the nodes fail - or if the layout is susceptible to communication losses between two even sets of nodes - then a further arbitration method would be useful.


Four nodes with corosync's auto_tie_breaker: Four nodes in the membership - Total expected votes: 4; votes needed for quorum: 3; one pre-determined partition considered quorate if it has 2 votes.

This scenario is very similar to the previous, with a key difference in how it handles even-splits in membership.

If the membership splits into two-member partitions, the auto_tie_breaker feature would influence one of those partitions to consider itself as quorate and the other to be inquorate. This determination is based on which partition contains a specific member - by default the lowest node ID in the membership.

If the "tie-breaker node" is nodeA, and the membership splits into partitions of nodeA+nodeB vs nodeC+nodeD - both partitions have only 2 out of 4 votes, normally not enough for majority-based quorum. However, with the tie-breaker node belonging to the nodeA+nodeB partition, those members consider themselves quorate as long as they are in contact with each other and have 2 votes between them.

If nodeA and nodeB crash - maybe due to a power failure - nodeC and nodeD would be in a partition by themselves, giving them 2 votes. But since the tie-breaker node is not a member of their partition, those 2 votes are not enough for quorum.

NOTE: This configuration can be useful for giving a portion of the cluster some way to continue providing service through a subset of failure scenarios; however this cluster is also at risk of ceasing to function if the membership splits and the pre-determined member is not around. Most deployments would benefit from a quorum arbitration method with more intelligence, such as qdevice.
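The default tie-break on the lowest node ID can be sketched as follows (hypothetical code working on member counts; the real feature operates on votes and supports configurable tie-breaker node IDs):

```python
# Hypothetical sketch of auto_tie_breaker's default behavior: on an
# exact 50/50 split, the partition holding the lowest node ID wins.
def atb_quorate(partition_ids, all_ids):
    votes, total = len(partition_ids), len(all_ids)
    if 2 * votes > total:
        return True                           # ordinary majority
    if 2 * votes == total:
        return min(all_ids) in partition_ids  # tie-break on lowest ID
    return False

print(atb_quorate({1, 2}, {1, 2, 3, 4}))  # True  (holds node 1)
print(atb_quorate({3, 4}, {1, 2, 3, 4}))  # False
```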


Four nodes with qdevice / qnetd: Four nodes in the membership plus one vote from the qdevice - Total expected votes: 5, votes needed for quorum: 3

The minimum for quorum would be three again - as with the previous setup, but the quorum-device functionality would let one set of nodes with less than a majority of votes maintain quorum if they could contact the external qnetd server.

If there is an even split between nodeA+nodeB vs nodeC+nodeD where both sets could still contact the quorum server, then one of those two sets would be chosen as the winner according to the configured algorithm; that partition would maintain quorum, while the other partition would become inquorate.

If the membership split and only one of those partitions could reach the quorum server, that partition would "win" and maintain quorum.
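As a hypothetical illustration of such a setup, the qdevice client is configured in the quorum section of corosync.conf. The hostname below is a made-up example, and ffsplit (which contributes one vote, matching this scenario) is one of the available algorithms:

```
quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            host: qnetd.example.com
            algorithm: ffsplit
        }
    }
}
```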

