Delaying Fencing in a Two Node Cluster to Prevent Fence Races or "Fence Death" Scenarios
Environment
- Red Hat Enterprise Linux (RHEL) 5 Update 6 or later (with the High Availability Add-On)
- Red Hat Enterprise Linux (RHEL) 6 Update 1 or later (with the High Availability Add-On)
- Red Hat Enterprise Linux (RHEL) 7 and later (with the High Availability Add-On)
- Two-node High Availability cluster
- While a `delay` can be used for various purposes, it is typically only needed in environments where the nodes can race to fence each other. This generally only occurs if the fence device is accessed over a separate network interface from the one used for cluster communications.
Issue
- How can I avoid fence races or fence death scenarios when both nodes detect a token loss?
- How can I resolve a one-time fencing race without a quorum disk in a two-node cluster?
- My two-node cluster turned itself off.
- Two nodes are fencing each other.
- Adding a delay attribute to a fence agent
Resolution
RHEL 6 with `cman` and `pacemaker`, RHEL 7 and later with `pacemaker`
When each stonith device manages one node
Configure one node's `stonith` device with the `pcmk_delay_base` attribute:
# # pcs stonith create <name> <agent> [options] pcmk_delay_base=<seconds>
# # Example:
# pcs stonith create node1-fence fence_ipmilan ipaddr=node1-ilo.example.com login=fenceuser passwd=fakepassword pcmk_host_list="node1.example.com" pcmk_delay_base=5s
# pcs stonith create node2-fence fence_ipmilan ipaddr=node2-ilo.example.com login=fenceuser passwd=fakepassword pcmk_host_list="node2.example.com"
The delay should be configured on all stonith devices for the node that should win fence races. Note that a delay should not be set on a device that is shared between nodes: if both nodes use the same device, create a separate stonith device for each node using the same parameters, set pcmk_host_list to include only the applicable node on each, and specify pcmk_delay_base only on the device for the node that should win races.
Note: If `pcmk_delay_base` is not recognized as a valid option on very old pacemaker releases, try `delay=<integer>` instead. `pcmk_delay_base` accepts a unit suffix after its value (e.g., 5s), but `delay` requires a bare integer (e.g., 5), which is interpreted as a number of seconds.
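The unit difference can be sketched with a trivial shell helper (illustrative only, not part of pcs): stripping the trailing "s" converts a pcmk_delay_base-style value into the bare integer the legacy delay option expects.

```shell
# Illustrative only: convert a pcmk_delay_base-style value (e.g. "5s")
# to the bare-integer form the legacy delay option expects (e.g. "5").
to_bare_seconds() { printf '%s\n' "${1%s}"; }

to_bare_seconds 5s   # prints 5
to_bare_seconds 10   # prints 10 (already bare)
```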
When a stonith device manages multiple nodes
Some fence agents, like `fence_vmware_soap` or `fence_scsi`, can manage multiple nodes with one stonith device. Setting the `delay` attribute on the shared device applies the same delay to each node's stonith actions. This does not prevent fence races.
pacemaker-1.1.12-22.el7_1.2 or earlier
Instead of creating a shared device, create a separate device for each node and use the `delay` attribute.
# # pcs stonith create <name> <agent> [options] delay=<seconds>
# # Example:
# pcs stonith create vmfence1 fence_vmware_soap ipaddr=vcenter.example.com login=fenceuser passwd=fakepassword ssl_insecure=1 pcmk_host_map='node1.example.com:vm-node1' delay=5
# pcs stonith create vmfence2 fence_vmware_soap ipaddr=vcenter.example.com login=fenceuser passwd=fakepassword ssl_insecure=1 pcmk_host_map='node2.example.com:vm-node2'
pacemaker-1.1.12-22.el7_1.4 or later
Configure the stonith device with the [`pcmk_delay_max`](/solutions/3565071) attribute.
# # pcs stonith create <name> <agent> [options] pcmk_delay_max=<delay_time>
# # Example:
# pcs stonith create vmfence fence_vmware_soap ipaddr=vcenter.example.com login=fenceuser passwd=fakepassword ssl_insecure=1 pcmk_host_map='node1.example.com:vm-node1;node2.example.com:vm-node2' pcmk_delay_max=15
For more information, refer to the following article: How do I delay fencing to prevent fence races when using a shared stonith device in a two-node cluster?
pacemaker-2.1.2-4.el8 or later
The pcmk_delay_base parameter may now take different values for different nodes.
When configuring a fence device, you can now specify different values for different nodes with the pcmk_delay_base parameter. This allows a single (or shared) fence device that is used by multiple cluster nodes to apply a different delay to each node, which helps prevent a situation where each node attempts to fence the other at the same time. To specify per-node values, map each host name to its delay value using a syntax similar to pcmk_host_map. For example, node1:0;node2:10s would use no delay when fencing node1 and a 10-second delay when fencing node2.
In the following example, fencing of node1.example.com is delayed by 10 seconds. In the event of a split brain, node2.example.com would therefore be fenced first, because the fence action against node1.example.com must wait out the delay.
# pcs stonith create vmfence fence_vmware_soap ipaddr=vcenter.example.com login=fenceuser passwd=fakepassword ssl_insecure=1 pcmk_host_map='node1.example.com:vm-node1;node2.example.com:vm-node2' pcmk_delay_base='node1.example.com:10'
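The per-node map syntax can be illustrated with a small shell helper (a sketch of the documented 'host:delay' semantics, not pacemaker's actual parser):

```shell
# Illustrative only: look up the delay for a given host in a
# pcmk_delay_base map of the form 'host:delay;host:delay'.
delay_for() {  # $1 = map, $2 = host
  printf '%s\n' "$1" | tr ';' '\n' | awk -F: -v h="$2" '$1 == h { print $2 }'
}

map='node1.example.com:10;node2.example.com:0'
delay_for "$map" node1.example.com   # prints 10 -- fencing node1 is delayed
```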
RHEL 5 or RHEL 6 with `cman`
Configure one node's fence device with the `delay` attribute in /etc/cluster/cluster.conf.
For example, in this instance node1's fence device has a delay of 10 seconds, so any attempt to fence node1 must wait 10 seconds before proceeding. The other cluster node (node2) has no delay on its device, so node1 will attempt (and should complete) fencing node2 before node2 can fence node1:
<clusternode name="node1.example.com" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="Vmware" port="node1" uuid="4223dbb3-6ec6-fg65-ef4c-dhu7562dff56cf" delay="10"/>
</method>
</fence>
</clusternode>
Alternatively, the delay can be set where the fence device itself is defined:
<fencedevices>
<!--
<fencedevice name="<name>" agent="<agent>" ipaddr="<ip or hostname>" [... parameters ...] delay="<seconds>"/>
Example:
-->
<fencedevice name="node1-virt" agent="fence_xvm" port="vm-node1" delay="5" />
</fencedevices>
After making any such changes to /etc/cluster/cluster.conf, increment the config_version and propagate the updated configuration to all nodes.
General Notes
- Not all fence agents support the `delay` attribute. Check the man page for the fence agent in use to determine whether it is supported, or contact Red Hat Global Support Services for assistance.
- The node with the `delay` specified on its device will win fence races; the node without the `delay` will be the one fenced in the event of a network split and fence race.
- The delay should usually be at least 5 seconds, to give one node enough time to complete the fencing operation before the other node begins. This may need to be adjusted based on how long the fence action actually takes to complete.
- It is strongly recommended that the "losing" node (the one without a delay on its fence device) be configured to avoid fencing the other node when it boots back up, also known as a "fence loop".
- The nodes may require important errata for fence delays to work properly in certain environments.
See Also
- What are my options for avoiding fence races in RHEL 5, 6, and 7 High Availability clusters with an even number of nodes?
- Can pacemaker fence the cluster node with the fewest running resources?
Root Cause
Fence Races
Two-node clusters in Red Hat Enterprise Linux operate in a special mode. Traditionally, fencing requires a property of the cluster called quorum - the minimum set of hosts required to provide service (in some cluster technologies this is also referred to as a primary component; the terms are synonymous). In Red Hat Enterprise Linux, the specific quorum algorithm used is simple-majority quorum, meaning a majority of hosts must be online for quorum to be present. In an 8-node cluster, at least 5 nodes must be online in order to provide service; in a 5-node cluster, at least 3; and so forth. Generally speaking, quorum is a means to prevent a case called a split brain, where two subsets of a cluster operate independently of one another.
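The simple-majority arithmetic above can be written out directly (a sketch, not cluster code):

```shell
# Sketch of simple-majority quorum: strictly more than half of the
# configured nodes must be online for the cluster to be quorate.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 8   # 5 nodes needed
quorum 5   # 3 nodes needed
quorum 2   # 2 needed -- a lone survivor could never be quorate, hence two_node mode
```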
In two node clusters, there is no majority when one node is active. Instead, the cman component in RHEL 5 and 6 clusters relies on a special mode called two_node, as does corosync's votequorum plugin in RHEL 7 clusters. In two_node mode, both hosts always have quorum, resulting in limited split brain behavior. The reason that this is a limited split brain case is because all of the components provided by Red Hat's clustering products not only rely on quorum, but also a mechanism called I/O Fencing (some clustering technologies call this STONITH or STOMITH (acronyms for Shoot The Other Node/Machine In The Head)). I/O fencing, or simply fencing, is an active countermeasure taken by a cluster in order to prevent a presumed-dead or misbehaving cluster member from writing data to a piece of critical shared media. The act of cutting off this presumed-dead member prevents data corruption on shared media. Since all of Red Hat's High Availability and Resilient Storage components rely not only on quorum, but also fencing, data integrity is preserved in this limited split brain case even if both sides maintain quorum.
Now, when a two-node cluster partitions into two independent sides for any reason, both nodes, since they have quorum, enter what is called a fence race. This means both are trying to cut each other off in order to establish a new leader so recovery can complete, thereby allowing the cluster to continue providing service to clients.
In cases where the fence device can be accessed independently from both nodes even while they are unable to communicate with each other, such a race can be problematic, as both nodes may be able to initiate a power-off action against the other simultaneously, leaving both nodes down. This is typically only possible with power-fencing methods, and only when the fence devices are accessed over a separate network interface from the one used for cluster communications: if a single interface were shared for both fencing and cluster communications, a split between the nodes would usually also mean that at least one of them is unable to reach the fence device for its missing partner, preventing any fence race.
Certain fence agents support an attribute called delay which will cause a node accessing that device to simply wait a defined number of seconds before proceeding. This is effective in ensuring that the result of a fence race is predictable, since the node without a delay on its device will be fenced faster. In other words, the node whose device is configured with a delay will be able to fence the node without a delay very quickly, whereas the other node will have to wait for a short time before fencing this one. This imbalance results in one node being predictably fenced when there is a race, and if desired, that node can be configured to be preferred to run critical services, so that there is less chance that the service-owning node is the one getting fenced when there is a network split.
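That predictability can be modeled as a trivial comparison (a toy sketch, under the assumption that whichever pending fence action carries the shorter delay completes first):

```shell
# Toy model only: each node's fence action against its peer waits the
# delay configured on the *victim's* stonith device, so the victim with
# the shorter delay is fenced first.
first_fenced() {  # $1 = delay when fencing node1, $2 = delay when fencing node2
  if [ "$1" -gt "$2" ]; then echo node2; else echo node1; fi
}

first_fenced 10 0   # prints node2: node1's device has the delay, so node1 survives
```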
Avoiding Fence Loops After Fence Races
However, a fence delay alone does not fully solve the problem of a fence loop in a split-brain scenario. After the node that "lost" the fence race reboots, it will come back up, and if the network split persists, it will form a quorate single-node cluster and attempt to fence the other node.
To ensure that this does not happen, it is strongly recommended that additional measures be taken to avoid a recently-fenced node fencing the remaining node when it joins back up.
Resources
For more information, see the tech brief at https://access.redhat.com/knowledge/techbriefs/fencing-red-hat-enterprise-linux-methods-use-cases-and-failover.
The delay functionality was only added to select fence devices starting with cman-2.0.115-68.el5 in RHEL 5 and fence-agents-3.0.12-23.el6 in RHEL 6.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.