Planning Fence Configuration in a Red Hat High Availability Cluster

Updated 4 Nov 2016

Before considering how you will configure fencing, you should make some decisions about how you will configure your Red Hat High Availability cluster as a whole. For a general outline of the issues in planning configuration for your cluster, see Initial Cluster Planning Considerations.

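This article provides a summary of the considerations involved when planning the fence configuration for a Red Hat High Availability Add-On cluster. It includes sections on the following topics:

Fencing and Failure Recovery: General Questions and Overview
Types of Fencing
Choosing your fence device
Fencing Networks
Fencing Considerations for Two-Node Clusters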

NOTE: For general information on how fencing ensures that your cluster services remain operational, see The Importance of Fencing in a Red Hat High Availability Cluster. It is recommended that you familiarize yourself with the information in that article before planning your fence configuration.

Fencing and Failure Recovery: General Questions and Overview

A key component of cluster planning is defining how you want your cluster to recover from a failure so that the service your cluster provides remains available. When planning your fence configuration, you should take into account how fencing can help you achieve this.

Step one: Define failure scenarios

The first step in planning your fence configuration is to identify the possible failure scenarios. In general, these are the most common reasons your system might fail and become unresponsive:

Node failure due to power loss
Node failure due to system panic
Communication failure, through network failure
Storage failure

Step two: Determine response to failure

After considering the ways your system can fail, your next step is to determine what you want to happen when your cluster experiences a failure. What actions do you want to take when a failure of each type occurs?

For example, you might consider one of the following possibilities:

Power off a failing node, then power it back on immediately
Power off a failing node, then wait a defined period of time before powering it on
Wait for administrator intervention
Contact your Red Hat Service Representative

Step three: Determine how to configure fencing

Once you have determined how you want your cluster to react, you can consider how to achieve that through the cluster design, and in particular through fence configuration and fencing hardware.

To meet the needs you have determined, you can consider:

What type of fencing do I want?
What fencing agent in particular do I want?
What fencing options are available?
How do I choose my hardware?
How can I use multiple fencing devices?
What network equipment do I need to address my use cases?
How can I integrate my fencing configuration with the cluster, for example through cluster network configuration or defined timeouts?

The remainder of this article discusses some of the particulars of these questions.

Types of Fencing

The two most common types of fencing are:

Power fence agents: The cluster software logs in via telnet, ssh, or SNMP to a device such as an APC switch, Dell DRAC, HP iLO, or IBM RSA, and turns off (and optionally on) the power for the cluster node. This method will execute a hard "off" action.
I/O fence agents: The cluster software logs in to a fibre channel switch via telnet or ssh and disables the port(s) for that node, thereby cutting off its access to shared storage. This method requires that an administrator manually reboot or shut down the errant node to recover it, and log in to the switch interface to re-enable the appropriate port(s). Cutting off access to shared storage can also be achieved via SCSI reservation fencing.

In addition, you might want to use:

Diagnostic fencing: Diagnostic fencing detects that a node has entered a crash recovery service. It is not in itself a replacement for traditional fence devices, but it allows the crash recovery service to complete without being preempted by traditional fencing methods.

You should consider the following when planning your fence configuration:

What type of fencing is suitable for your needs?
Do you need to define backup fencing of a different fencing type for your system?
Do you need to define redundant fencing devices of the same type, where one will take over if the first one fails?
Do you want to implement a diagnostic fencing device in addition to your traditional fence device?

The following subsections provide additional information about the different types of fence devices.

Power fencing vs. storage fencing

It is generally recommended that you configure your system with power fencing. This performs a hard reboot, allowing your system to come up cleanly and rejoin the cluster quickly, without an extended period of degraded service within the cluster. If your services are load-balanced or otherwise distributed across your nodes, then while one node is down its workload is doubled up on another node. If you cannot bring a node's service back quickly and automatically, your system stays in that degraded state for longer. This may affect your service time and your responsiveness to clients. It also reduces resilience, since you cannot withstand another node failure or another application failure.

For your system, however, you may prefer a storage fencing solution, which cuts off access to storage but does not shut down the node. You may be running scripts or jobs on the node that are not managed by the cluster, and you want to continue providing these services. If your deployment takes a long time to start up, or if even a short period of down time for your system is an expensive proposition, you may not want to take the system down. For diagnostic purposes, you may prefer to cut off access to storage and leave your system in that state until you can inspect the state of the system that caused it to be fenced, which is information you would lose when the system is powered off.

While storage fencing will protect your data from corruption, it will not resolve issues of network conflicts and application availability that cause a node to be fenced. This is why power fencing is the preferred primary fencing mechanism.
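
As a rough illustration of the two approaches, the following sketch creates one power fence device and one storage fence device. The agents fence_ipmilan and fence_scsi are used only as examples; the device names, node names, address, credentials, and disk path are all placeholders, and the parameter names accepted by an agent can differ between fence-agents versions, so check them with pcs stonith describe before relying on them.

```shell
# Sketch only: every name, address, credential, and path below is a placeholder.

# Power fencing through a node's management card (IPMI).
# Parameter names vary by fence-agents version; confirm them with
# "pcs stonith describe fence_ipmilan".
pcs stonith create ipmi-node1 fence_ipmilan \
    pcmk_host_list="node1" ipaddr="10.0.0.101" \
    login="fenceuser" passwd="fencepass" lanplus=1

# Storage fencing through SCSI persistent reservations.
# A node fenced this way stays powered on, so the device is marked as
# providing unfencing, which lets the node regain access when it rejoins.
pcs stonith create scsi-fence fence_scsi \
    pcmk_host_list="node1 node2" \
    devices="/dev/disk/by-id/EXAMPLE-SHARED-LUN" \
    meta provides=unfencing
```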

Diagnostic Fencing Agents

A diagnostic fencing agent is not an agent that you configure by itself to fence a node, but an agent that performs a diagnostic procedure when a node is unresponsive. It allows you to power off a node through a traditional fencing agent, but to capture information through a core dump on that node as well, before it powers off. It is possible to configure a fencing agent so that it waits for a set period of time before powering off, which can also allow time for a core dump to complete, but it is not always an ideal solution; a diagnostic fencing agent, on the other hand, is designed specifically for this situation.

Incorporating a diagnostic fencing agent into your original design can be extremely helpful in analyzing system issues.
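
One way this is often set up is with the fence_kdump agent placed ahead of a power fence device by means of fencing levels. The sketch below assumes that kdump on each node is already configured to notify the cluster, and that a power fence device named ipmi-node1 already exists; both names and the node names are placeholders. Whether pcmk_reboot_action is needed depends on the agent version, so treat it as an assumption to verify on your system.

```shell
# Sketch only: device and node names are placeholders.

# Diagnostic device: succeeds only if the node reports that it has
# entered the kdump crash recovery service.
# pcmk_reboot_action="off" maps the cluster's reboot request to "off",
# for agent versions that implement only the "off" action (assumption).
pcs stonith create kdump fence_kdump \
    pcmk_host_list="node1 node2" pcmk_reboot_action="off"

# Level 1: wait for confirmation that node1 is capturing a core dump.
# Level 2: if no confirmation arrives, fall back to power fencing.
pcs stonith level add 1 node1 kdump
pcs stonith level add 2 node1 ipmi-node1
```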

Choosing your fence device

The Red Hat High Availability Add-On supports a variety of fencing agents. To display a list of all available agents on a system on which the Red Hat High Availability Add-On software packages and fence agents are installed, execute the following command:

# pcs stonith list

Each agent supports a specific set of options that affect how the fence agent interacts with the cluster, such as how a power cycle is implemented. Some power agents interact with a system management card, while others interact with the whole power switch and allow you to use multiple hosts with a single piece of hardware. Once you have determined your own system needs, you can see what is available to you.

To display a list of the supported options for a fence agent, execute the following command on a system on which the Red Hat High Availability Add-On software packages and fence agents are installed:

# pcs stonith describe <stonith agent>

Fencing Networks

A key component of planning a fence configuration is choosing your fence network. In defining how you want your cluster to react to failure scenarios, decisions you make regarding your network are going to play a large part.

Public and private networks

Consider whether you want to configure fencing on the private cluster interconnect or on the public network. In general, you need to evaluate what will go down or get blocked when the network goes down, and how you want to address that situation.

For a list of the pros and cons of private and public networks for fence devices, see Should I configure my RHEL High Availability cluster nodes to access their fence devices over the same network interface as the cluster uses for communication or a separate network interface?.

Naming your network connection

When designing your network connections, place your fence device on the network you have chosen for it, and name the device to reflect that network.

If your fence device is configured on the public network that your whole host organization works on, then it will have an IP address and host name that is reflective of it being on that network. If your fence device is configured on your private cluster network, it will have an IP address or host name on that network.

From the cluster's perspective, the host name or IP address indicates which network the cluster is using to connect to the fence device.

Redundancy in your fence device networks

To build a robust cluster, it is recommended that you configure more than one fencing method, both a primary and a backup method. When considering what these methods should be, you need to take into account that a fence method can fail for different reasons. A power switch itself might fail, or you might lose power. You might have a motherboard failure. You need to consider hardware failure, power failure, and network failure and design redundancy into your system to account for this.

Consider what a primary or backup setup means in terms of what a cluster is going to do, and how it is going to contact the fence devices. For example, if you configure a fence device on the same network that the nodes use to communicate with each other, then any failure of that network takes fencing down as well. This provides no real redundancy other than the redundancy that backs that single network. Separating your fence network onto a different network than the cluster network is often a good idea. You can even use the public network; even if the fencing network does not have the same level of redundancy as the cluster interconnect, you are still usually only going to have one network or the other fail.

Configuring redundant fencing methods is not always straightforward, and in planning for redundancy in your fence device networks you should consider ways in which your system might fail that would eliminate the redundancy. For example, dual power supplies alone do not necessarily provide redundancy. If you are using a system management card, you might be able to give dual power supplies to that piece of hardware. If your motherboard fails, however, or your power for the whole datacenter fails, then that one management card is down.

Similarly, you may have a bonding configuration with many slave links, but a misconfiguration on a switch or router can still prevent traffic from flowing. Or you may have a redundant bonding configuration that fails over to another slave only on a link failure, and there are many types of failures that do not induce a link failure. The operating system itself may not route over an alternate slave unless a specific condition occurs. You may be using RRP, but be able to reach a fence device over only one ring. What you should consider is where redundancy is most valuable for your configuration.

For information on configuring fencing for redundant power supplies, see How can I configure fencing for redundant power supplies in a RHEL 6 or 7 High Availability cluster with pacemaker?.
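
The primary and backup methods discussed in this section can be expressed as fencing levels. The following is a minimal sketch that assumes two stonith devices already exist for node1: a management card device named ipmi-node1 and a power switch device named pdu-node1. All three names are placeholders.

```shell
# Sketch only: node1, ipmi-node1, and pdu-node1 are placeholder names
# for a cluster node and two stonith devices that already exist.

# Primary method: the node's management card.
pcs stonith level add 1 node1 ipmi-node1

# Backup method: a power switch, tried only if level 1 fails.
pcs stonith level add 2 node1 pdu-node1
```

A level can also list several devices separated by commas, in which case every device in that level must succeed for the level to succeed. The version-specific details of using this for redundant power supplies are covered in the article referenced above.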

Fencing Considerations for Two-Node Clusters

As outlined in Fencing Networks, configuring access to your fence devices over a public network provides redundancy in your system. When you configure fencing on a separate network, however, an issue can arise in a two-node cluster when the private network through which the cluster nodes communicate with each other goes down. Since both nodes retain access to the fence devices through the public network, each node may try to fence the other at the same moment in a fence race, causing both nodes to go down and rendering the cluster inoperable. If this does happen, power-cycling a node with a power switch will restart the node, but you can avoid a fence race in the first place by configuring one of the nodes with a fence delay. For more information, see the Red Hat Knowledgebase articles on delaying fencing and avoiding fence races.

Note that when configuring a fence device with a delay, you will need a separate fence device for each node in the cluster, since with a shared fence device you cannot apply a delay to one node and not the other.
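
A fence delay with one device per node might look like the following sketch. It uses fence_ipmilan purely as an example agent; the device names, node names, addresses, and credentials are placeholders, and the 10-second value is an arbitrary illustration rather than a recommendation.

```shell
# Sketch only: all names, addresses, and credentials are placeholders,
# and parameter names may differ between fence-agents versions.

# Device that fences node1: waits 10 seconds before acting. This gives
# node1 a head start in a fence race, so node1 is the likely survivor.
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list="node1" ipaddr="10.0.0.101" \
    login="fenceuser" passwd="fencepass" delay=10

# Device that fences node2: no delay.
pcs stonith create fence-node2 fence_ipmilan \
    pcmk_host_list="node2" ipaddr="10.0.0.102" \
    login="fenceuser" passwd="fencepass"
```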

If you configure your fence devices on the private cluster network and that network goes down, you will not be able to access the fence devices and the entire cluster will go down, requiring administrator intervention to restore services. When evaluating the robustness of your configuration, you should determine what will happen if the nodes can no longer communicate with each other. As part of this evaluation, you should also determine what will happen if only a piece of the private network, such as a single port, goes down.

Even if you configure a two-node system with a fence delay on one device, you are still susceptible to the problem of fence loops. If you have successfully avoided a fence race with a fence delay on node 2, then node 1 remains up. If the problem that caused the initial fence activity remains, however, when node 2 restarts it will not be able to communicate with node 1 and it will fence node 1. Then, when node 1 starts up again, it will similarly fence node 2. For information on fence loops and how to avoid them, see the following article:

How can I avoid fencing loops with 2 node clusters and Red Hat High Availability clusters?
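
One commonly used mitigation, which is not necessarily the approach described in the article above, is to keep cluster services from starting automatically at boot. A fenced node then stays out of the cluster until an administrator has confirmed that the underlying problem is resolved. A minimal sketch:

```shell
# Sketch only: an illustration of one possible mitigation.

# Do not start cluster services automatically when any node boots.
pcs cluster disable --all

# Later, on a node that was fenced and rebooted, after confirming that
# it can communicate with its peer again, start the cluster manually.
pcs cluster start
```

The trade-off is that a rebooted node does not rejoin the cluster without administrator intervention.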
