How can I prevent my RHEL High Availability cluster from repeatedly failing to fence a node while the fence device is not accessible?

Solution Verified - Updated

Environment

  • Red Hat Cluster Suite (RHCS) 4
  • Red Hat Enterprise Linux (RHEL) 5, 6, 7 or 8 with the High Availability Add On
  • One or more fence/stonith devices using an agent which communicates with the device over the network
    • fence_scsi, fence_kdump, and fence_virt are examples of agents that do not use the network

Issue

  • When a node of the cluster loses power at the same time its fencing device loses power, cluster services do not fail over and/or GFS filesystems are locked.
  • When the network goes down, fencing of the other node fails and everything locks up.
Nov 19 12:55:50 node1 fenced[2080]: fencing node node2 still retrying
Nov 19 13:26:16 node1 fenced[2080]: fencing node node2 still retrying
Nov 19 13:56:42 node1 fenced[2080]: fencing node node2 still retrying
  • How can I ensure I have enough redundancy in my fence configuration to avoid the cluster blocking if there is a network problem?
  • What is the optimal network configuration for fence devices?
  • How do I configure a backup fencing method?

Resolution

Fix the Current Problem

Once a cluster is in a state where it is repeatedly failing to fence a node, there are two general methods for correcting it:

a) Restore the nodes' ability to reach the fence devices and carry out their fencing operation. For example, if the network is down, restore connectivity; if the fence device hardware has failed, replace it; if power to the device has been severed, restore power. If you replace the device, be sure it has the same login credentials and other settings as the previous device, so that subsequent login attempts from the fenced or stonith-ng daemon succeed. Once the problem causing the fence operations to fail is corrected, the next retry of the operation should succeed, allowing the cluster to recover and resume operation, provided quorum is still held.
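To verify that a fence device is reachable again after correcting the underlying problem, the device's fence agent can be queried directly from a cluster node. A minimal sketch using fence_ipmilan is shown below; the address and credentials are placeholders for your environment, and the agent will differ if you use a different device type:

```shell
# Placeholder address/credentials -- substitute your device's values.
# A successful "status" call confirms the node can log in to the device
# again, so the cluster's next fencing retry should succeed.
fence_ipmilan --ip=192.168.10.11 --username=admin --password=secret \
    --action=status
```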

b) Manually fence (power off) the failed node(s), then acknowledge on a remaining node that this has been done, using fence_ack_manual in cman-based clusters or pcs stonith confirm in pacemaker-based clusters.

Confirming manual fencing in cman clusters with fence_ack_manual

WARNING: The errant node must be completely powered off before running this command. Otherwise data corruption or other shared resource conflicts may occur.

RHEL 6
# fence_ack_manual -n <nodename of manually fenced node> 
RHEL 5
# fence_ack_manual -e -n <nodename of manually fenced node> 

Confirming manual fencing in pacemaker clusters with pcs stonith confirm

WARNING: The errant node must be completely powered off before running this command. Otherwise data corruption or other shared resource conflicts may occur.

# pcs stonith confirm <nodename of manually fenced node>

Prevent Future Occurrences

To prevent this problem, add a backup fence device or add sufficient redundancy to the fence device layout.

Adding a Backup Device

If another method can supplement the existing primary fence device, it can help prevent the entire cluster from blocking when the primary method is unavailable. fence_scsi is a popular choice for a backup fence method because it does not require access to the network to function: even if the primary fence device cannot be reached due to a network outage, it may still be possible to cut another node off from the shared resources via SCSI persistent reservations and proceed with normal operations. Secondary power methods or other storage-based fence devices accessed over the network are also acceptable, but if they use the same network interface and/or infrastructure as the primary device, they may not add any useful redundancy, because a single network outage can make both inaccessible.
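In a pacemaker-based cluster, a backup device can be layered under the primary one with fence levels. The following is a sketch only, with hypothetical node names (node1, node2), device names, addresses, and credentials; substitute the agent and parameters appropriate to your hardware:

```shell
# Primary device: integrated management controller reached over the network
# (placeholder address/credentials).
pcs stonith create fence-ipmi-node1 fence_ipmilan \
    ip=192.168.10.11 username=admin password=secret \
    pcmk_host_list=node1

# Backup device: SCSI persistent reservations on shared storage, which do
# not depend on reaching a fence device over the network.
pcs stonith create fence-scsi fence_scsi \
    devices=/dev/mapper/shared \
    pcmk_host_list="node1 node2" \
    meta provides=unfencing

# Fence levels: try the IPMI device first; fall back to fence_scsi if it fails.
pcs stonith level add 1 node1 fence-ipmi-node1
pcs stonith level add 2 node1 fence-scsi
```

With this layout, a power loss that takes down both node1 and its integrated fence device no longer blocks the cluster: level 1 fails, and level 2 revokes node1's SCSI registrations instead.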

Adding Redundancy for Existing Devices

Some styles of fence device are inherently prone to common types of failures which may leave them inaccessible to the cluster at critical moments. A common example is the use of devices integrated into the motherboard of the server they manage, such as iLO, DRAC, RSA, or others like them. With these types of devices, if power to the server is lost, or the server's motherboard fails completely, then the device's ability to process incoming connections or carry out fence operations is generally lost as well, causing fencing to fail repeatedly. As such, these devices should be made redundant in as many ways as possible to account for such situations. This may mean configuring redundant power supplies, redundant network connections, or adding a backup device as described above.

In general, if a single device is responsible for fencing a node and has no backup device configured, then that device should have redundancy in all of its components in order for it to be considered completely resilient to failures. If the hardware only has a single processor, or single network connection, or single power supply, then it cannot be counted on to offer fencing services for the node(s) in question under all circumstances, and it should be made redundant in whatever way possible.

Adding Redundancy for Network Connections to Devices

Having redundancy in the network architecture of a cluster is critical to minimize the risk of a single event completely blocking all activity. Redundant rings or bonded interfaces for the cluster communication network are always recommended, but it's common for the fencing network to be overlooked. If only a single interface carries traffic to the fence devices, that network can become unusable from a single hardware failure, high-load event, or other connectivity problem. Even if the server has a bonded or teamed network connection on the network over which a fence device is accessed, if the device itself does not have redundant connections then connectivity can still be blocked if the single network path to that device fails. There should be redundancy at every connection point to ensure that connectivity can be maintained at all times.
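As an illustration of host-side redundancy, a bonded interface dedicated to fence-device traffic can be built with nmcli on RHEL 7/8. The interface names (eno3, eno4) and address below are placeholders; the device side still needs its own redundant connections for the path to be fully resilient:

```shell
# Active-backup bond for fence-device traffic; either port can fail
# without losing connectivity to the fence devices.
nmcli con add type bond con-name bond-fence ifname bond-fence \
    bond.options "mode=active-backup,miimon=100"
nmcli con add type ethernet con-name bond-fence-p1 ifname eno3 master bond-fence
nmcli con add type ethernet con-name bond-fence-p2 ifname eno4 master bond-fence
nmcli con mod bond-fence ipv4.method manual ipv4.addresses 192.168.20.10/24
nmcli con up bond-fence
```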

If the fence device is accessed by cluster nodes over the same network interface that the cluster communications are carried over, then it may be worthwhile to assess if splitting the fence traffic to another interface may offer additional redundancy.

Root Cause

When using Red Hat cluster software, at least one fence device should be defined for every node so that it can be disconnected from the cluster's shared storage in the event of a failure or error. Fencing cuts off I/O from shared storage, thus ensuring data integrity. The two most common types of fencing are:

  • Power fencing: The cluster software logs in via telnet, ssh, or SNMP to a power device, such as an APC switch, Dell DRAC, HP iLO, IBM RSA, or similar, and turns off (and optionally back on) the power for the node, causing a hard shutdown and, usually, a subsequent reboot.

  • I/O fencing: The cluster software logs in to a fibre channel switch via telnet or ssh and disables the port(s) for that node, thereby cutting off its access to shared storage. This method requires that an administrator manually reboot or shut down the errant node to recover it, and log in to the switch interface to re-enable the appropriate port(s). This can also be achieved via SCSI reservation fencing.
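When SCSI reservation fencing is in use, the registrations and reservation currently held on the shared device can be inspected with sg_persist from the sg3_utils package. The device path below is a placeholder:

```shell
# List the registered keys on the shared device -- each cluster node
# normally holds one registration; a fenced node's key is removed.
sg_persist --in --read-keys --device=/dev/mapper/shared

# Show which key, if any, holds the persistent reservation.
sg_persist --in --read-reservation --device=/dev/mapper/shared
```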

In RHEL 4, 5, and 6, if a CMAN transition requires the fencing of a node, fenced is contacted and organizes the fence action that isolates the target node. Similar behavior is carried out by stonith-ng in pacemaker-based clusters on RHEL 6 and 7.

If all configured fence methods or levels fail, fenced will continue to loop through them until one succeeds. This is the standard behavior. While this is happening, the logs fill with messages from fenced reporting that the fence actions have failed.

In such a case the cluster's activities are blocked: shared resources and services such as DLM, GFS/GFS2, clvmd, and rgmanager will be blocked, and further resource management and recovery may be prevented. The fence action must complete successfully for the cluster to unblock itself and continue operations.

This problem is most noticeable in clusters using a single fence method of an integrated type such as iLO or DRAC, because in such cases, if power to the machine is lost, the other node will try to fence it in a never-ending loop: it can never contact the iLO or DRAC on the target machine, since the target has no power.

The documentation for the applicable release of RHEL being used in this High Availability cluster gives more details on configuring fence devices, methods, levels, and other components of the cluster to ensure maximum redundancy and optimal operation.

