Diagnostic Procedures for RHEL High Availability Clusters - Troubleshooting fencing problems in RHEL 6, 7, or 8

Overview

Applicable Environments

  • Red Hat Enterprise Linux (RHEL) 6, 7, or 8 with the High Availability Add-On

Recommended Prior Reading

Troubleshooting Fencing

This article provides an overview of the procedures you can follow to troubleshoot a system in which fencing is failing and problematic services do not fail over as designed. Note that as a general practice, you should test your fence devices and fence configuration when initially configuring your cluster.

NOTE: If your cluster experiences communication failures, this often results in a node being fenced. In that case, however, the issue is one of communication rather than fencing itself, since the fence operation was successful. Fencing itself is the expected result of communication failure, and is the way the system protects itself. For information on the importance of fencing, see Fencing in a Red Hat High Availability Cluster.

Conditions to Look For

If fencing is failing in your cluster, you may see the following symptoms on a cluster that had previously been running:

  • Cluster resources do not fail over or fail back
  • You cannot start and stop your cluster resources
  • Cluster daemons do not start up fully
  • Cluster operations fail soon after the cluster starts
  • gfs2 and clvmd operations hang

In general, when a fence operation is failing, all cluster-related activity is blocked.

Note that some of these symptoms could also result from communication failures among nodes, which in turn cause fencing to fail. When fencing is failing, however, that is the symptom to resolve first to get the cluster functioning properly under those conditions: the failed fence operation is what blocks everything else.

Fencing failure blocks cluster activity in a Red Hat High Availability cluster because all of the cluster daemons tie in to the fence daemons to determine whether a missing node has been fenced. Until the node has been fenced, the cluster daemons cannot perform any action. This prevents the situation known as "split brain", in which two nodes of the cluster cannot communicate with each other and, as a result, each carries out actions independently. Even when fencing fails you will not get a split brain, because the daemons continue to wait until fencing succeeds.

CAUTION: In a Pacemaker cluster, it is possible to disable stonith, in which case fencing is not attempted and the cluster daemons and operations do not wait for fencing to complete. If one node loses contact with another node, it could take over the resources that are still running on that node. For this reason, it is important that you do not disable stonith to address issues with fencing failure, even if this allows your cluster to continue a blocked operation.
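To confirm whether stonith is currently enabled, you can check the cluster property. This is a sketch using pcs syntax; the exact subcommand may vary slightly between pcs versions:

```shell
# Check whether fencing is enabled cluster-wide; the value should be "true"
pcs property show stonith-enabled

# If stonith was disabled to work around a fencing problem, re-enable it
# once the underlying fence issue is resolved
pcs property set stonith-enabled=true
```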

General Diagnostic Approach

Determine if fencing failed during initial cluster setup: You may see indications of fencing failure when you are first setting up your cluster, even if you have not yet created any fence devices. This can result from a timing issue, when the different cluster components have not all had time to come up with all the daemons running on each node. In this situation, nodes attempt to fence the nodes they cannot yet talk to.

Once the system daemons are up and running, this situation should resolve itself. A system running Pacemaker will continue to show errors after they have been resolved, however, since Pacemaker keeps a record of the number of failures that have occurred so that it can determine when a failure threshold has been reached, at which point the resource is migrated to another node. If you determine that your issue is a temporary one caused by a time delay during initial startup, you can clear those error messages from your display with a pcs stonith cleanup or a pcs resource cleanup command.
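For example, after confirming that the recorded failures were only startup-timing noise:

```shell
# Clear recorded failure history for stonith devices
pcs stonith cleanup

# Or clear recorded failure counts for all cluster resources
pcs resource cleanup
```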

Look for evidence of fencing failure: On a Pacemaker system, a pcs status command will show a fence device as stopped if it is not working properly.

In the system log (/var/log/messages, for example) you might see messages regarding the following situations, any of which could indicate a fencing failure:

  • a processor failed forming a new configuration
  • a fencing timeout expired
  • cluster membership changed to include only the local member
  • failure to stop a resource

You may, for example, see a message indicating that the 'pcmk_reboot_action' did not succeed, which would indicate that the command sent to the fence agent to reboot a node did not succeed.

Reproduce the fencing failure: As a first step in troubleshooting a fencing failure, you should test the fencing again to see if you can reproduce the same error before changing the configuration. If you find, for example, that a consistent fencing timeout occurs every time you perform the same action, then you know that the error is reproducible and you can try different ways of resolving the error.

One way to test the error is to bring the system back up and panic the node for which fencing failed, while watching the logs for any messages. If you cannot reproduce the error, the error may have been specific to a moment in time, such as an event in your system's network. If your site has a network administrator, you might want to see if the network itself logged anything at the time of the fencing failure. In this situation, if you choose, you can implement more thorough monitoring or diagnostic tools and wait for this to occur again.
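One common way to panic a node deliberately is the kernel's sysrq trigger. This crashes the node immediately, so use it only on a test cluster, or on a node you fully intend to see fenced, while watching the logs from a surviving node:

```shell
# DESTRUCTIVE: immediately crashes this node so the cluster must fence it.
# Requires sysrq to be enabled (see the kernel.sysrq sysctl).
echo c > /proc/sysrq-trigger
```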

If you can reproduce the scenario, reproduce the failure three or four times to make sure that what you see in the system status and logs is consistent, so you can be sure you are addressing the correct issue.

Increase the fencing timeout values: As the next step in troubleshooting a fence device failure, increase the timeout values for the device. This can help you determine if there is an issue you need to resolve or whether the operation takes a longer time in the environment than the default timeout values allow.

The default timeout values for Pacemaker stonith devices are 20 seconds. Some devices might take longer than that to respond, so you could try increasing the value to 120 seconds. If a fence device takes longer than that to respond, that in itself is a problem.

Pacemaker provides several stonith timeouts that allow you to set timeouts for specific operations on a specific device, which you can then test individually:

  • stonith-timeout: the overall timeout that applies to rebooting a node through stonith, including all of the actions needed to fence the node
  • pcmk_monitor_timeout: the timeout for monitor operations on the device
  • pcmk_start_timeout: the timeout for start operations on the device
  • pcmk_stop_timeout: the timeout for stop operations on the device
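As a sketch, assuming a configured stonith device named myfence (a placeholder for your device's name), these timeouts might be raised as follows:

```shell
# Raise the cluster-wide stonith timeout
pcs property set stonith-timeout=120s

# Raise per-operation timeouts on a specific stonith device
pcs stonith update myfence pcmk_monitor_timeout=120s pcmk_reboot_timeout=120s
```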

For further information on setting stonith timeouts, see the related Red Hat knowledge base articles.

To see the default values for pcs stonith properties, run the pcs property show --defaults command.

To see the current values for pcs stonith properties, including unset properties and their defaults, run the pcs property show --all command.

Analyze the system after timeout increase: If fencing still fails after increasing your system's fencing timeout values, analyze your system's state and any error messages that were generated, looking for changes in behavior. At this point your system may return an error rather than simply timing out.

A fence agent itself has many timeout values that control timeouts at different stages of processing. Increasing a stonith timeout value may give the agent enough time to hit one of these internal timeouts instead. This results in a different error message, but the underlying problem may still be a timeout value that causes the fence agent to return an error.

  • Pacemaker clusters: as a starting point for tuning timeouts, target the stonith-timeout cluster property, the pcmk_reboot_timeout stonith device attribute, and the fence-agent attributes shown by pcs stonith describe <agent>.
  • cman clusters: review the fence agent's man page for details of the timeout settings for that agent.
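For example, using fence_ipmilan as a representative agent:

```shell
# List the parameters (including timeout-related ones) a fence agent
# accepts on a Pacemaker cluster
pcs stonith describe fence_ipmilan

# On cman clusters, consult the agent's man page for its timeout settings
man fence_ipmilan
```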

Retest fence agent and fence agent parameters: If your fence failure is not a transient failure and is not caused by a timeout, you need to check the fence agent and all of the individual fence agent parameters to determine what may be misconfigured for your current system.

Determine what has changed in your system: In theory, you would have uncovered any issues with initial configuration when you first configured and tested your system. If you fully tested the fence configuration parameters when you set up your system, then it is likely that something has changed since the original configuration. For example, something might have changed in the network configuration at your site. This could change your system operation if, for example, you are using multicast but a network team, unaware of this, disabled multicast at some point after your initial testing. Determining what may have changed is a starting point for testing your fence parameters.

Test whether any other node in the system can fence the problematic node: A high-level fencing test is to perform a manual fence to see whether any of the nodes in your system can fence the node for which fencing has failed. If none of the nodes can perform the fence action, the device configuration probably does not correspond to the actual device. For example, in a Pacemaker system you may have an incorrect or missing pcmk_host_list.
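A manual fence can be attempted from each surviving node in turn; node2 below is a placeholder for the problematic node's name:

```shell
# From a surviving node, manually fence the problem node
pcs stonith fence node2

# Afterwards, check that the cluster recorded the fence as successful
pcs status
```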

Test your fence agent parameters: Ultimately, your troubleshooting procedure may involve testing each individual parameter that the fence device touches in the course of operation, to determine whether that parameter is set correctly and to analyze what is happening in your system with regard to that parameter. To do this, you can manually fence your node, passing the same parameters that you configured for your system, to see what the command returns.
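For example, assuming an IPMI-based device, you might invoke the agent directly with the values from your configuration. All values shown are placeholders, and the "status" action checks connectivity without rebooting anything:

```shell
# Run the fence agent by hand with the same parameters the cluster uses
fence_ipmilan --ip=192.0.2.10 --username=admin --password=secret --action=status
```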

Update your fence configuration: If you determine that a parameter of your fence configuration requires an update, you can then modify your fence device configuration to reflect this correction. At this point you can run a pcs status command to be sure that your system is behaving as designed.

A system running Pacemaker will continue to show errors after they have been resolved. If you are confident you have corrected your issue, you can clear those error messages from your display with a pcs stonith cleanup or a pcs resource cleanup command.

Manually fence to recover from hardware errors, if necessary: If a piece of fencing hardware has failed, there is no way to reintroduce that node to the cluster quickly. The failed device may not have released its resources, however, and the other nodes in the cluster cannot determine that fencing has occurred. In this case, you can manually acknowledge that fencing has occurred.
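Once you have physically verified that the failed node is really powered off, you can acknowledge the fence manually; node2 is again a placeholder:

```shell
# CAUTION: only confirm fencing after verifying the node is truly down;
# confirming a node that is still running risks data corruption.
pcs stonith confirm node2
```

On cman clusters, the fence_ack_manual command provides the equivalent acknowledgment.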
