Diagnostic Procedures for RHEL High Availability Clusters - General Membership and Communication Troubleshooting in RHEL 7, 8
Contents
Overview
Applicable Environments
- Red Hat Enterprise Linux (RHEL) 7, 8 with the High Availability Add-On
Recommended Prior Reading
Troubleshooting Cluster Membership and Communication
General Diagnostic Approach
Examine the state of the cluster at each stage: Look at the state of the cluster using utilities that show status output, and determine whether the cluster is currently stable and quorate on each node. Note that the state you find the system in is not necessarily the state it was in when the issue first occurred; at this point the key is to understand where things stand now so that decisions can be made about what to do next, and then to keep checking the status of the cluster after each action. If the cluster is stable to begin with, the problem may have passed and root-cause analysis may be initiated; if any problem is occurring now, it will first need to be addressed.
When examining the cluster's state, it is important to consider whether nodes can see each other, whether they have quorum, and whether there is any error reported or signs that the state is not what should be expected. If some nodes are not reporting others as online members, it could be a sign they may not be able to communicate with each other. If they're listed as "unclean" then they may have lost contact and are waiting for fencing. If some nodes can't be accessed, they might have been fenced in response to a problem.
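As a sketch of this kind of check, the snippet below scans captured status output for members that are not online. The sample text and node names are hypothetical stand-ins; on a live node you would capture real output first with something like `pcs status > pcs-status.txt`.

```shell
# Save a sample of status output to scan (stand-in for `pcs status > pcs-status.txt`)
cat > pcs-status.txt <<'EOF'
Online: [ node1 node2 node3 ]
OFFLINE: [ node4 ]
EOF

# Flag any line reporting OFFLINE or UNCLEAN members - either is a sign that
# nodes have lost contact or are waiting on fencing
problem=$(grep -E 'OFFLINE|UNCLEAN' pcs-status.txt)
echo "$problem"
```

Running the same scan on every node, and comparing the results, helps reveal whether all members share the same view of the cluster.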
Return the cluster to a stable state: You may need to get the cluster out of any problematic situation it is in. If the members have lost contact and cannot function as a cluster, you may not be able to update the configuration or interact with the cluster in the expected manner. The nodes may not be quorate, or the nodes could attempt to fence each other when you perform cluster operations, so tests may be disrupted if carried out without returning to a stable state.
The key to returning to service quickly is often simply rebooting some or all cluster nodes, at least if the cluster has previously been able to start cleanly that way. There may be other methods available to return to service, but identifying them may require time-consuming diagnostics and analysis, which can prolong a costly outage. Rebooting the cluster to a fresh state that is known to work can allow analysis to proceed without the pressure of a ticking clock.
While attempting to return to service, watch closely as the cluster carries out the necessary actions - starting daemons, communicating with other members, establishing the membership ring, beginning to manage resources - and try to identify any point where things start to go wrong. If something does go wrong, that should become the focus; once it is resolved, start the procedure again and see whether a new problem arises or the cluster can form and communicate properly. If everything comes up cleanly, then piecing the sequence of events back together can help identify the original problem.
Reconstruct the timeline for analysis: It is important to understand what state the cluster started in, what events occurred along the way, and what behaviors resulted. If the problem can be distilled down to a specific "if this then that" story, then it may allow for further testing to reproduce the problem, or could allow for conclusions about the state of a cluster component when the problematic behavior resulted.
This timeline can be constructed from:
- Log files
- Personal recollection of actions that were carried out, and observations that were made
- Command history, and any auditing that may be in place
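When building a timeline from logs, it can help to merge the per-node log files into a single chronologically ordered view. The sketch below assumes syslog-style timestamps from the same day, so sorting on the time field is sufficient; `node1.log` and `node2.log` are hypothetical excerpts standing in for each node's `/var/log/messages`.

```shell
# Hypothetical per-node log excerpts (stand-ins for each node's /var/log/messages)
cat > node1.log <<'EOF'
Jan  9 10:03:31 node1 corosync[2986]: [QUORUM] Members[4]: 1 2 3 4
EOF
cat > node2.log <<'EOF'
Jan  9 10:03:29 node2 corosync[3011]: [TOTEM ] A processor failed, forming new configuration.
EOF

# Merge the files, ordering on the time-of-day field (field 3).
# Caveat: this simple sort only works within a single day and assumes the
# nodes' clocks are reasonably synchronized.
sort -k3,3 node1.log node2.log > timeline.log
cat timeline.log
```

Reading the merged timeline often makes the cause-and-effect ordering between nodes obvious in a way that reading each file separately does not.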
Recommended Diagnostic Procedures
Stop and start the cluster nodes: If a membership problem or unexpected state arises, stopping the cluster on the problem node or all nodes can help, so that it can be started again under close supervision.
- Procedure - stopping cluster nodes
- Procedure - starting individual cluster nodes
- Procedure - starting all cluster nodes together
Reboot the cluster: In some instances, the state of the cluster may be unrecoverable or difficult to manage and the situation may benefit from simply rebooting all nodes. This is often true if dlm, clvmd, or gfs2 is involved and a membership or quorum problem surfaces - since these components function in kernelspace and can become blocked waiting on membership issues to be resolved, stopping the cluster or taking action can become difficult or impossible because operations simply hang. Rebooting can get the cluster to a clean state where steps can be taken to troubleshoot or return to service.
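One quick way to spot this kind of kernel-space blockage is to look for processes stuck in uninterruptible sleep ("D" state), which often indicates dlm or gfs2 operations blocked on an unresolved membership issue. The sketch below filters a hypothetical sample of `ps -eo pid,state,comm` output; on a live node you would run that command directly.

```shell
# Hypothetical sample of `ps -eo pid,state,comm` output
cat > ps.txt <<'EOF'
  PID S COMMAND
 1201 S corosync
 1450 D gfs2_quotad
 1502 D dlm_controld
EOF

# Print the names of processes in uninterruptible sleep ("D" state) -
# persistent D-state dlm/gfs2 processes suggest a reboot may be needed
hung=$(awk '$2 == "D" {print $3}' ps.txt)
echo "$hung"
```

If such processes remain in D state across repeated checks, stopping the cluster cleanly is unlikely to succeed, and rebooting the affected nodes is usually the practical path forward.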
Examine membership state from each node: Use pcs status and corosync-quorumtool to determine what each node considers to be the current membership. This can help determine if nodes are communicating and aware of each other.
Examine quorum status from each node: When nodes are coming and going, or having disruptions in contact with each other, then the number of votes visible to any node or nodes may be less than required for the cluster to function. Problems with resources not starting, resources stopping unexpectedly, operations blocking, GFS2 filesystems being inaccessible, or the cluster otherwise failing to take the actions you would expect could all be signs that the members of a cluster may not have quorum. Use corosync-quorumtool and pcs status to determine if each node is quorate, and try to determine if that is influencing the behavior in focus.
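A minimal sketch of checking the quorum verdict from saved output follows. The heredoc mirrors the `Quorate:` line that `corosync-quorumtool -s` prints; on each live node you would capture real output with `corosync-quorumtool -s > quorum.txt` and compare the results across nodes.

```shell
# Hypothetical excerpt of `corosync-quorumtool -s` output
cat > quorum.txt <<'EOF'
Quorum information
------------------
Nodes:            4
Quorate:          Yes
EOF

# Extract the quorum verdict; a "No" here explains blocked operations,
# unstartable resources, and inaccessible GFS2 filesystems
quorate=$(awk '/^Quorate:/ {print $2}' quorum.txt)
echo "Quorate: $quorate"
```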
Analyze the log files: Dig more deeply into the log files to search for clues as to what went wrong or is going wrong. Again, it is best to start reviewing from a point in time when the state was known or understood - if the cluster was functional for some time and then stopped working, then start back at the last time the cluster was known to be stable, then work forward until something changes; if a problem is occurring when the cluster starts, then find that starting point and iterate forward until an issue is seen. Review the Recommended Diagnostic Procedures below for specifics on log files to review.
RHEL 7 High Availability clusters typically have logs split into both /var/log/messages (also accessible via journalctl) for information applicable to administrators and /var/log/cluster/corosync.log which is more aimed at developers or those performing diagnostics. Some clusters built prior to RHEL 7 Update 2 may alternatively have the latter in /var/log/pacemaker.log, and administrators may change any of these locations in /etc/corosync/corosync.conf and /etc/sysconfig/pacemaker, so a bit of hunting could be required in those cases.
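Because these locations are configurable, it is safer to read the configured log target than to assume the default. The sketch below pulls the `logfile:` setting out of a hypothetical logging stanza; on a real node you would run the same extraction against `/etc/corosync/corosync.conf`.

```shell
# Hypothetical logging stanza saved from a corosync.conf
cat > corosync-sample.conf <<'EOF'
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
EOF

# Print the configured corosync log file path (matching on the first field
# avoids also matching the "to_logfile:" line)
awk '$1 == "logfile:" {print $2}' corosync-sample.conf
```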
Look for signs of membership changes: A common event that triggers activity or problems within a cluster is a change in membership, which occurs when the nodes of a cluster can no longer communicate with each other or a new node joins. It may not be obvious that this is the cause of a problem, since a cluster membership change can result in a variety of behaviors that may not be obviously connected to that condition. When troubleshooting problems in a RHEL cluster, it is important to determine whether cluster communication is taking place properly, and whether the membership is stable or has been stable through the incident in focus. If an event or problem can be linked to membership changes, then it may be more easily reproducible, or an explanation may become more obvious once considered in light of that membership change.
The logs should give a clear indication from corosync when the membership changes, in the form of a "Members" line that lists the nodes that are known to this node by their IDs:
Jan 9 10:03:31 node1 corosync[2986]: [QUORUM] Members[4]: 1 2 3 4
This may be preceded by a report of "A new membership was formed", or, when a node stops responding unexpectedly, by "A processor failed, forming new configuration", either of which can indicate the nature of the change.
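When scanning logs for these events, the member count and ID list can be pulled out of each "Members" line directly, which makes it easy to compare membership across nodes and over time. A minimal sketch, using the example log line above:

```shell
# The example "Members" line from the log excerpt above
line='Jan  9 10:03:31 node1 corosync[2986]: [QUORUM] Members[4]: 1 2 3 4'

# Extract the list of member node IDs after "Members[N]: "
members=$(echo "$line" | sed 's/.*Members\[[0-9]*\]: //')
# Extract the member count N from "Members[N]"
count=$(echo "$line" | sed 's/.*Members\[\([0-9]*\)\].*/\1/')

echo "count=$count members=$members" > members.txt
cat members.txt
```

A drop in the count, or an ID disappearing from the list, marks the moment a node lost contact with the rest of the cluster.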
Membership changes may result in identifiable symptoms such as:
- A cluster node gets fenced
- The cluster cannot manage resources, which in turn prevents resources from starting
- Nodes are not quorate
- Cluster resources will not fail over or recover
- LVM commands hang if using clvmd with clustered volumes
- gfs2 operations hang
- Resources do not behave properly
- The pcsd web UI stops working or cannot complete an action. In this situation, one node may reflect a change while the other nodes do not.
- Unexpected behavior during cluster start
- Cluster status on each node does not show the other nodes online
- pacemaker or its daemons have stopped running
- pcs cannot display status or reports that the cluster is not running
Investigate communication issues:
- Determine if nodes are seeing each other on start, using pcs status and/or the corosync messages in /var/log/messages that show the "Members" list. If not, explore their network configuration to determine whether they should be able to reach each other; examine their firewall policies; and examine the routing between them.
- Ping the nodes of the cluster using their names or IPs exactly as they are recorded in the /etc/corosync/corosync.conf file - failure to ping these from each node could indicate a problem with their name mapping (which should be in /etc/hosts), with the routing between them, or with the network configuration on either node.
- Determine the transport protocol in use by the cluster. RHEL 7 defaults to udpu transport. The transport is configured in /etc/corosync/corosync.conf.
  - If udp transport: test multicast communications between nodes
- Check firewall policies on the hosts and on any external firewall. If using firewalld on the hosts, the high-availability service can open the standard ports for High Availability components.
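The checks above can be driven from the configuration file itself, so that connectivity is tested against exactly the transport and addresses corosync uses. The sketch below parses a hypothetical RHEL 7 style config; on a real node you would point the same commands at /etc/corosync/corosync.conf, and the trailing comment shows how a ping loop over the extracted addresses might look.

```shell
# Hypothetical RHEL 7 style corosync.conf (stand-in for /etc/corosync/corosync.conf)
cat > corosync.conf <<'EOF'
totem {
    version: 2
    transport: udpu
}
nodelist {
    node {
        ring0_addr: node1
    }
    node {
        ring0_addr: node2
    }
}
EOF

# Report the transport in use (udp means multicast must also be tested)
echo "transport: $(awk '/transport:/ {print $2}' corosync.conf)"

# List the ring addresses corosync will actually use; ping each of these
# from every node, e.g.:
#   for addr in $(awk '/ring0_addr:/ {print $2}' corosync.conf); do ping -c1 "$addr"; done
awk '/ring0_addr:/ {print $2}' corosync.conf
```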
Consider the possibility of transient failures: Frequently, after stopping and restarting cluster services or rebooting the nodes of a cluster, the cluster comes back up with no issues. Since the nodes can communicate at that point, signs would suggest there is not currently a network problem. This does not necessarily mean the original problem was not one of communication: a small network problem that results in a brief hiccup could cause a node to be fenced, and by the time the nodes come back up the hiccup has resolved.
There are many possible causes for a transient network failure. For example, if your site performs maintenance on a component of your system with a procedure that is designed to have the network down for less than a minute, that could cause your cluster nodes to be fenced if their timeout value is less than the system downtime. If your system starts up again with no issues, you should check what was occurring in the network at the time of your problem. You can check the default information that the system logs, or you may need to ask your site's network administrators if anything was transpiring at their end at the time your system developed problems.
If you cannot determine what caused your temporary outage, you can configure additional system diagnostics to catch the problem if it occurs again. In general, you should configure your system so that it is maximally supportable, incorporating monitoring devices as needed.
Look for signs of node fencing: If the logs or observations indicate a node in the cluster was fenced, you should look to see whether corosync logged anything on that node. If that node logged a membership change itself, that tells you that the node was still responsive when it was fenced, since the system was able to detect that a membership communication failure occurred. In this case your next step is to test your network or search for signs that there was a disruption - possibly in switch logs, or in the logs of other nodes or hosts on the network that may show similar signs.
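A simple starting point for this search is to grep the logs for fence-related messages. The sample lines below are hypothetical but follow the general style of pacemaker/stonith logging; on a live node you would run the same search against /var/log/messages or `journalctl` output.

```shell
# Hypothetical log excerpt containing fence-related messages
cat > messages.txt <<'EOF'
Jan  9 10:03:35 node1 stonith-ng[2990]: notice: Requesting peer fencing (reboot) of node4
Jan  9 10:03:40 node1 stonith-ng[2990]: notice: Operation reboot of node4 by node1: OK
EOF

# Case-insensitive search for fencing/stonith activity
grep -iE 'fenc|stonith' messages.txt
```

Correlating the timestamps of these messages with the membership changes found earlier helps establish whether the fence was a reaction to a communication failure.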