A node was fenced after "A processor failed, forming new configuration" in a High Availability cluster with Pacemaker
Environment
- Red Hat Enterprise Linux (RHEL) 6, 7, 8 or 9 with the High Availability Add On
- Pacemaker
Issue
- What does the message "A processor failed, forming new configuration." from corosync mean on a RHEL 6/7/8/9 cluster node?
May 7 18:37:11 node2 corosync[12718]: [TOTEM ] A processor failed, forming new configuration.
May 7 18:37:23 node2 corosync[12718]: [QUORUM] Members[2]: 2 3
May 7 18:37:23 node2 corosync[12718]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
- Why was a cluster node fenced after "a processor failed" was reported by other nodes?
- How can I troubleshoot what caused a node to fail in my cluster?
- Why was my cluster node fenced?
- Cluster node communication problem.
- Root Cause of cluster communication issue.
- Cluster node fenced need RCA.
- My cluster experienced a token loss and fencing
Resolution
This message can be caused by a number of different factors, and the resolution depends on the specific cause. See the following sections for more information.
Root Cause
The message "A processor failed, forming new configuration" from corosync means that the ordering, reliability, and flow control token that the cluster passes from node to node was not received within the expected timeout. In other words, a node failed to send the messages that are expected of a cluster node, and after the configured timeout period (which defaults to 10 seconds in RHEL 6, 1 second in RHEL 7, and 3 seconds on RHEL 8 or later systems with corosync-3.1.0-1.el8 or higher), this failure was recognized by the nodes where this "processor failed" message is logged. Therefore the cluster went into a state where it sent out messages to discover what nodes were alive, formed a new configuration (membership) consisting of those nodes, and passed the new membership list to other daemons that interact with members via the cluster stack.
If the member discovery that follows this "processor failure" indicates one or more previous members are no longer responding, the remaining nodes should take action to fence those that have left, assuming the remaining members are quorate. In summary, a processor failure means that expected communications from one or more nodes were not received; the cluster went through a process to decide which nodes remain, and this may result in a node being fenced because it is no longer considered a member and must be disconnected from any shared resources before other nodes are able to recover them.
This can happen for any number of different reasons: the node kernel panicked, the network failed, high load caused the system to become unresponsive, a power failure occurred, someone or something rebooted or powered off the server, etc.
Diagnostic Steps
- On RHEL 7 hosts, determine whether the corosync process is running with the SCHED_RR realtime scheduling policy or with a normal (non-realtime) policy; the latter can lead to frequent processor failures and fencing incidents when the system is under load.
- Review /var/log/messages on the node in question to determine if there are any indications of why the node failed to communicate, such as network driver errors, kernel panics, etc.
  - By comparing the messages on the node that failed (the one that was fenced) against those on the rest of the cluster, some conclusions can be drawn. For example:
    - If the failed node recognizes a processor failure or token loss at the same moment as the rest of the cluster, that generally rules out the possibility that it was unresponsive, and strongly indicates a network problem.
    - If the failed node does not log anything from a cluster/corosync perspective during the timeframe when the rest of the cluster is recognizing a processor failure, it may indicate that the node was unresponsive from a processing standpoint in some way (high load, kernel panic, etc.).
    - If corosync doesn't report anything during the timeframe when the node is being removed from the cluster, but other processes on that node are actively logging, it often indicates CPU or resource starvation that leaves some processes slow or unable to execute while others still get cycles. Following the resource-utilization steps below often helps in these scenarios.
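The scheduling policy of a running process can be inspected with chrt from util-linux. The following is a minimal sketch; on a host where corosync is not running, it falls back to the current shell's PID purely to illustrate the output format:

```shell
# Inspect the scheduling policy of corosync. Falls back to the current
# shell's PID on a host where corosync is not running, purely to show
# what the output looks like.
pid=$(pgrep -x corosync 2>/dev/null | head -1)
[ -n "$pid" ] || pid=$$
chrt -p "$pid"
```

On a healthy RHEL 7 cluster node, corosync is expected to report SCHED_RR; a policy of SCHED_OTHER means corosync competes for CPU like an ordinary process and can miss token deadlines when the system is loaded.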
- Review the /var/log/sa/sar* files from the date/time in question (if sysstat is installed and its cron jobs are set to run) and determine if there were any spikes in resource utilization. Such spikes may indicate a rise in load to the point where the node could not process network traffic or run the processes that needed to execute.
  - If this data does not provide enough detail about what was happening at the time, consider implementing additional resource monitoring that is useful for diagnosing cluster-related problems.
- Test multicast functionality to determine if there are any signs of communication issues. It is best to let this test run for at least 10-15 minutes, as many issues only arise after several minutes have passed, giving routing tables a chance to be refreshed or aged out.
- Kdump
  - If kdump is configured, check for the presence of core files. Note that in a cluster, without additional configuration, core dumps often do not have time to complete before the node is fenced.
  - If kdump is not set up, configure it on all nodes. The same caveat applies: without additional configuration, core dumps often do not have time to complete before the node is fenced.
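As a quick way to act on the kdump points above, the following sketch checks for captured vmcores, assuming the RHEL default dump target of /var/crash (adjust if your kdump.conf specifies a different path):

```shell
# Look for captured crash dumps (assumes the RHEL default dump
# target of /var/crash).
if [ -d /var/crash ]; then
    echo "contents of /var/crash:"
    ls -lR /var/crash
else
    echo "/var/crash not found - kdump may not be configured on this host"
fi
# Check the kdump service state on systemd-based releases (RHEL 7+).
systemctl is-active kdump 2>/dev/null || echo "kdump service is not active"
```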
- Implement additional resource monitoring by scripting commands such as top, iostat, vmstat, mpstat, ethtool -S, netstat -s, or other monitoring commands, to watch for resource-utilization issues if the problem occurs again. These commands can be scripted with their output sent to files, so that the data can be collected when an issue reoccurs. It may be beneficial to script them in such a way that only a certain amount of data is kept, to avoid using excessive amounts of disk space.
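The monitoring described above can be sketched as a small capture script. The output directory, file naming, and 60-snapshot retention below are arbitrary illustrative choices, and the command list can be extended with iostat, mpstat, ethtool -S, and so on:

```shell
#!/bin/sh
# Periodic resource snapshot with simple retention (illustrative sketch).
OUTDIR=${OUTDIR:-/var/tmp/cluster-monitor}   # arbitrary output location
KEEP=60                                      # arbitrary retention count
mkdir -p "$OUTDIR"

capture() {
    ts=$(date +%Y%m%d-%H%M%S)
    {
        echo "=== loadavg ==="; cat /proc/loadavg
        echo "=== vmstat ===";  vmstat 1 2 2>/dev/null || echo "vmstat unavailable"
        echo "=== net stats ==="; netstat -s 2>/dev/null || ss -s 2>/dev/null || echo "netstat/ss unavailable"
    } > "$OUTDIR/capture-$ts.log"
    # Keep only the newest $KEEP snapshots to bound disk usage.
    ls -1t "$OUTDIR"/capture-*.log 2>/dev/null | tail -n +"$((KEEP + 1))" | xargs -r rm -f
}

capture   # in practice, run this from cron or in a loop with sleep
```

Running a capture once a minute from cron keeps roughly an hour of history with the settings above; widen KEEP or the interval to cover longer windows.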
Enable the debug options and look for the message:
Feb 13 09:17:27 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
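Debug logging is controlled in the logging section of corosync.conf; a fragment such as the following enables it (the logfile path shown is a common default, so adjust it to match your configuration, and revert debug to off after collecting data, as it is verbose):

```
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    debug: on
}
```

Apply the change to all nodes and restart or reload corosync for it to take effect.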
- If the issue recurs frequently and other diagnostic methods are not providing enough data to accurately pinpoint the problem, consider lengthening the time before a missing node is fenced, for the purpose of allowing it to possibly recover and log information about what it was doing during the window in which it was unresponsive. This could give resource-monitoring scripts enough time to execute and capture details about what was starving the cluster stack of resources, allow a kernel core dump to complete, or let the node become free from whatever was blocking it and log an error in the messages log indicating what was happening.
  - This could be accomplished by increasing post_fail_delay, by implementing a fence delay, or by increasing the totem token setting.
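As an illustrative example of the last option (the value is arbitrary, in milliseconds), the token timeout lives in the totem section of corosync.conf:

```
totem {
    token: 10000    # example value only, in milliseconds
}
```

On RHEL 7, after editing /etc/corosync/corosync.conf, pcs cluster sync can propagate the file to the other nodes; on recent RHEL 8/9 releases, pcs cluster config update can make the change directly. Depending on the release, a restart of the cluster services may be required for the new timeout to take effect.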
- Contact Red Hat Global Support Services for additional assistance in diagnosing these issues.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.