What can I do to determine what caused the token to be lost and/or a node to be fenced in my RHEL 5 High Availability cluster?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5 with the High Availability Add On

Issue

  • Why was my node fenced?
  • One of the cluster nodes rebooted after a totem token loss; what caused the token loss?
  • A node was fenced when a token was lost on the surviving node.
  Apr 19 17:29:26 node1 openais[4997]: [TOTEM] The token was lost in the OPERATIONAL state.
  • How can I determine what caused my cluster to report a token was lost, and subsequently fence a cluster node?
  • How do you recreate token-loss events in a test environment?
  • We need instructions for making a token-loss event occur.
  • What are the different ways to trigger token-loss events in a cluster environment?

Root Cause

  • The totem token can be thought of as the cluster's heartbeat, and the token timeout setting is the amount of time the cluster will wait for that message before declaring it lost. If a node cannot exchange the totem token with the rest of the cluster within the token timeout period, the surviving nodes will fence the failed node out of the cluster. The fence operation reboots the failed node so that it can rejoin the cluster in a clean state.

  • A token loss is a fairly generic event that indicates connectivity between two or more nodes was disrupted or a node stopped responding. openais and cman use a unicast token that is passed between nodes to determine which are alive and participating in the cluster. When a node crashes, loses its network connectivity, is overcome with high load, or has any number of other problems that disrupt its ability to communicate, it can lead to a token loss, which can cause it to be fenced.

  • Any activity that prevents the token from arriving on time at a cluster node can cause or trigger a token loss. Below is a list of some of the most common reasons tokens are lost (not an exhaustive list):

    • Network failure, e.g. a pulled cable
    • High CPU load
    • Kernel panic
    • Heavy network congestion
    • Memory pressure
    • An iptables rule blocking cluster communication traffic
    • Sudden time jumps on a cluster node (RHEL 5)
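
The questions above about recreating a token loss can be answered with a controlled network block on a disposable test cluster. Below is a sketch, assuming the default openais/cman totem port of 5405/udp (verify yours before running); the blocked node will be fenced once the token timeout expires, so never run this on production. By default the script only prints the commands; set APPLY=1 and run it as root on the test node to execute them.

```shell
# Sketch: trigger a token-loss event on a TEST cluster node by blocking
# totem traffic with iptables. 5405/udp is the default totem port for
# openais/cman; the node will be fenced once the token timeout expires.
PORT=${PORT:-5405}

# Dry-run by default: print each command instead of executing it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Block inbound and outbound totem traffic on this node.
run iptables -I INPUT  -p udp --dport "$PORT" -j DROP
run iptables -I OUTPUT -p udp --dport "$PORT" -j DROP

# The surviving nodes should soon log:
#   openais[...]: [TOTEM] The token was lost in the OPERATIONAL state.
# and then fence this node. To undo the block (if fencing has not fired yet):
run iptables -D INPUT  -p udp --dport "$PORT" -j DROP
run iptables -D OUTPUT -p udp --dport "$PORT" -j DROP
```

Pulling a network cable or adding artificial CPU/memory load achieves the same result; the iptables approach is simply the most easily reversible.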

Diagnostic Steps

  • Review /var/log/messages on the node in question to determine if there are any indications of why the node failed to communicate, such as network driver errors, kernel panics, etc.

    • By comparing the messages on the node that failed (the one that was fenced) versus the rest of the cluster, some conclusions can be drawn. For example:
      • If the failed node recognizes a processor failure or token loss at the same moment as the rest of the cluster, it generally rules out the possibility that it was unresponsive, and strongly indicates a network problem.
      • If the failed node does not log anything from a cluster/openais perspective during the timeframe when the rest of the cluster is recognizing a processor failure, it may indicate that node was unresponsive from a processing standpoint in some way (high load, kernel panic, etc).
      • If openais doesn't report anything during the timeframe when the node is being removed from the cluster, but other processes on that node are actively logging, it often indicates CPU or resource starvation in which some processes are slow or unable to execute while others still get cycles. Following the steps below to analyze resource utilization often helps in these scenarios.
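
To make that comparison easier, it helps to extract just the cluster-stack messages from each node's log for the incident window. A minimal sketch (the timestamp pattern matches the 17:29 example earlier; adjust the window and log path for your own incident):

```shell
# Pull openais/fencing messages from the incident window on each node,
# then compare the outputs from all nodes side by side.
LOG=${LOG:-/var/log/messages}
WINDOW=${WINDOW:-'Apr 19 17:2'}   # hypothetical incident window; adjust

grep "$WINDOW" "$LOG" 2>/dev/null \
    | grep -E 'openais|TOTEM|CLM|CPG|fenced|fence' \
    || echo "no matching cluster messages in $LOG for window '$WINDOW'"
```
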
  • Review /var/log/sa/sar* files from the date and time in question (if sysstat is installed and the sar cron job is set to run) to determine whether there were any spikes in resource utilization. A spike may indicate load rising to the point where the node could not process network traffic or schedule the processes that needed to run.
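
As a sketch of that review (the file name sa19 and the times come from the example log line earlier; substitute your own incident's date and window):

```shell
# Pull CPU, run-queue, and memory history from the sysstat data file for
# the day of the incident, limited to the minutes around the token loss.
SA=${SA:-/var/log/sa/sa19}        # hypothetical date; adjust to yours
if [ -r "$SA" ]; then
    sar -u -f "$SA" -s 17:20:00 -e 17:35:00   # CPU utilization
    sar -q -f "$SA" -s 17:20:00 -e 17:35:00   # run queue / load average
    sar -r -f "$SA" -s 17:20:00 -e 17:35:00   # memory and swap
else
    echo "no sar data file at $SA"
fi
```
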

  • Test multicast functionality to determine whether there are any signs of communication issues. Let the test run for at least 10-15 minutes, as many issues only appear after several minutes have passed, once multicast routing tables have had a chance to be refreshed or aged out.
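
One way to run such a test is with the omping utility (shipped with later releases; it may need to be installed separately on RHEL 5). This sketch assumes hypothetical host names node1, node2, and node3; run the identical command on every cluster node at the same time:

```shell
# Multicast/unicast connectivity test between cluster nodes. Run the same
# command on ALL nodes simultaneously; ~900 one-second probes covers the
# 10-15 minute window recommended above. node1..node3 are placeholders
# for your cluster's host names.
if command -v omping >/dev/null 2>&1; then
    omping -c 900 -i 1 node1 node2 node3
else
    echo "omping is not installed; install it or use another multicast test"
fi
```

Look for lost or delayed multicast responses from any node; consistent loss on one node points at its NIC, cabling, or switch port, while loss on all nodes points at the switch or IGMP snooping configuration.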

  • Kdump

    • If kdump is configured, check for the presence of core files. Note that in a cluster, without additional configuration a core dump may not have time to complete before the node is fenced.
    • If kdump is not set up, configure it on all nodes; the same caveat about fencing interrupting the dump applies.
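
As a sketch of what that setup involves on RHEL 5 (the reserved memory size below is only an example; consult the kdump documentation for sizing guidance):

```
# /boot/grub/grub.conf -- append to the existing kernel line (example size):
kernel /vmlinuz-... ro root=... crashkernel=128M@16M

# Then enable the service and reboot so the reservation takes effect:
#   chkconfig kdump on && reboot
# Verify afterward with:
#   grep crashkernel /proc/cmdline
```
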
  • Implement additional resource monitoring by scripting commands such as top, iostat, vmstat, mpstat, ethtool -S, and netstat -s to watch resource utilization in case the issue occurs again. Send the output to files so it can be collected after a recurrence, and consider keeping only a limited amount of data so the monitoring does not consume excessive disk space.
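
A minimal sketch of such a script (the log directory, retention limit, and command list are illustrative; schedule it from cron, for example once a minute, and add whichever commands are relevant to your environment):

```shell
# One monitoring sample per invocation: timestamps each command's output
# into its own log, then trims the log so only the newest lines are kept
# and the disk cannot fill up. Schedule from cron, e.g.:
#   * * * * * /usr/local/bin/clustermon.sh
LOGDIR=${LOGDIR:-/tmp/clustermon}   # hypothetical location; adjust
KEEP=2000                           # max lines retained per log file
mkdir -p "$LOGDIR"

sample() {
    file="$LOGDIR/$1"; shift
    # Append a timestamped sample, capturing stderr too.
    { date '+%b %d %H:%M:%S'; "$@"; echo; } >> "$file" 2>&1
    # Trim to the newest $KEEP lines.
    tail -n "$KEEP" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

sample loadavg.log cat /proc/loadavg
sample meminfo.log head -n 6 /proc/meminfo
sample vmstat.log  vmstat 1 2
sample netstat.log netstat -s
```
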

  • Look for the message:

Feb 13 09:17:27 openais [TOTEM ] The token was lost in the OPERATIONAL state.
  • If the issue recurs frequently and other diagnostic methods are not providing enough data to pinpoint the problem accurately, consider lengthening the time before a missing node is fenced, to give it a chance to recover and log information about what it was doing while it was unresponsive. The extra window may allow resource-monitoring scripts to finish capturing what was starving the cluster stack of resources, allow a kernel core dump to complete, or let the node break free of whatever was blocking it and log a message indicating what happened.
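
On RHEL 5 the token timeout can be raised in /etc/cluster/cluster.conf. A sketch, assuming a 30-second timeout is desired (the cluster name and version number below are placeholders; the value is in milliseconds, config_version must be incremented, and the updated file must be propagated to all nodes):

```xml
<!-- /etc/cluster/cluster.conf (fragment): raise the totem token timeout
     so a stalled node has longer to recover before being fenced. -->
<cluster name="mycluster" config_version="42">
    <totem token="30000"/>
    <!-- ... remaining cluster configuration unchanged ... -->
</cluster>
```

Note that a longer token timeout also delays legitimate failover, so treat this as a diagnostic measure rather than a permanent setting.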

  • Contact Red Hat Global Support Services for additional assistance in diagnosing these issues.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.