A Red Hat High Availability cluster forms a new full membership immediately after token loss without fencing a cluster node

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)

Issue

  • corosync reports token loss for a node, but a new membership then forms with that node still present, and fencing is not attempted.
  • A pacemaker cluster can recover from extremely short periods of token loss without fencing a node.

Resolution

This is intended behavior. A cluster can experience token loss without an effective membership change or fencing if the failed node recovers before the corosync consensus timeout expires. This can happen due to a brief network issue or due to a brief system hang on the failed node, among other causes.

For example, assume that node1 reports a processor failure for node2. That is, node1 has not received a corosync totem token message from node2 within the token timeout.

However, a new membership does not form until the consensus timeout expires, and fencing in response to token loss occurs only after a new membership forms. So if node1 receives a token message from node2 after the token timeout but before the consensus timeout expires, node2 is included in the new membership, and pacemaker does not initiate fencing against node2.
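Both timeouts live in the totem section of corosync.conf. The fragment below is illustrative only (the cluster name and values are examples, not recommendations; on RHEL these settings are normally managed through pcs rather than edited by hand):

```
totem {
    version: 2
    cluster_name: mycluster
    # Time in ms without a token before a processor failure is declared
    token: 1000
    # Time in ms to wait for consensus before forming a new membership.
    # If unset, corosync derives it from token (1.2 * token by default).
    consensus: 1200
}
```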

Root Cause

When a node's token hasn't been received within the corosync totem.token timeout, other nodes in the cluster declare a processor failure for that node. Each node then sends out a join message.

The totem.consensus timeout specifies how long to wait for consensus to be achieved before forming a new corosync membership. If an unresponsive node becomes responsive and is able to process the join message before the consensus timeout expires, then it can participate in establishing the new membership. If the new membership forms with the failed node included, the cluster will recover from the reported token loss without losing any nodes. A failed node that is part of the new membership is not considered to have left the cluster, and it does not need to be fenced.
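The timing relationship described above can be sketched as a small decision function. This is a simplified model for illustration only, not Red Hat or corosync code: it assumes the consensus timer starts when the processor failure is declared, and `hang_ms` is a hypothetical measurement of how long the node stayed unresponsive.

```python
def outcome(token_ms: int, consensus_ms: int, hang_ms: int) -> str:
    """Approximate the cluster's reaction to a node that stops
    responding for hang_ms milliseconds (simplified timing model).

    token_ms     - totem.token: silence before a processor failure
    consensus_ms - totem.consensus: window to agree on new membership
    """
    if hang_ms <= token_ms:
        # Token arrives in time: no processor failure is ever declared.
        return "no membership change"
    if hang_ms <= token_ms + consensus_ms:
        # Node recovers during the consensus window and processes the
        # join message, so it is part of the new membership.
        return "rejoins new membership, no fencing"
    # Consensus is reached without the node; it is excluded and fenced.
    return "excluded from membership, fenced"
```

For example, with a 1000 ms token timeout and a 1200 ms consensus timeout, a node that hangs for roughly 2 seconds (as in the log excerpt below) still falls inside the consensus window and is not fenced.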

For more information on the consensus timeout, see the following solution: How do I configure the consensus timeout in a Red Hat High Availability cluster?

Diagnostic Steps

  • An example scenario is shown below. Note that this is a two-node cluster and the new membership contains both nodes, even though one of them just underwent token loss.
# Node 1
Feb 22 13:28:59 node-1 corosync[1234]: [MAIN  ] Corosync main process was not scheduled for 1998.3500 ms (threshold is 800.0000 ms). Consider token timeout increase.
Feb 22 13:28:59 node-1 corosync[1234]: [TOTEM ] A processor failed, forming new configuration.
Feb 22 13:28:59 node-1 corosync[1234]: [TOTEM ] A new membership (10.10.10.10:1111) was formed. Members
Feb 22 13:28:59 node-1 corosync[1234]: [QUORUM] Members[2]: 1 2
Feb 22 13:28:59 node-1 corosync[1234]: [MAIN  ] Completed service synchronization, ready to provide service.

# Node 2
Feb 22 13:28:58 node-2 corosync[1230]: [TOTEM ] A processor failed, forming new configuration.
Feb 22 13:28:59 node-2 corosync[1230]: [TOTEM ] A new membership (10.10.10.193:1111) was formed. Members
Feb 22 13:28:59 node-2 corosync[1230]: [QUORUM] Members[2]: 1 2
Feb 22 13:28:59 node-2 corosync[1230]: [MAIN  ] Completed service synchronization, ready to provide service.
