What does the log message "entering GATHER state" mean in the Red Hat High Availability Add-on?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux Server (RHEL) 5 with High Availability or Resilient Storage Add-on
  • Red Hat Enterprise Linux Server (RHEL) 6 with High Availability or Resilient Storage Add-on

Issue

  • In the event of a cluster membership change, the cluster enters into a GATHER state. The logs will report messages similar to the following:
    Dec  7 06:30:08 hostX openais[5555]: [TOTEM] entering GATHER state from 9. 
    Dec  7 06:30:10 hostX openais[5555]: [TOTEM] entering GATHER state from 0.
  • What do these messages mean in a Red Hat High Availability cluster?

Resolution

When nodes in a cluster enter the GATHER state, they send join messages to the rest of the cluster in order to form a consensus about the cluster membership. The number after "from" in the log message gives the reason the GATHER state was entered, and can be interpreted as follows:

0: Consensus timeout expired
   The consensus timer expired. This timer is set on entry to GATHER state and is reset when COMMIT state is entered.
   It means the nodes took too long to agree on the membership list.

2: Token timeout in OPERATIONAL (normal) state

3: Token timeout in GATHER state

4: Token timeout in COMMIT state

5: Token timeout in RECOVERY state
  
NOTE:  Reasons 2 through 5 are all related. The token timer is set when the token is transmitted, and if
       it expires before another message is received it triggers one of these messages, depending on the
       state of the protocol at the time.

6: Token failed to receive (ARU count > fail_to_recv_const)
   We failed to receive a copy of our own token.
   This will always be accompanied by a "FAILED TO RECEIVE" message.

7: mcast (data) message received from unknown node while in OPERATIONAL state

8: mcast (data) message received from unknown node while in GATHER state
   This can be caused by a brief network split in which a node is forced to leave
   the cluster but is not fenced before the network heals again.

9: Merge detection message received while OPERATIONAL
   When nodes are missing from the membership and there are no naturally-occurring multicast messages
   being sent, the messaging layer will send a periodic merge-detection message to see if any other 
   partitions are operating without being part of this configuration.  This usually just means there
   are nodes missing, but doesn't otherwise signify a problem.  

10: Merge detected in GATHER 
    As above but while the cluster was already in transition from another node joining or leaving.

11: JOIN received while OPERATIONAL

12: JOIN received while in GATHER

13: JOIN received while in COMMIT

14: JOIN received while in RECOVERY
    A JOIN message is sent by a node if GATHER times out, to bring
    a new node into the cluster. These logs indicate receipt of one of
    these messages in the state named in the reason (OPERATIONAL, GATHER,
    COMMIT, or RECOVERY).

15: Interface changed state
    Often seen at startup, but can also happen if an interface is taken down unexpectedly.
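
When reviewing logs, the reason codes listed above can be looked up automatically. The following Python sketch is illustrative only and is not a Red Hat tool: the names GATHER_REASONS and explain_gather are hypothetical, the descriptions are copied from the list above, and the pattern assumes the log line format shown in the Issue section.

```python
import re

# Reason codes as listed in this article. Code 1 is not described
# in the list, so it is deliberately left out.
GATHER_REASONS = {
    0: "Consensus timeout expired",
    2: "Token timeout in OPERATIONAL state",
    3: "Token timeout in GATHER state",
    4: "Token timeout in COMMIT state",
    5: "Token timeout in RECOVERY state",
    6: "Token failed to receive (ARU count > fail_to_recv_const)",
    7: "mcast (data) message received from unknown node while in OPERATIONAL state",
    8: "mcast (data) message received from unknown node while in GATHER state",
    9: "Merge detection message received while OPERATIONAL",
    10: "Merge detected in GATHER",
    11: "JOIN received while OPERATIONAL",
    12: "JOIN received while in GATHER",
    13: "JOIN received while in COMMIT",
    14: "JOIN received while in RECOVERY",
    15: "Interface changed state",
}

# Assumes the message text shown in the Issue section of this article.
GATHER_RE = re.compile(r"\[TOTEM\]\s+entering GATHER state from (\d+)")


def explain_gather(line):
    """Return (code, description) for a GATHER log line, or None if it does not match."""
    match = GATHER_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    description = GATHER_REASONS.get(code, "Unknown reason code (not listed in this article)")
    return code, description


if __name__ == "__main__":
    sample = "Dec  7 06:30:08 hostX openais[5555]: [TOTEM] entering GATHER state from 9."
    print(explain_gather(sample))
```

Lines that do not contain the GATHER message return None, so the function can be applied to every line of a log file without filtering first.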

Root Cause

  • The GATHER state message is normally caused by a network or communication issue within the cluster, but the GATHER state can be entered for a number of reasons. The number at the end of the message (from X) indicates why the node entered the GATHER state. In the merge-detection case, the state is entered from "message_handler_memb_merge_detect" when the cluster is attempting to see whether other nodes are out on the network.

  • GATHER state happens every time a node receives its own token back (meaning it is the only node in the ring). During this time, it starts a timer to form and agree on a membership list of nodes in the cluster. If this timer expires, the node enters the GATHER state to see if there is another node out there, and attempts to merge with it. After the node has received its own token back a certain number of times, it stops sending the token, at which point these state changes also stop. They are therefore a side effect of the earlier communication problem and subsequent fencing that left this node alone in the cluster.
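
To see whether a node is repeatedly entering GATHER for the same reason, as in the single-node case described above, the messages can be counted per host and reason code. This is a minimal Python sketch under stated assumptions: the function name tally_gather_reasons is hypothetical, and the pattern assumes the syslog layout shown in the Issue section (month, day, time, host, daemon[pid]).

```python
import re
from collections import Counter

# Assumes the syslog layout from the Issue section:
# "<month> <day> <time> <host> <daemon>[pid]: [TOTEM] entering GATHER state from <n>."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+\S+:\s+\[TOTEM\]\s+"
    r"entering GATHER state from (?P<code>\d+)"
)


def tally_gather_reasons(lines):
    """Count GATHER entries keyed by (host, reason code); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("host"), int(match.group("code")))] += 1
    return counts


if __name__ == "__main__":
    sample_lines = [
        "Dec  7 06:30:08 hostX openais[5555]: [TOTEM] entering GATHER state from 9.",
        "Dec  7 06:30:10 hostX openais[5555]: [TOTEM] entering GATHER state from 0.",
    ]
    for (host, code), count in sorted(tally_gather_reasons(sample_lines).items()):
        print(host, code, count)
```

The function accepts any iterable of lines, so an open log file object can be passed directly. A long run of the same reason code on one host, with no other members reported, would be consistent with the node being alone in the ring.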


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.