How does the communication between the cluster nodes work on a Red Hat Enterprise High Availability Cluster?

Introduction


This article describes how the underlying protocol called `Totem` provides group messaging for Red Hat High Availability. Throughout this article, the word `processor` means a cluster node.

Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, and 7 with the High Availability Add On
  • corosync for RHEL 6 or 7
  • openais for RHEL 5

What is the Totem Protocol?

Totem is provided by openais in RHEL 5 and by corosync in RHEL 6 and 7. It is the underlying protocol that provides group messaging for Red Hat High Availability, and it is based on mathematically proven algorithms. Group messaging means that a message originated (sent) from one node in a cluster is delivered (received) by all nodes. Red Hat High Availability uses a form of messaging called closed process groups (CPG). A CPG essentially serves as a subscription to a channel.

Totem provides a specific guarantee about messages called agreed ordering. Agreed ordering means that all processors agree on the order of messages as they are delivered to the nodes. Agreement on order occurs at message delivery, not at message origination (sending). Note that message delivery means handing the message to the Red Hat High Availability infrastructure, while message reception means Linux handing the message to the Totem software. Agreement also applies to the delivery of configuration changes (membership information), which in the context of Totem identify information about the processors (cluster nodes).
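As a minimal sketch of the idea (hypothetical Python, not the corosync implementation), agreed ordering can be modeled by giving every message a sequence number at origination and having each node deliver strictly in sequence order, so every node that delivers a message delivers it at the same position:

```python
# Simplified model of agreed ordering (illustrative only): messages carry
# a sequence number, and each node delivers them strictly in sequence
# order, so all nodes agree on the order of delivered messages even when
# reception order differs.

class Node:
    def __init__(self, name):
        self.name = name
        self.next_seq = 1      # next sequence number expected for delivery
        self.pending = {}      # received (by Linux) but not yet deliverable
        self.delivered = []    # messages delivered to the HA infrastructure

    def receive(self, seq, payload):
        """Message reception: Linux hands the datagram to Totem."""
        self.pending[seq] = payload
        # Message delivery: hand over everything deliverable, in order.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

nodes = [Node("node1"), Node("node2")]
# node2 receives message 2 before message 1, but delivery order is agreed.
nodes[0].receive(1, "A"); nodes[0].receive(2, "B")
nodes[1].receive(2, "B"); nodes[1].receive(1, "A")
assert nodes[0].delivered == nodes[1].delivered == ["A", "B"]
```

The key point the sketch shows is that agreement is enforced at delivery time: reception order may vary per node, delivery order may not.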

Totem provides another guarantee called virtual synchrony. Virtual synchrony provides a fault detection mechanism to recover lost messages when faults occur. As an example, if there are three nodes, and one node is delivered messages A and B while the remaining nodes are delivered only A, the nodes will make different decisions and recover in different ways. Low-level cluster infrastructure software will behave incorrectly if messages are delivered to some nodes but not others, and this is complicated by the fact that a node can fail. At the end of a configuration (membership), before a new configuration forms, all messages that can be recovered are recovered under the virtual synchrony guarantees.

Red Hat High Availability further uses an abstraction called a Closed Process Group (CPG). It is called a process group because membership information, including the processor identifier and the process (Linux PID) identifier, is sent to the processors during configuration changes. It is called closed because the internal infrastructure software must subscribe to the group via a named identifier (typically a character string).
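A rough model of the abstraction (hypothetical Python; the group name, node names, and PIDs below are made up, and this is not the libcpg API) looks like this: processes join a group by name, and a multicast to the group is delivered only to the (node, PID) pairs that subscribed:

```python
# Simplified model of a closed process group (illustrative only): software
# joins a group by a string name, and a message multicast to the group is
# delivered only to the (node, pid) members that subscribed to that name.

from collections import defaultdict

class CpgBus:
    def __init__(self):
        self.groups = defaultdict(list)   # group name -> [(node, pid, inbox)]

    def join(self, group, node, pid):
        """Subscribe a (node, pid) member; returns its delivery inbox."""
        inbox = []
        self.groups[group].append((node, pid, inbox))
        return inbox

    def mcast(self, group, msg):
        """Deliver msg to every member of the named group, and only them."""
        for node, pid, inbox in self.groups[group]:
            inbox.append(msg)

bus = CpgBus()
a = bus.join("example-group", "node1", 1234)
b = bus.join("example-group", "node2", 5678)
other = bus.join("other-group", "node1", 4321)
bus.mcast("example-group", "hello")
assert a == b == ["hello"] and other == []   # closed: non-members get nothing
```

The "closed" property is the last assertion: a process that did not subscribe to the named group never sees its messages.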

How does the Totem Protocol work?


The `Totem` protocol has four states:
  • GATHER
  • COMMIT
  • RECOVERY
  • OPERATIONAL

When a single node starts, Totem starts the membership protocol. The membership protocol sends a message that includes the lists of processors that are good and that are bad. Since there is only one processor (cluster node) known to the software and it is good, it goes into the good list of the membership protocol. The processor starts a timer called the gather timer and enters the GATHER state. When this timer expires, the existing membership list is checked for agreement by all nodes. Since there is only one node, it agrees as soon as the gather timer expires, which results in a transition to the COMMIT state.
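The agreement check at the end of GATHER can be sketched like this (hypothetical Python; the function and field names are made up for illustration):

```python
# Simplified model of the GATHER agreement check (illustrative only):
# each membership message carries the sender's view of good and bad
# processors, and GATHER succeeds when every collected view is identical.

def gather_agrees(membership_msgs):
    """membership_msgs: {processor: (good_set, bad_set)} collected before
    the gather timer expired. Agreement holds when all views match."""
    views = {(frozenset(good), frozenset(bad))
             for good, bad in membership_msgs.values()}
    return len(views) == 1

# A single starting node knows only itself, and it is good, so agreement
# is immediate when the gather timer expires: transition to COMMIT.
single = {"node1": ({"node1"}, set())}
assert gather_agrees(single)

# Two nodes with different views of a third do not agree.
split = {"node1": ({"node1", "node2", "node3"}, set()),
         "node2": ({"node1", "node2"}, set())}
assert not gather_agrees(split)
```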

The COMMIT state is used to commit to a specific membership. During the COMMIT state, a COMMIT token is originated from the node with the lowest IP address. Since there is only one node, it originates the token, which is immediately delivered back to that same node. The COMMIT token must rotate twice to gather enough information to validate the membership. If any failure occurs during the COMMIT phase, the protocol enters the GATHER phase again and tries to form a new membership. To detect failures during the COMMIT phase, a token timer is started. If the token timer expires before a token is delivered to the node, the processor enters the GATHER state.
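The token rotation described above can be sketched as follows (hypothetical Python, illustrative only; real Totem carries membership data on the token, which is omitted here):

```python
# Simplified model of the COMMIT phase (illustrative only): the node with
# the lowest IP address originates the commit token, which must complete
# two full rotations around the proposed membership before the membership
# is considered valid.

def commit_rotations(members):
    """members: list of node IPs in the proposed membership.
    Returns the ordered list of (rotation, node) token deliveries."""
    ring = sorted(members)            # token originates at the lowest IP
    deliveries = []
    for rotation in (1, 2):           # the token must rotate twice
        for node in ring:
            deliveries.append((rotation, node))
    return deliveries

# With a single node, the token is originated and immediately delivered
# back to the same node on both rotations.
assert commit_rotations(["192.168.1.10"]) == [
    (1, "192.168.1.10"), (2, "192.168.1.10")]
```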

The RECOVERY state is used to provide the virtual synchrony guarantee of recovering messages at the end of the configuration. During RECOVERY, a token is originated to store recovery messages, and a token timer is started to detect failures of the RECOVERY state. If no token is received during the RECOVERY phase, the processor enters the COMMIT state, updating its list of recovered messages. During recovery, every processor puts its missing messages in the token transmitted between nodes, and the nodes that have a copy of a message retransmit it. Once all messages that can be retransmitted have been retransmitted, the processor finishes RECOVERY. Note that there could still be missing messages at the end of the configuration because a processor failed completely and no remaining processor has a copy of the message. This is called a hole and is correct operation; holes can only occur for processors that have left the configuration. All processors transitioning from one configuration to the next maintain the agreed order of messages and membership.

Once RECOVERY is finished, the processor enters the OPERATIONAL state. In the OPERATIONAL state, a token is originated from the lowest IP address and is immediately delivered to itself. Before originating the token, a token timer is started. If a token is not received before the timer expires, the processor enters the GATHER state. In the case of low message origination rates, the token is slowed down so as not to consume too much CPU time.
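On RHEL 6 and 7, the token timer and related membership timeouts are tunable in the `totem` section of `/etc/corosync/corosync.conf`. The fragment below shows the relevant directives; the values are illustrative defaults, not tuning recommendations:

```
totem {
    version: 2
    # Time (in ms) to wait for a token before declaring token loss
    # and entering the GATHER state.
    token: 1000
    # Time (in ms) to wait for join (membership) messages during GATHER.
    join: 50
    # Time (in ms) to wait for consensus to be achieved before starting
    # a new round of membership configuration.
    consensus: 1200
}
```

Raising `token` makes the cluster more tolerant of transient network delay at the cost of slower failure detection; see the `corosync.conf(5)` man page for the full set of `totem` directives.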

The token contains a sequence number. When a processor wishes to originate a message, it passes the message to Totem, which forms a UDP message. The UDP message is then either multicast, or in the case of the udpu transport, unicast via UDP/IP to every node in the cluster. The sequence number is used to provide agreed ordering of messages and serves as an identifier for recovery. On receipt of a token, the processor checks its list of received messages against the sequence number in the token; if it detects missing messages, they are added to a retransmission list. Also on receipt of the token, any messages in the retransmission list that the processor has a copy of are retransmitted. Totem will not free or garbage-collect messages until they have been delivered to all nodes. This guarantee is provided by the ARU (all received up to) processing logic to ensure virtual synchrony.
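The ARU logic can be sketched as follows (hypothetical Python; the function names are made up for illustration): each node computes the highest sequence number up to which it has received everything, the ring-wide ARU is the minimum across nodes, and only messages at or below that value may be freed:

```python
# Simplified model of ARU ("all received up to") processing (illustrative
# only): a message may be freed only once every node has received it,
# i.e. once its sequence number is at or below the ring-wide ARU.

def node_aru(received):
    """Highest sequence number n such that messages 1..n have all been
    received by this node (gaps stop the count)."""
    n = 0
    while n + 1 in received:
        n += 1
    return n

def ring_aru(per_node_received):
    """Ring-wide ARU: the minimum of every node's ARU."""
    return min(node_aru(r) for r in per_node_received)

# node1 has messages 1-3; node2 has 1 and 3 (missing 2). Nothing above
# sequence 1 may be garbage-collected, and node2 would place message 2
# on its retransmission list.
assert node_aru({1, 2, 3}) == 3
assert node_aru({1, 3}) == 1
assert ring_aru([{1, 2, 3}, {1, 3}]) == 1
```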

With multiple nodes, the token rotates and operates in the same way as with one node, except that a sorted list of processors is created for the token to traverse during the COMMIT phase. All other operations remain the same.

The membership protocol in multi-node configurations is designed to pare down the list of members as processors are determined to be faulty. A processor is determined to be faulty if, at the end of a GATHER session, some processors know about it but others do not. When there is disagreement about a processor, it is added to a "bad" list and the GATHER process is attempted again. Note that eventually a GATHER process could pare the list down to one processor, resulting in a single-processor configuration.
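One paring pass can be sketched like this (hypothetical Python; the function name and node names are made up for illustration):

```python
# Simplified model of membership paring (illustrative only): a processor
# that some nodes see and others do not is moved to the "bad" list, and
# GATHER is retried with the processors everyone agrees on.

def pare(views):
    """views: {processor: set of processors it currently sees}.
    Returns (agreed_members, bad) after one paring pass."""
    everyone = set().union(*views.values())
    agreed = {p for p in everyone
              if all(p in seen for seen in views.values())}
    return agreed, everyone - agreed

# node3 is seen by node1 but not by node2, so node3 goes on the bad list
# and GATHER would be retried with node1 and node2 only.
views = {"node1": {"node1", "node2", "node3"},
         "node2": {"node1", "node2"}}
agreed, bad = pare(views)
assert agreed == {"node1", "node2"} and bad == {"node3"}
```

Repeated passes model the behavior described above: in the worst case the agreed set shrinks to a single processor, yielding a single-processor configuration.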

Every time Totem receives a multicast message from a processor it has not heard from before, all processors enter GATHER and restart the normal membership protocol, while maintaining virtual synchrony.
