Exploring RHEL High Availability's Components, Concepts, and Features - Member Communication and Heartbeat Monitoring

Applicable Environments

  • Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8 and 9 with the High Availability Add-On
    • There may be minor variations in the implementation of this concept across these releases. This article attempts to describe the general concepts, whereas implementation, configuration, or other aspects may differ slightly in each release.
  • RHEL 6, 7, 8 and 9: corosync component installed
  • RHEL 5: openais component installed

Concept Overview

Basics

The RHEL High Availability Add-On is aimed at managing shared applications or resources across multiple servers for the purposes of minimizing downtime, handling failures, balancing load, maintaining data integrity through failures, and more. To achieve this, at a basic level there must be coordination and communication across multiple RHEL servers so that work can be carried out.

corosync and openais serve these communication and membership functions in their respective RHEL releases. That is, the High Availability cluster is started on each node through one of the various means available in that release (pcs, ccs, Conga, cman, etc.), which loads the core cluster engine. That engine then establishes a membership with any other available members of the cluster using the protocols in corosync or openais. Once nodes are participating in a membership, also known as a "configuration" or a "ring", they continuously exchange messages amongst themselves to indicate their status, monitor the presence of members through heartbeats, relay application states or data, and coordinate any membership transitions.

The way in which these communications are implemented and manifest can be an important detail to administrators and engineers working with High Availability clusters, because they are central to almost all work and behaviors that are carried out. If a node is removed from the membership and fenced, knowing how those communications are passed around can reveal key clues about what caused the incident. If nodes are unable to establish a membership together, then knowing how the transport protocols work and what the environment needs to provide to them can help solve that problem. If a component of the cluster is exhibiting poorer-than-expected performance, knowing how that component messages its details can help identify or eliminate possible contributing factors.

This article aims to outline the details of how nodes communicate in a cluster.

Messages vs Tokens

There are two general types of communications at the basic membership layer of the cluster:

  • ORF tokens: "Ordering, Reliability, and Flow Control" tokens contain stateful information about the cluster, indicating how the membership is laid out and details about the messages that have been transmitted or received. These tokens serve as the "heartbeat" that allows nodes to monitor whether all members are still communicating properly. These heartbeat packets are sent directly from one node to another over the udp or udpu transport on RHEL 5, 6, and 7, or the knet transport on RHEL 8 and 9.

  • Messages: Messages are transmissions sent via the transport protocol to all members equally using a unicast, broadcast, multicast, or kronosnet facility. These messages are used in the formation of a membership: when a node starts within the cluster, it sends a message to all available members to let them know it is ready to join, which sets in motion the process of establishing the list of members that will function together in the cluster. Messages are also used to transmit state and data for clustered applications participating in a Closed Process Group (CPG), which generally means those applications are coordinating work across multiple members of the cluster.
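Both the transport and the timing behavior described above are configured in the totem section of corosync.conf. The fragment below is purely illustrative - the values shown are examples, not recommendations, and available options differ by release:

```
totem {
    version: 2
    # Transport for cluster communication:
    # udp (multicast) or udpu (unicast) on RHEL 6/7, knet on RHEL 8/9
    transport: knet
    # Time in ms to wait for the token before declaring a token loss
    token: 3000
    # Time in ms to wait for consensus during membership formation
    consensus: 3600
}
```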

Membership Formation

When a node first starts in the cluster - by having corosync or openais spawned - it will transmit a "join" message to all listening members using the configured transport protocol. Any members that are present - whether they were members previously or are also joining now - will change state in response. Receipt of this message leads to what is often seen in the system logs as the "gather" state, in which the active members communicate back and forth to establish which members are present and which other nodes each one is in contact with. This process waits up to the "consensus timeout", at which point the nodes that share a consistent view of each other will establish themselves in a membership together.

This membership is represented by a list of nodes that is ordered the same way on each member - the ordering is derived sequentially from the member IP addresses. This list is then used to establish a "ring" through which each node communicates with the next in the list and the last node in the list communicates back with the first. This ring formation allows for quick identification when a member has stopped communicating, as is covered below.

In RHEL 8 or later cluster installations with knet protocol, the ring order is derived from the nodeid, not the IP address. Due to this, the IP addresses (links) can be changed dynamically.
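The ring ordering described above can be sketched as follows. This is a hypothetical model for illustration only, not corosync's actual implementation: members are sorted by IP address (udp/udpu) or by nodeid (knet), and each node is paired with its successor, with the last wrapping back to the first.

```python
import ipaddress

def ring_order(members, by="ip"):
    """Order members and pair each with its successor in the ring.

    members: list of dicts with "nodeid" and "ip" keys.
    by: "ip" for the udp/udpu-style ordering, "nodeid" for knet.
    """
    if by == "ip":
        ordered = sorted(members, key=lambda m: ipaddress.ip_address(m["ip"]))
    else:
        ordered = sorted(members, key=lambda m: m["nodeid"])
    # Each node sends the token to the next; the last wraps to the first.
    return [(ordered[i]["nodeid"], ordered[(i + 1) % len(ordered)]["nodeid"])
            for i in range(len(ordered))]

members = [
    {"nodeid": 3, "ip": "192.168.1.20"},
    {"nodeid": 1, "ip": "192.168.1.30"},
    {"nodeid": 2, "ip": "192.168.1.10"},
]

print(ring_order(members, by="ip"))      # ordering follows the IP addresses
print(ring_order(members, by="nodeid"))  # ordering follows the nodeids (knet)
```

Note how the same three nodes produce a different ring depending on the ordering key, which is why knet clusters can change member IP addresses (links) without disturbing the ring.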

Token Passing During Normal Operation

Once nodes have established this membership, or ring, one of them will originate the first ORF token. These tokens are packets that are always sent directly from one node to the next, as opposed to using the transport protocol to send it to all members. This direct node-to-node communication is often referred to as "unicast", in contrast to messages which would be transmitted over "multicast" or "broadcast" facilities - unicast means one-on-one communication between nodes.

After the first node sends the token to the next node in the ring, that node will send one to the next in the ring, and so on. This token serves as the heartbeat that allows a node to detect if one member is no longer responsive, and also stores information about messages that have been sent and received, so that nodes can establish whether they've received everything, whether a retransmit is needed, and more.
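The rotation above can be modeled as a simple round-robin pass. This is an illustrative toy, not corosync internals - it only shows the order in which the token visits the ring:

```python
from collections import deque

def rotate_token(ring, rotations):
    """Simulate the ORF token passing around a ring of node ids.

    Returns the ordered list of (sender, receiver) hops.
    """
    hops = []
    nodes = deque(ring)
    for _ in range(rotations * len(ring)):
        sender = nodes[0]
        nodes.rotate(-1)          # advance to the next node in the ring
        receiver = nodes[0]
        hops.append((sender, receiver))
    return hops

# One full rotation on a three-node ring:
print(rotate_token([1, 2, 3], rotations=1))  # [(1, 2), (2, 3), (3, 1)]
```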

Detection of Missing Members through Heartbeats

As noted, the ORF token allows nodes to recognize when a member of the ring is unable to communicate for some reason - such as being unresponsive/hung, having crashed, experiencing a network disruption, or some other serious problem preventing communication. The cluster components hooked into corosync or openais need to be aware of the loss of a member so they can carry out their High Availability duties in response, so that component uses these tokens to monitor the responsiveness of other members and decide when it is time to re-discover which members are present and possibly update its membership list.

For example, let's say the ring is laid out such that node 1 sends the token to node 2, which sends the token to node 3, which sends it back to node 1. If node 1 sends the token to node 2 but then never receives a token after that, node 1 will periodically retransmit its earlier token just in case it got dropped along the way. After waiting a configured amount of time - the "totem token timeout" - node 1 will declare that some node has failed to send a token. Such a discovery will send this node - and all nodes that recognized a "token loss" - back into their "gather" state as described in the Membership Formation process above. In that state, these remaining members will send out a join message to all available nodes using the transport protocol, causing them all to communicate and establish a new membership.
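The failure scenario above can be reduced to a toy model. This sketch assumes only that failed nodes drop out silently; it is not corosync's implementation, just the shape of the logic:

```python
def detect_and_regather(ring, alive):
    """Pass the token around the ring; if any hop fails, fall back to a
    'gather' among the responsive nodes and re-form the membership.

    ring: node ids in ring order; alive: set of responsive node ids.
    Returns (token_lost, new_ring).
    """
    token_lost = any(node not in alive for node in ring)
    if not token_lost:
        return False, ring
    # Token loss: the remaining members exchange join messages and form
    # a new membership containing only the nodes they can all reach.
    new_ring = [node for node in ring if node in alive]
    return True, new_ring

# Node 2 hangs: nodes 1 and 3 declare a token loss and re-form the ring.
print(detect_and_regather([1, 2, 3], alive={1, 3}))  # (True, [1, 3])
```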

If that membership has changed, then corosync will declare that a "processor failed" (a processor being a node), or openais will declare that a "token was lost"; it will then log the new membership while it notifies other hooked-in components that the membership has changed. Those components will do what they need to do in response, such as fencing failed nodes, recovering resources, etc.

Application Messaging with CPG

Throughout the life of the cluster, the transport protocol may be used to send additional messages to all members by applications using the Closed Process Group (CPG) functionality. That is, you may have an application such as pacemaker, rgmanager, or clvmd that runs on multiple members of the cluster and needs to keep track of activities across them so those instances can coordinate their work: pacemaker needs to know what is happening on each node and be notified of events, clvmd needs to lock volumes to prevent multiple nodes from writing conflicting metadata, and so forth. CPG gives these applications a simple way to send a message and have it delivered to all nodes where an application has joined a particular named group, without having to manage the details of that transmission and membership tracking themselves. pacemaker can say "send message with contents C to all members", and CPG takes care of getting it to the pacemaker instances listening on nodes 2 and 3.
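The CPG pattern can be sketched as a named group supporting join and broadcast operations. This is a simplified in-process model for illustration; the real corosync CPG interface is a C library, and real delivery crosses the network:

```python
from collections import defaultdict

class ClosedProcessGroups:
    """Toy CPG model: processes join named groups, and a message sent
    to a group is delivered to every joined process's callback."""

    def __init__(self):
        self.groups = defaultdict(list)  # group name -> list of callbacks

    def join(self, group, callback):
        self.groups[group].append(callback)

    def send(self, group, sender, message):
        # Delivery goes to every member of the group, sender included.
        for callback in self.groups[group]:
            callback(sender, message)

cpg = ClosedProcessGroups()
received = []
for node in ("node1", "node2", "node3"):
    # Each node's pacemaker-like process joins the same named group.
    cpg.join("pacemaker", lambda sender, msg, node=node: received.append((node, sender, msg)))

cpg.send("pacemaker", "node1", "resource state update")
print(received)  # all three nodes receive node1's message
```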

The transport protocol is responsible for ensuring delivery of that message to all members. If corosync or openais is unable to deliver that message, it will keep retransmitting from a node that has the message contents (such as the original node that sent it) until it is delivered, or until a maximum number of retries is exceeded and the cluster declares a fault has occurred, triggering reestablishment of the membership to try to fix the issue or remove any problematic nodes.
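The retry-until-delivered-or-fault behavior might be modeled like this. The function and its parameters are hypothetical names for illustration; the actual retry limits are governed by corosync/openais configuration:

```python
def deliver_with_retries(send, max_retries):
    """Keep retransmitting a message until the transport reports success,
    or the retry limit is exceeded and a fault is declared, triggering
    re-establishment of the membership.

    send: callable returning True once the message has been delivered.
    """
    for _ in range(max_retries):
        if send():
            return "delivered"
    return "fault"

# A flaky transport that succeeds on the third attempt:
attempts = {"count": 0}
def flaky_send():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(deliver_with_retries(flaky_send, max_retries=5))  # delivered
```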

Additional Knowledge Content

  • Settings and features to control communication delivery and reliability - Coming soon
  • Transport protocols