How to change totem token timeout value in a RHEL 5, 6, 7, 8 or 9 High Availability cluster?
Environment
- Red Hat Enterprise Linux 5 (with the High Availability Add-On)
- Red Hat Enterprise Linux 6 (with the High Availability Add-On)
- Red Hat Enterprise Linux 7 (with the High Availability Add-On)
- Red Hat Enterprise Linux 8 (with the High Availability Add-On)
- Red Hat Enterprise Linux 9 (with the High Availability Add-On)
Issue
- I have a Red Hat Enterprise Linux cluster using the default token timeout value of 10 seconds. I want to increase this value to make the cluster more resilient against unresponsive nodes, network interruptions, or similar delays.
- I need to modify the cluster poll rate.
- How do I change/increase/modify cluster heartbeat token timeout value?
- How do I avoid problems due to a temporary network glitch in a cluster?
- How do I prevent cluster fencing due to temporary network split?
- How do I increase the amount of time required to detect failed nodes?
- How long will a Red Hat cluster wait before it fences the unresponsive node?
Resolution
NOTE: Increasing the cluster token timeout increases the amount of time required to detect failed nodes, and therefore the amount of time for cluster services to recover from a node failure. This setting should only be changed in situations where the implications of how it affects the operation of the cluster are fully understood. Red Hat only tests and supports totem token values in the range of 5000 ms to 300000 ms (5 seconds to 5 minutes) for RHEL 5 and 6 HA Clusters utilizing cman/rgmanager.
The value you choose is environment specific and can depend on a number of factors on both the hardware and application side. For example, virtualized environments susceptible to CPU pauses from steal time, or cloud environments that may become busy due to vendor backup processes, may require more time (15000 to 30000 ms) than bare metal, which may operate fine at the default timeouts. It is recommended to work with your hardware vendor and to consider your application workloads when deciding on an appropriate timeout value.
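As a rough illustration of the tested range mentioned in the note above, the following shell sketch checks a candidate value before it is written to the cluster configuration. The function name token_in_tested_range is hypothetical, and the bounds apply only to RHEL 5 and 6 clusters using cman/rgmanager, as stated above.

```shell
# Hypothetical helper: succeeds if the given token value (in ms) lies in the
# 5000-300000 ms range that the note above describes for RHEL 5 and 6.
token_in_tested_range() {
    [ "$1" -ge 5000 ] && [ "$1" -le 300000 ]
}

# Example: check a candidate value of 21000 ms.
if token_in_tested_range 21000; then
    echo "21000 ms is within the tested range"
else
    echo "21000 ms is outside the tested range"
fi
```

The check says nothing about whether a value suits a given environment; it only guards against values outside the stated range.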
RHEL 8 or 9
A new pcs command was added with errata RHEA-2021:1737 with the following package(s): pcs-0.10.8-1.el8, pcs-snmp-0.10.8-1.el8 or later (See: bz1774143). This will modify the token timeout and can be done while corosync is running on all nodes.
# pcs cluster config update totem token=10000
Note: this will only update the token value. If the cluster configuration includes parameters derived from the token value, a restart/reload is needed in order to recalculate them in the running configuration.
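Because the command above depends on the pcs package version, a script may want to check the installed version first. The following is a minimal sketch, assuming that the version string is obtained from pcs --version and that sort -V (GNU coreutils) is available. The function name pcs_has_config_update is hypothetical.

```shell
# Hypothetical helper: succeeds if the given pcs version is 0.10.8 or later,
# the version named in the errata above.
pcs_has_config_update() {
    ver=$1
    min=0.10.8
    # With a version sort, the minimum sorts first only when ver >= min.
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

# Assumed usage on a cluster node:
#   if pcs_has_config_update "$(pcs --version)"; then
#       pcs cluster config update totem token=10000
#   fi
```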
RHEL 7, 8 or 9
**With the package corosync-3.1.0-1.el8 or later, if no token value is specified in the corosync configuration, the default is 3000 ms (3 seconds) for a 2-node cluster, increasing by 650 ms for each additional member. This was added with errata [RHBA-2021:1780](https://access.redhat.com/errata/RHBA-2021:1780).**
With earlier corosync packages, if no token value is specified in the corosync configuration, the default is 1000 ms (1 second) for a 2-node cluster, increasing by 650 ms for each additional member. To use a value other than the default, add or edit the token line in the totem stanza of /etc/corosync/corosync.conf. For example:
totem {
version: 2
secauth: off
cluster_name: rhel7-cluster
transport: udpu
rrp_mode: passive
token: 5000 <---
}
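To make the default calculation described above concrete, the following shell sketch computes the effective default for a given node count. The function name default_token_ms is hypothetical. The base value passed in is 3000 for corosync-3.1.0-1.el8 or later, or 1000 for earlier packages, as described above.

```shell
# Hypothetical helper: default token timeout (ms) when no token value is set.
#   $1 = number of cluster nodes
#   $2 = base default in ms (3000 or 1000, depending on the corosync package)
default_token_ms() {
    nodes=$1
    base_ms=$2
    # 650 ms is added for each member beyond the first two.
    extra=$(( nodes > 2 ? nodes - 2 : 0 ))
    echo $(( base_ms + extra * 650 ))
}

default_token_ms 2 3000   # prints 3000
default_token_ms 5 3000   # prints 4950
default_token_ms 5 1000   # prints 2950
```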
Use the following steps when changing the token timeout value:
- Add or edit the token attribute in corosync.conf on all nodes as described above. Alternatively, edit the token attribute on one node and then propagate the updated corosync.conf to the rest of the nodes as follows:
  # pcs cluster sync
- Reload corosync. This command can be run from one node to reload corosync on all nodes and does not require downtime.
  # pcs cluster reload corosync
- Confirm the changes are in effect using the following command:
  # corosync-cmapctl | grep totem.token
  runtime.config.totem.token (u32) = 5000
  ...
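For scripted verification, the runtime value can be pulled out of the corosync-cmapctl output shown in the confirmation step. This is a minimal sketch, assuming the output line has the same form as the sample above. The function name runtime_token_ms is hypothetical.

```shell
# Hypothetical helper: reads corosync-cmapctl output on stdin and prints the
# value of runtime.config.totem.token (the last field on that line).
runtime_token_ms() {
    awk '$1 == "runtime.config.totem.token" { print $NF }'
}

# Assumed usage on a cluster node:
#   corosync-cmapctl | runtime_token_ms
# Demonstration using the sample line from the step above:
printf 'runtime.config.totem.token (u32) = 5000\n' | runtime_token_ms   # prints 5000
```

Matching on the exact key name in the first field avoids picking up other keys that merely begin with totem.token.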
RHEL 6
If no token value is specified in the cluster configuration, the default is 10000 ms, or 10 seconds. To use a value other than the default, add or edit the totem line in /etc/cluster/cluster.conf as a child of the <cluster> element. For example:
<cluster config_version="9" name="rhel6-cluster">
<totem token="21000"/>
[...]
</cluster>
Use the following steps when changing the token timeout value:
- Add or edit the <totem/> element in cluster.conf as described above.
- Propagate the cluster configuration changes to other cluster nodes as described in this solution: How can I propagate changes I've made to /etc/cluster/cluster.conf to all the nodes in my cluster?
- Confirm the changes are in effect using the following command.
  # corosync-objctl | grep totem.token
  totem.token=21000
RHEL 5
If no token value is specified in the cluster configuration, the default is 10000 ms, or 10 seconds. To use a value other than the default, add or edit the totem line in /etc/cluster/cluster.conf as a child of the <cluster> element. For example:
<cluster config_version="9" name="rhel5-cluster">
<totem token="21000"/>
[...]
</cluster>
Use the following steps when changing the token timeout value:
- Add or edit the <totem/> element in cluster.conf as described above.
- Propagate the cluster configuration changes to other cluster nodes as described in this solution: How can I propagate changes I've made to /etc/cluster/cluster.conf to all the nodes in my cluster?
- Stop the cluster daemons on all nodes.
- Start the cluster daemons on all nodes.
- Confirm the changes are in effect using the following command.
  # grep Token /var/log/messages
  [...] openais[11294]: [TOTEM] Token Timeout (11000 ms) retransmit timeout (495 ms)
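On RHEL 5 the effective value appears only in the log message, so a script has to extract it from that line. The following is a minimal sketch, assuming the message format matches the sample above. The function name logged_token_ms is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and prints the token timeout
# (ms) from each "Token Timeout (N ms)" message found.
logged_token_ms() {
    sed -n 's/.*Token Timeout (\([0-9][0-9]*\) ms).*/\1/p'
}

# Assumed usage on a cluster node:
#   grep Token /var/log/messages | logged_token_ms | tail -n 1
# Demonstration using the sample line from the step above:
printf '%s\n' 'openais[11294]: [TOTEM] Token Timeout (11000 ms) retransmit timeout (495 ms)' | logged_token_ms   # prints 11000
```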
Notes
- A full cluster restart is needed to change the token timeout in RHEL 5. For a list of the services that need to be restarted, see What services/daemons are available for Red Hat Enterprise Linux Cluster?
- On RHEL 6 and later, a token timeout change does not require a full cluster restart. Note that in RHEL 6, after the modification of /etc/cluster/cluster.conf, the file should be propagated to all cluster nodes.
Root Cause
totem token is the number of milliseconds before a node will consider the token to have been lost, at which point it will initiate procedures to determine which nodes are still responsive and should remain members versus which nodes are missing and should be removed. In a basic sense, this setting controls how long a node can fail to respond before action is taken to remove it from the cluster.