High Availability cluster node logs the message "Corosync main process was not scheduled for X ms (threshold is Y ms). Consider token timeout increase."
Environment
- Red Hat Enterprise Linux 6, 7, 8 or 9 (with the High Availability Add-on)
Issue
- A token was lost for a cluster node, and the following messages were logged:
Dec 15 00:10:39 node42 corosync[33376]: [TOTEM ] Process pause detected for 14709 ms, flushing membership messages.
Dec 15 00:10:39 node42 corosync[33376]: [MAIN ] Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Resolution
Determine what on the system is utilizing CPU heavily enough to prevent corosync from being scheduled or from executing. For more information on why a cluster node was evicted, see the following solution: What does the message "A processor failed, forming new configuration." from corosync mean on a RHEL 6, 7, 8 or 9 High Availability cluster node?
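As a starting point, a quick snapshot of the heaviest CPU consumers and of corosync's scheduling policy can narrow down the cause. This is a minimal sketch, not a supported diagnostic procedure; pidof and chrt are assumed to be available (they ship with procps-ng and util-linux on RHEL):

```shell
# Snapshot of the heaviest CPU consumers right now (a point-in-time view,
# not a history; combine with sar data from the time of the event).
ps -eo pid,comm,pcpu --sort=-pcpu | head -n 6

# corosync is expected to run with a realtime scheduling policy;
# chrt reports the current policy and priority for a given PID.
COROSYNC_PID=$(pidof corosync 2>/dev/null)
if [ -n "$COROSYNC_PID" ]; then
    chrt -p "$COROSYNC_PID"
fi
```

If chrt reports SCHED_OTHER rather than a realtime policy, corosync can be starved by ordinary workloads, which is exactly the RHEL 7.0 - 7.1 issue described under Diagnostic Steps below.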
Root Cause
This corosync message is an enhancement that helps to distinguish whether the token was lost due to a network issue or due to a scheduling issue on the cluster node. Resource starvation either on the cluster node or on the hypervisor (in the case of cluster nodes that are virtual machines) is the most common cause of a failure to schedule the corosync main process.
Diagnostic Steps
In most cases it is difficult to know why corosync was not scheduled. On RHEL 6.5 and later cluster nodes, when corosync is not scheduled it logs a message stating how long it went unscheduled:
Oct 1 02:43:21 node42 corosync[2560]: [MAIN ] Corosync main process was not scheduled for 25457.5664 ms (threshold is 24800.0000 ms). Consider token timeout increase.
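The message itself contains the two numbers that matter: how long corosync was paused and the threshold it exceeded. A quick, hedged sketch of pulling both out of such a line and reporting the overshoot (the sample line below mirrors the message above; in practice you would grep your own /var/log/messages):

```shell
# Extract the pause duration and threshold (both in ms) from a
# "not scheduled" message and report by how much the pause overshot.
msg='corosync[2560]: [MAIN ] Corosync main process was not scheduled for 25457.5664 ms (threshold is 24800.0000 ms). Consider token timeout increase.'

echo "$msg" \
  | sed -n 's/.*not scheduled for \([0-9.]*\) ms (threshold is \([0-9.]*\).*/\1 \2/p' \
  | awk '{ printf "paused %.0f ms, threshold %.0f ms, overshoot %.0f ms\n", $1, $2, $1 - $2 }'
```

A small overshoot may be addressed by raising the token timeout, as the message suggests; a pause many times the threshold (as in the Issue section above) points to serious resource starvation instead.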
- Make sure that a recent version of corosync is installed. A similar issue occurred on RHEL 7.0 - 7.1, where corosync was not run with realtime priority. For more information, see the following issue: corosync does not run with a realtime scheduling priority in a RHEL 7 High Availability cluster.
- Review the sar data from when the token loss event occurred and look for high load or resource starvation. By default, sar data is captured every 10 minutes and may show no load around the time of the event. This is not conclusive evidence that resource starvation was absent; a 10-minute sampling interval may simply be too coarse to detect the issue. If the issue continues to occur, capture supplemental system utilization statistics with ha-resourcemon for more diagnostic information.
- If the cluster node is a virtual machine, check whether the hypervisor experienced resource starvation when the event occurred. A hypervisor under heavy load can pause processes on a virtual machine, including the corosync main process. If this message appears on multiple cluster nodes simultaneously and the nodes are all virtual machines running on the same hypervisor, that is a good indication that some type of resource starvation was occurring on the hypervisor. If the cluster nodes are virtual machines, use the spausedd tool to help determine whether there is resource starvation on the virtual machines or the hypervisor that could lead to corosync not being scheduled: How can we determine why corosync was not scheduled, causing an eviction?
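For the virtual-machine case, one quick supplementary indicator of hypervisor-level starvation is CPU steal time: jiffies during which the guest was runnable but the hypervisor did not schedule it. A minimal, Linux-specific sketch reading the aggregate steal counter (field 9 of the "cpu" line in /proc/stat); spausedd remains the recommended tool for a proper determination:

```shell
# Print the cumulative 'steal' jiffies across all CPUs. A value that
# grows quickly between two samples taken a few seconds apart suggests
# the hypervisor is starving this guest of CPU time.
awk '/^cpu /{ print "steal jiffies:", $9 }' /proc/stat
```

Take the reading twice, a known interval apart, and compare: on a healthy guest the counter barely moves, while a rapidly climbing value around the time of the token loss corroborates starvation on the hypervisor.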
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.