High Availability cluster node logs the message "Corosync main process was not scheduled for X ms (threshold is Y ms). Consider token timeout increase."
Environment
- Red Hat Enterprise Linux 6, 7, 8 or 9 (with the High Availability Add-on)
Issue
- A token was lost for a cluster node, and the following messages were logged:
Dec 15 00:10:39 node42 corosync[33376]: [TOTEM ] Process pause detected for 14709 ms, flushing membership messages.
Dec 15 00:10:39 node42 corosync[33376]: [MAIN ] Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Resolution
Determine what on the system is utilizing CPU heavily enough to prevent corosync from being scheduled or from executing. For more information on why a cluster node was evicted, see the following solution: What does the message "A processor failed, forming new configuration." from corosync mean on a RHEL 6, 7, 8 or 9 High Availability cluster node?
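As a starting point, a quick snapshot of the heaviest CPU consumers and of corosync's scheduling policy can narrow down the cause. This is a minimal sketch, not a supported diagnostic procedure; pidof and chrt are assumed to be available (they ship with procps-ng and util-linux on RHEL):

```shell
# Snapshot of the heaviest CPU consumers right now (a point-in-time view,
# not a history; combine with sar data from the time of the event).
ps -eo pid,comm,pcpu --sort=-pcpu | head -n 6

# corosync is expected to run with a realtime scheduling policy;
# chrt reports the current policy and priority for a given PID.
COROSYNC_PID=$(pidof corosync 2>/dev/null)
if [ -n "$COROSYNC_PID" ]; then
    chrt -p "$COROSYNC_PID"
fi
```

If chrt reports SCHED_OTHER rather than a realtime policy, corosync can be starved by ordinary workloads, which is exactly the RHEL 7.0 - 7.1 issue described under Diagnostic Steps below.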
Root Cause
This corosync message is an enhancement that helps to distinguish whether the token was lost due to a network issue or due to a scheduling issue on the cluster node. Resource starvation either on the cluster node or on the hypervisor (in the case of cluster nodes that are virtual machines) is the most common cause of a failure to schedule the corosync main process.
Diagnostic Steps
In most cases it is difficult to know why corosync was not scheduled. On RHEL 6.5 and later cluster nodes, when corosync is not scheduled it logs a message stating how long it went unscheduled:
Oct 1 02:43:21 node42 corosync[2560]: [MAIN ] Corosync main process was not scheduled for 25457.5664 ms (threshold is 24800.0000 ms). Consider token timeout increase.
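The message itself contains the two numbers that matter: how long corosync was paused and the threshold it exceeded. A quick, hedged sketch of pulling both out of such a line and reporting the overshoot (the sample line below mirrors the message above; in practice you would grep your own /var/log/messages):

```shell
# Extract the pause duration and threshold (both in ms) from a
# "not scheduled" message and report by how much the pause overshot.
msg='corosync[2560]: [MAIN ] Corosync main process was not scheduled for 25457.5664 ms (threshold is 24800.0000 ms). Consider token timeout increase.'

echo "$msg" \
  | sed -n 's/.*not scheduled for \([0-9.]*\) ms (threshold is \([0-9.]*\).*/\1 \2/p' \
  | awk '{ printf "paused %.0f ms, threshold %.0f ms, overshoot %.0f ms\n", $1, $2, $1 - $2 }'
```

A small overshoot may be addressed by raising the token timeout, as the message suggests; a pause many times the threshold (as in the Issue section above) points to serious resource starvation instead.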
- Make sure that a recent version of corosync is installed. A similar issue occurred on RHEL 7.0 - 7.1, where corosync was not run with realtime priority. For more information, see the following issue: corosync does not run with a realtime scheduling priority in a RHEL 7 High Availability cluster.
- Review the sar data from when the token loss event occurred and look for high load or resource starvation. By default, sar data is captured every 10 minutes and may show no load around the time of the event. This is not conclusive evidence that resource starvation was absent; a 10-minute sampling interval may simply be too coarse to detect the issue. If the issue continues to occur, capture supplemental system utilization statistics with ha-resourcemon for more diagnostic information.
- If the cluster node is a virtual machine, check whether the hypervisor experienced resource starvation when the event occurred. A hypervisor under heavy load can pause processes on a virtual machine, including the corosync main process. If this message appears on multiple cluster nodes simultaneously and the nodes are all virtual machines running on the same hypervisor, that is a good indication that some type of resource starvation was occurring on the hypervisor. If the cluster nodes are virtual machines, use the spausedd tool to help determine whether there is resource starvation on the virtual machines or the hypervisor that could lead to corosync not being scheduled: How can we determine why corosync was not scheduled, causing an eviction?
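For the virtual-machine case, one quick supplementary indicator of hypervisor-level starvation is CPU steal time: jiffies during which the guest was runnable but the hypervisor did not schedule it. A minimal, Linux-specific sketch reading the aggregate steal counter (field 9 of the "cpu" line in /proc/stat); spausedd remains the recommended tool for a proper determination:

```shell
# Print the cumulative 'steal' jiffies across all CPUs. A value that
# grows quickly between two samples taken a few seconds apart suggests
# the hypervisor is starving this guest of CPU time.
awk '/^cpu /{ print "steal jiffies:", $9 }' /proc/stat
```

Take the reading twice, a known interval apart, and compare: on a healthy guest the counter barely moves, while a rapidly climbing value around the time of the token loss corroborates starvation on the hypervisor.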
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.