How can we determine why corosync was not scheduled, causing an eviction?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 7, 8, 9 (with the High Availability Add On)

Issue

  • A token loss event was observed around the time of a "Corosync main process was not scheduled for X ms" message.
Dec 15 00:10:39 node42 corosync[33376]:   [TOTEM ] Process pause detected for 14709 ms, flushing membership messages.
Dec 15 00:10:39 node42 corosync[33376]:   [MAIN  ] Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.

Resolution

The spausedd utility is used for detecting and logging scheduler pause on virtual machines -- i.e., detecting when a process should have been scheduled but was not scheduled. The utility uses steal time to detect whether the vCPUs of the virtual machine are being processed in a timely fashion. Steal time is the percentage of time a virtual CPU (on the VM) spends waiting for a real CPU (on the hypervisor) while the hypervisor is busy servicing another virtual processor. If the steal time is greater than 0 then the hypervisor is stealing CPU from the VM. (in other words, the VM is having to wait to run). The value of steal time (8th column in /proc/stat or in top output called X.X st) can be any double between 0 and 1 (ex: 0.42). The higher the value, the more time the virtual machine is having to wait to run.

If the steal time is greater than zero at the time of the "Corosync main process was not scheduled" message, this points to some type of resource starvation on the hypervisor as the cause of corosync not being scheduled in a timely fashion (and, in some cases, the cause of a fence event).

Should I install spausedd on physical (baremetal) cluster nodes?

No, because the steal time metric is a measurement between the virtual machine (cluster node) and the hypervisor for environments that report steal time. Installing `spausedd` on a physical machine will provide no useful value as the steal time will be 0.
What can spausedd detect?

*This utility is useful for detecting whether the hypervisor (i.e., the host running the VM) is having performance issues that are impacting the virtual machine.*
  • spausedd support VMs where steal time is reported ether by kernel directly or supported by VMGuestLib[1].
  • spausedd is able to detect a pause where corosync was not scheduled in a timely fashion on the VM.
  • spausedd is able to detect the length of the pause on the VM.
  • spausedd should indicate whether the problem is due to an overloaded hypervisor for hypervisors where the kernel has a way of getting steal time (e.g., KVM, VMware [1]).

[1] Support for VMware has been added with the following packages:

Do note that spausedd has not been tested in all possible environments that can be identified with virt-what. There is some known VM environments where spausedd should work: KVM, RHEV, VMware, and AWS (Amazon Web Services). The only way to verify that your VM is supported is by monitoring the st% reported by top or output of the following command (do note that 0 equals not supported or no steal time reported yet, so might have to monitor for some period of time):

# grep 'cpu ' /proc/stat  | awk '{ print $9 }'
Installation

The `spausedd` package can be installed with the following command:
# yum -y install spausedd

The service can be enabled or disabled with systemctl:

# systemctl enable spausedd.service

The service can be started with systemctl:

# systemctl start spausedd.service
# systemctl status spausedd.service
● spausedd.service - Scheduler Pause Detection Daemon
   Loaded: loaded (/usr/lib/systemd/system/spausedd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-08-08 09:42:29 EDT; 10s ago
     Docs: man:spausedd
  Process: 3742 ExecStart=/usr/bin/spausedd -D (code=exited, status=0/SUCCESS)
 Main PID: 3743 (spausedd)
   CGroup: /system.slice/spausedd.service
           └─3743 /usr/bin/spausedd -D

Aug 08 09:42:29 rhel7-1.examplerh.com systemd[1]: Starting Scheduler Pause Detection Daemon...
Aug 08 09:42:29 rhel7-1.examplerh.com systemd[1]: Started Scheduler Pause Detection Daemon.
Aug 08 09:42:29 rhel7-1.examplerh.com spausedd[3743]: Running main poll loop with maximum timeout 200 and steal threshold 10%
Examples

When `spausedd` is started it should log a message similar to the following:
Mar 20 15:01:54 spausedd: Running main poll loop with maximum timeout 200 and steal threshold 10%

There may also be "Not scheduled" messages:

Mar 20 15:02:15 spausedd: Not scheduled for 0.2089s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
Mar 20 15:02:16 spausedd: Not scheduled for 0.2258s (threshold is 0.2000s), steal time is 0.0000s (0.00%)

This means that spausedd got time to run but that it took slightly longer than expected (above the default threshold). Additionally, steal time was 0.00%, so spausedd is running ether on a physical machine (where steal time will always be 0) or on a VM where the host machine is not overloaded (the VM is scheduled on time).

Below is an example of when spausedd detects that the hypervisor is overloaded.

Mar 20 15:08:20 spausedd: Not scheduled for 0.9598s (threshold is 0.2000s), steal time is 0.7900s (82.31%)
Mar 20 15:08:20 spausedd: Steal time is > 10.0%, this is usually because of overloaded host machine

This means that spausedd didn't get the time to run for almost one second. Further, the high steal time means that spausedd was not scheduled because the VM wasn't scheduled by the host machine.

For more information on the corosync process not being scheduled, see the following solution: High Availability cluster node logs the message "Corosync main process was not scheduled for X ms (threshold is Y ms). Consider token timeout increase."

For more information on spausedd, see the man page.

# man spausedd

Root Cause

In some cases a cluster node can be evicted from the cluster is the corosync process is not scheduled in a timely fashion. There are 2 main reason that corosync was not able to be scheduled:

  • An overloaded cluster node which causes resource starvation. To determine if the corosync was not scheduled because of overloaded VM then collect sar or pcp data.
  • A virtualization lag due to overloaded hypervisor (ex. too many VMs, taking snapshots of a VM, etc) that steals CPU from one of the VMs it is hosting. We can use spausedd to detect this issue.
SBR
Components
Category

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.