RHV: How does cluster scheduling policy work?
Environment
- Red Hat Virtualization 4.4
Issue
- How does cluster scheduling policy work?
Resolution
A cluster policy works by using a set of rules to determine how to schedule virtual machines amongst the hosts in the cluster.
A Virtual Machine is scheduled when it is started or migrated. The scheduling process determines a host for the VM to start on or migrate to. It works in 3 steps:
1. Filters - One or more enabled filter modules rule out hosts based on specific criteria.
2. Weights - One or more enabled weight modules calculate scores for the remaining hosts, from most suitable (lowest score) to least suitable (highest score).
3. Ranking - Based on the scores, the hosts are ranked from first to last for each weight module. The host with the lowest total accumulated rank is picked and receives the VM.
Finally, if a Load Balancer is in use by the policy, it can trigger automatic migrations of already running VMs in order to accomplish the objective of the Load Balancer.
The Load Balancer rules are checked every 1 minute by default, while the Filters, Weights and Ranking logic happens on demand when Virtual Machines need to be scheduled, either by power on or live migrations.
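The filter, weight and ranking steps above can be sketched in a few lines of Python. This is a simplified illustration, not oVirt code; the module behavior, the factor handling, and all example data are assumptions:

```python
# Illustrative sketch of the filter -> weight -> rank flow described above.
# Not oVirt code; the factor handling and example data are assumptions.

def schedule(vm, hosts, filters, weights):
    # Step 1: Filters rule hosts out entirely.
    candidates = [h for h in hosts if all(f(vm, h) for f in filters)]

    # Steps 2 and 3: each weight module scores the candidates (lower is
    # better), hosts are ranked per module, and the ranks (scaled here by
    # the module's factor, an assumption) are accumulated.
    totals = {h: 0 for h in candidates}
    for factor, score_fn in weights:
        scores = {h: score_fn(vm, h) for h in candidates}
        ranked = sorted(candidates, key=lambda h: scores[h])
        for rank, host in enumerate(ranked, start=1):
            totals[host] += rank * factor
    # The host with the lowest accumulated total receives the VM.
    return min(candidates, key=lambda h: totals[h], default=None)

# Hypothetical cluster: host3 fails a filter, host2 scores best on the
# single weight module, so host2 receives the VM.
hosts = ["host1", "host2", "host3"]
vm = {"memory_mb": 4096}
filters = [lambda vm, h: h != "host3"]  # pretend host3 lacks free memory
weights = [(1, lambda vm, h: {"host1": 5, "host2": 2, "host3": 9}[h])]
```

The `default=None` covers the case where the filters rule out every host, which in RHV surfaces as a scheduling failure rather than a silent no-op.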
So essentially, each policy has 3 sets of logic rules. Their names and descriptions are in the table below:
| Module | Description |
|---|---|
| Filters | These rules determine whether a certain host is considered when deciding where to run a VM. They filter hosts based on requirements such as minimum CPU or RAM. If a host does not meet the minimum requirements, it is not even considered for running the VM. |
| Weights | After some hosts are ruled out by the filters above, the remaining ones are scored and ranked, and the one with the lowest rank receives the VM. These weight rules specify which attributes are taken into account and how important (weight) they are when calculating the hosts' scores. For example, HA weights hosts according to their availability score. If the HA weight is higher than the others, it has a greater impact on the score. |
| Load Balancer | This logic determines which hosts are under- or over-utilized, and initiates migrations accordingly. For example, the OptimalForPowerSaving balancer is used to consolidate loads onto fewer hosts. Each balancer has a few properties to further define and customize its behavior. The Load Balancer will not start or stop VMs, only initiate migrations to satisfy its rules. |
The policies using the 3 criteria above can be customized and/or created in Admin Portal->Configure->Cluster Policies. However, RHV ships with 5 pre-defined policies that should cover most use cases: none, cluster_maintenance, evenly_distributed, vm_evenly_distributed and power_saving. Please see the explanations for each of the default policies in the RHV 4.4 Documentation - Technical Reference - LOAD BALANCING, SCHEDULING, AND MIGRATION
Filters Details
The available filters are described in the table below. Please note that Mandatory filters (last column in the table below) are always enabled and not available for user configuration.
| Filter | Description | Parameters | Used by default policy | Mandatory |
|---|---|---|---|---|
| ClusterInMaintenance | Host cannot receive the VM unless the VM is HA or being Migrated, prevents VMs from starting | None | cluster_maintenance | No |
| Compatibility-Version | Host can only receive the VM if it supports the Compatibility Version of the VM | None | All | Yes |
| CPU | Host can only receive the VM if it has equal or more CPU cores than the VM. The "Count Threads as Cores" cluster setting influences this calculation | None | All | No |
| CPU-Level | Host can only receive the VM if it provides all CPU flags required by the VM CPU model setting (if different from Cluster setting). Note: VMs with Pass-Through CPU will always validate on start, but on migration the destination host must have all the flags that the VM was started with on the first host | None | All | Yes |
| CPUOverloaded | Host is filtered out if CPU usage is above the HighUtilization threshold for at least CpuOverCommitDurationMinutes minutes | HighUtilization, CpuOverCommitDurationMinutes | All | No |
| CpuPinning | Host can only receive the VM if it can satisfy VM CPU pinning requirements | None | All | Yes |
| Emulated-Machine | Host can only receive the VM if it provides the required emulated machine type | None | All | No |
| HA | Host can receive the Hosted-Engine VM only if it has a Hosted-Engine HA score higher than zero and equal or higher than the score of the current host running it | None | All | No |
| HostDevice | Host can receive the VM only if it has passthrough enabled and can provide the host devices required by the VM | None | All | Yes |
| HostedEngineSpares | Host can receive the VM only if there will still be enough available memory in the cluster to provide fail-over capability for the Hosted-Engine VM in case the current host fails | HeSparesCount | All | No |
| HugePages | In case the VM is configured to use Huge Pages, the host can only receive the VM if it can provide the number of Huge Pages necessary for the VM memory | None | All | No |
| InClusterUpgrade | Host can only receive the VM if its software version is equal or higher than the current host running the VM | None | unused, legacy policy | No |
| MDevice | In case the VM is configured with mediated devices (mDev), host can only run the VM if it is able to provide the required mDev | None | All | Yes |
| Memory | Host can only run the VM if it has enough free physical and scheduling memory to satisfy the VM memory requirements | None | All | No |
| Migration | Prevents migration to the same host, in case of DNS error for example | None | All | No |
| Migration-Tsc-Frequency | If the VM is of type High Performance, the Host can receive the VM only if it has the same TSC frequency of the host currently running the VM | None | All | No |
| Network | Host can only receive the VM if it can provide all the networks required by the VMs NICs and Display Network | None | All | No |
| NUMA | If the VM is configured with vNUMA strict or interleave, the Host can only receive the VM if its NUMA nodes can accommodate the Virtual Machine's vNUMA nodes in terms of resources | None | All | No |
| PinToHost | Host can only receive the VM if the VM is pinned to that host | None | All | Yes |
| Swap | Host can only receive the VM if it is not swapping above the threshold | MaximumAllowedSwapUsage | All | No |
| VmAffinityGroups | Host can only run the Virtual Machine if it does not break hard affinity rules for VMs | None | All | No |
| VM leases ready | In case the VM is configured with storage leases, the host can only receive the VM if it is ready to support VM leases | None | All | Yes |
| VmToHostsAffinityGroups | Host can only run the Virtual Machine if it does not break hard affinity rules with Hosts | None | All | No |
Weights Details
The available weights are described in the table below. Note that MaxSchedulerWeight, referenced in the table below, is set to 1000 by default, and the lowest score is best.
| Weight | Description | Parameters | Scoring | Used by default policy[weight] |
|---|---|---|---|---|
| CPU and NUMA pinning compatibility | If the VM has vNUMA and pinning, prefers hosts where the CPU pinning will not clash with the vNUMA pinning | None | 1 if it fits or MaxSchedulerWeight if it doesn't | All[5] |
| CPU for high performance VMs | Prefers hosts that have more or equal number of sockets, cores and threads than the VM | None | 1 if it fits or MaxSchedulerWeight if not | All[4] |
| Fit VM to single host NUMA node | If the VM does not have vNUMA, prefers hosts that can fit the VM in a single physical NUMA node | None | 1 if it fits or MaxSchedulerWeight if not | - |
| HA | For the HostedEngine VM, prefers hosts that have higher HA Score | None | Normalizes the HA Scores to a value between 1 and MaxSchedulerWeight, returning score 1 to the highest HA score and MaxSchedulerWeight to the lowest HA score | All[1] |
| InClusterUpgrade | Prefers migrating VM to host with newer or equal OS versions than the host running the VM, penalizing hosts with older OS | None | 0 if the host OS is same or more recent or 100,000 if at least same major release or 1,000,000 if even older | unused, legacy policy |
| OptimalForCpuEvenDistribution | Prefers hosts with lower CPU usage | None | 1 for the host(s) with the lowest CPU usage, increasing up to MaxSchedulerWeight for the highest CPU usage | none[2], cluster_maintenance[2] and evenly_distributed[2] |
| OptimalForCpuPowerSaving | Prefers hosts with higher CPU usage, but below HighUtilization | HighUtilization | 1 for the host(s) with the highest CPU usage, increasing up to MaxSchedulerWeight for the lowest CPU usage, and MaxSchedulerWeight if CPU usage is above HighUtilization | power_saving[2] |
| OptimalForEvenGuestDistribution | Prefers host with less VMs running | SpmVmGrace | The number of VMs running on the host as the score, offset by SpmVmGrace if the Host is SPM | vm_evenly_distributed[2] |
| OptimalForHaReservation | For clusters with HA reservation enabled and HA VMs, prefers hosts with fewer HA VMs | ScaleDown | 0 if the VM is not HA or Cluster HA reservation is disabled, otherwise returns the count of HA VMs per host, normalized between 0 and 100 and divided by ScaleDown | All[1] |
| OptimalForMemoryEvenDistribution | Prefers hosts with more available memory | None | Divide the scheduling memory of the host by the maximum scheduling memory of any host in the cluster, then multiply by MaxSchedulerWeight and subtract from MaxSchedulerWeight. So that the host with highest memory gets score 0 and hosts with less available memory get scores towards MaxSchedulerWeight as the memory lowers | none[1], evenly_distributed[1] and cluster_maintenance[1] |
| OptimalForMemoryPowerSaving | Prefers hosts with higher memory usage, but within the under- and over-utilization limits | MaxFreeMemoryForOverUtilized, MinFreeMemoryForUnderUtilized | The scheduling memory normalized between 1 and MaxSchedulerWeight, but a high value if the memory is too low (under the limit) | power_saving[1] |
| PreferredHosts | Prefer hosts that the VM is pinned to | None | 0 if the VM is pinned to host or 10,000 if not | All[99] |
| VmAffinityGroups | Prefers hosts that satisfy VM to VM soft affinity rules | None | Rank of hosts from 1 to N, based on how many VM to VM soft affinity group rules would break. Score 1 is the host that breaks the fewest rules and N the most. For ranking, breaking one higher-priority rule is more important than breaking N lower-priority rules | All[1] |
| VmToHostsAffinityGroups | Prefers hosts that satisfy VM to Host soft affinity rules | None | Rank of hosts from 1 to N, based on how many VM to Host soft affinity group rules would break. Score 1 is the host that breaks the fewest rules and N the most. For ranking, breaking one higher-priority rule is more important than breaking N lower-priority rules | All[20] |
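As an illustration of how a weight score is derived, here is a hedged sketch of the OptimalForMemoryEvenDistribution calculation described in the table above. The host memory figures are hypothetical; only MaxSchedulerWeight = 1000 comes from this article:

```python
# Hedged sketch of the OptimalForMemoryEvenDistribution scoring described
# above. MAX_SCHEDULER_WEIGHT matches the default mentioned in this article;
# the host memory figures are hypothetical.
MAX_SCHEDULER_WEIGHT = 1000

def memory_even_score(host_sched_mem, all_sched_mem):
    # Divide the scheduling memory of the host by the maximum scheduling
    # memory of any host, multiply by MaxSchedulerWeight, and subtract
    # from MaxSchedulerWeight.
    max_mem = max(all_sched_mem)
    return MAX_SCHEDULER_WEIGHT - round(host_sched_mem / max_mem * MAX_SCHEDULER_WEIGHT)

mems = [64000, 32000, 16000]  # scheduling memory (MB) of three hosts
scores = [memory_even_score(m, mems) for m in mems]
# The host with the most scheduling memory scores 0 (best); scores rise
# toward MAX_SCHEDULER_WEIGHT as available memory shrinks.
```

With these numbers the 64 GB host scores 0, the 32 GB host 500, and the 16 GB host 750, so VMs land preferentially on the emptiest host.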
Load Balancer Details
Please see below how each Load Balancer uses these properties to schedule VMs.
| Load Balancer | Description and Parameters | Used by default policy |
|---|---|---|
| None | After the VM is started, no further scheduling is done by the cluster policy, but the VM may still be migrated for other reasons, such as host issues | none, cluster_maintenance |
| OptimalForEvenGuestDistribution | If a host has more than HighVmCount VMs running, or if the difference between the host with the most running VMs and the host with the fewest is greater than MigrationThreshold VMs, then migration is started. For score calculation and scheduling on the SPM host, SpmVmGrace is added to the number of running VMs | vm_evenly_distributed |
| OptimalForEvenDistribution | If a host's CPU usage is higher than HighUtilization for more than CpuOverCommitDurationMinutes, then migration starts | evenly_distributed |
| OptimalForPowerSaving | Concentrates load on a subset of hosts while keeping their CPU load below HighUtilization. For hosts whose CPU usage drops below LowUtilization, all remaining VMs are migrated away and the host can be shut down | power_saving |
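The trigger condition of the OptimalForEvenDistribution balancer can be sketched as follows. This is a simplified illustration; the sampling granularity and threshold values are assumptions, not engine code:

```python
# Hedged sketch of the OptimalForEvenDistribution trigger described above:
# a host whose CPU usage stays above HighUtilization for longer than
# CpuOverCommitDurationMinutes becomes a migration source. Parameter names
# match the table; the per-minute sampling here is illustrative only.
HIGH_UTILIZATION = 80          # percent (assumed value)
CPU_OVERCOMMIT_DURATION = 2    # minutes (assumed value)

def needs_migration(cpu_samples_per_minute):
    """cpu_samples_per_minute: one CPU-usage sample (%) per minute, newest last."""
    recent = cpu_samples_per_minute[-CPU_OVERCOMMIT_DURATION:]
    return (len(recent) >= CPU_OVERCOMMIT_DURATION
            and all(s > HIGH_UTILIZATION for s in recent))
```

For example, `needs_migration([50, 85, 90])` is true (over the threshold for the whole window), while `needs_migration([90, 85, 50])` is false because the latest sample dropped back below it.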
Diagnostic Steps
To see or debug scheduling decisions, scores and ranking in RHV, enable debug logging on org.ovirt.engine.core.bll.scheduling. Below is an example:
Scenario:
- Cluster with just 2 hypervisors
- Simplified scheduling policy, the only weight is VmAffinityGroups, with a factor of 10.
- 2 VMs, with negative affinity.
1. Run the first VM, here is the debug output:
2021-03-08 21:21:48,365Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-386) [a1d49e42-ce62-4067-a68d-de7f9e9c66c9] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;1;1;1;1
The above is to be read as a table; the 84e6ddee row is the only weight configured on the simplified scheduling policy. That ID corresponds
to VmAffinityWeightPolicyUnit (named VmAffinityGroups). The 1b479109 and f79abb53 columns are the hypervisors:
[RHV-M]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT vds_id,vds_name from vds";
vds_id | vds_name
--------------------------------------+------------------------------------
1b479109-b38e-465b-a8a5-439d79cf43e9 | host2.kvm.local
f79abb53-4374-4d5a-8fb0-f1fde5030643 | host3.kvm.local
[SOURCE CODE]$ grep -rn 84e6ddee-ab0d-42dd-82f0-c297779db567
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/VmAffinityWeightPolicyUnit.java:34: guid = "84e6ddee-ab0d-42dd-82f0-c297779db567",
And under each host we have "1;1". This is <score;rank>
So we can read the log above as:
| Weight Module | Weight | host2-score | host2-rank | host3-score | host3-rank |
|---|---|---|---|---|---|
| VmAffinityGroups | 10 | 1 | 1 | 1 | 1 |
So in this case both hosts got the same rank, as rank ties are allowed.
2. This VM was started on host host2.kvm.local.
3. Now run the second VM, which has negative affinity to the first VM on host2.kvm.local just started above.
Here is the debug output again:
2021-03-08 21:21:59,781Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-393) [0dc5eb6f-88cc-4d5f-8b15-001b5bbf0511] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1
Interpreting the same way as before, it means:
| Weight Module | Weight | host2-score | host2-rank | host3-score | host3-rank |
|---|---|---|---|---|---|
| VmAffinityGroups | 10 | 0 | 2 | 1 | 1 |
4. This time there is a clear preference, as host3.kvm.local got rank 1 and host2.kvm.local got rank 2, so the VM started on host3.kvm.local. This was due to the negative affinity, so the VMs are kept separate as configured.
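The Ranking selector rows shown in the debug output above can be parsed mechanically. Below is a minimal sketch; the column layout (unit ID; factor; then score;rank per host) is inferred from the log samples in this article, so verify it against your own engine.log before relying on it:

```python
# Minimal parser for the "Ranking selector" debug rows shown above. The
# column layout is inferred from the log samples in this article.
def parse_rank_row(header, row):
    hosts = [c for c in header.split(";")[2:] if c]  # host UUID columns
    cells = row.split(";")
    unit_id, factor = cells[0], int(cells[1])
    ranks = {}
    for i, host in enumerate(hosts):
        score, rank = int(cells[2 + 2 * i]), int(cells[3 + 2 * i])
        ranks[host] = (score, rank)
    return unit_id, factor, ranks

# The second scheduling run from the article:
header = "*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;"
row = "84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1"
unit_id, factor, ranks = parse_rank_row(header, row)
# host3 (f79abb53...) gets score 1, rank 1, so it wins this scheduling run
```

This yields factor 10 and, per host, the same (score, rank) pairs shown in the table above.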
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.