RHV: How does cluster scheduling policy work?

Solution Verified - Updated

Environment

  • Red Hat Virtualization 4.4

Issue

  • How does cluster scheduling policy work?

Resolution

A cluster policy works by using a set of rules to determine how to schedule virtual machines amongst the hosts in the cluster.

A virtual machine is scheduled when it is started or migrated. The scheduling process determines a host for the VM to start on or migrate to. It works in 3 steps:
1. Filters - One or more enabled filter modules rule out hosts based on specific criteria.
2. Weights - One or more enabled weight modules calculate scores for the remaining hosts, from most suitable (lowest score) to least suitable (highest score).
3. Ranking - Based on the scores, hosts are ranked from first to last for each weight module. The host with the lowest total accumulated rank is picked and receives the VM.

Finally, if the policy uses a Load Balancer, it can trigger automatic migrations of already-running VMs to accomplish the Load Balancer's objective.

The Load Balancer rules are checked every minute by default, while the Filters, Weights, and Ranking logic runs on demand when virtual machines need to be scheduled, either at power-on or for live migration.
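The three-step flow above can be sketched in a few lines of Python. This is an illustrative model only, not the engine's actual implementation: the `schedule` function, the host dictionaries, and the sample filter and weight callables are all hypothetical.

```python
def schedule(vm, hosts, filters, weights):
    """Pick a host for a VM using the filter -> weight -> rank flow."""
    # Step 1: Filters - rule out hosts that fail any enabled filter module.
    candidates = [h for h in hosts if all(f(vm, h) for f in filters)]
    if not candidates:
        return None  # no host can run the VM

    # Steps 2 and 3: each weight module scores the remaining hosts
    # (lower score = more suitable), hosts are ranked per module
    # (tied scores share a rank), and the host with the lowest total
    # accumulated rank receives the VM.
    total_rank = {h["name"]: 0 for h in candidates}
    for weight in weights:
        scores = {h["name"]: weight(vm, h) for h in candidates}
        distinct = sorted(set(scores.values()))
        for name, score in scores.items():
            total_rank[name] += distinct.index(score) + 1  # ranks start at 1
    return min(candidates, key=lambda h: total_rank[h["name"]])

# Example: a toy memory filter and a "prefer more free memory" weight.
hosts = [{"name": "host2", "free_mem_gib": 8},
         {"name": "host3", "free_mem_gib": 16}]
vm = {"mem_gib": 4}
filters = [lambda vm, h: h["free_mem_gib"] >= vm["mem_gib"]]
weights = [lambda vm, h: -h["free_mem_gib"]]  # more free memory -> lower score
print(schedule(vm, hosts, filters, weights)["name"])  # host3
```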

In essence, each policy has 3 sets of logic rules, named and described in the table below:

| Module | Description |
| --- | --- |
| Filters | These rules determine whether a given host is considered when deciding where to run a VM. They filter hosts based on requirements such as minimum CPU or RAM; if a host does not meet the minimum requirements, it is not even considered for running the VM. |
| Weights | After some hosts are ruled out by the filters above, the remaining ones are scored and ranked, and the one with the lowest rank receives the VM. The weight rules specify which attributes are taken into account and how important (weighted) each one is when calculating host scores. For example, HA weights hosts according to their availability score; if the HA weight is higher than the others, it has a greater impact on the score. |
| Load Balancer | This logic determines which hosts are under- or over-utilized, and initiates migrations accordingly. For example, the OptimalForPowerSaving balancer is used to consolidate load onto fewer hosts. Each balancer has a few properties to further define and customize its behavior. The Load Balancer will not start or stop VMs, only initiate migrations to satisfy its rules. |

The policies using the 3 criteria above can be customized and/or created in the Administration Portal under Configure -> Cluster Policies. However, RHV ships with 5 pre-defined policies that should cover most use cases, named none, cluster_maintenance, evenly_distributed, vm_evenly_distributed, and power_saving. See the explanation of each default policy in the RHV 4.4 Documentation - Technical Reference - Load Balancing, Scheduling, and Migration.

Filters Details

The available filters are described in the table below. Note that Mandatory filters (last column) are always enabled and are not available for user configuration.

| Filter | Description | Parameters | Used by default policy | Mandatory |
| --- | --- | --- | --- | --- |
| ClusterInMaintenance | Host cannot receive the VM unless the VM is HA or being migrated; prevents VMs from starting | None | cluster_maintenance | No |
| Compatibility-Version | Host can only receive the VM if it supports the Compatibility Version of the VM | None | All | Yes |
| CPU | Host can only receive the VM if it has an equal or greater number of CPU cores than the VM. The "Count Threads as Cores" cluster setting influences this calculation | None | All | No |
| CPU-Level | Host can only receive the VM if it provides all CPU flags required by the VM CPU model setting (if different from the cluster setting). Note: VMs with pass-through CPU always validate on start, but on migration the destination host must have all the flags the VM was started with on the first host | None | All | Yes |
| CPUOverloaded | Host is filtered out if its CPU usage has been above the HighUtilization threshold for at least CpuOverCommitDurationMinutes | HighUtilization, CpuOverCommitDurationMinutes | All | No |
| CpuPinning | Host can only receive the VM if it can satisfy the VM's CPU pinning requirements | None | All | Yes |
| Emulated-Machine | Host can only receive the VM if it provides the required emulated machine type | None | All | No |
| HA | Host can receive the Hosted Engine VM only if it has a Hosted Engine HA score higher than zero and equal to or higher than the score of the host currently running it | None | All | No |
| HostDevice | Host can receive the VM only if it has passthrough enabled and can provide the host devices required by the VM | None | All | Yes |
| HostedEngineSpares | Host can receive the VM only if enough memory remains available in the cluster to provide fail-over capability for the Hosted Engine VM should the current host fail | HeSparesCount | All | No |
| HugePages | If the VM is configured to use huge pages, the host can only receive the VM if it can provide the number of huge pages required for the VM memory | None | All | No |
| InClusterUpgrade | Host can only receive the VM if its software version is equal to or higher than that of the host currently running the VM | None | unused (legacy policy) | No |
| MDevice | If the VM is configured with mediated devices (mDev), the host can only run the VM if it can provide the required mDev | None | All | Yes |
| Memory | Host can only run the VM if it has enough free physical and scheduling memory to satisfy the VM's memory requirements | None | All | No |
| Migration | Prevents migration to the same host, for example in case of a DNS error | None | All | No |
| Migration-Tsc-Frequency | If the VM is a High Performance VM, the host can receive it only if it has the same TSC frequency as the host currently running the VM | None | All | No |
| Network | Host can only receive the VM if it can provide all the networks required by the VM's NICs and display network | None | All | No |
| NUMA | If the VM is configured with vNUMA in strict or interleave mode, the host can only receive the VM if its NUMA nodes can accommodate the VM's vNUMA nodes in terms of resources | None | All | No |
| PinToHost | Host can only receive the VM if the VM is pinned to that host | None | All | Yes |
| Swap | Host can only receive the VM if it is not swapping above the threshold | MaximumAllowedSwapUsage | All | No |
| VmAffinityGroups | Host can only run the VM if doing so does not break hard VM affinity rules | None | All | No |
| VM leases ready | If the VM is configured with storage leases, the host can only receive the VM if it is ready to support VM leases | None | All | Yes |
| VmToHostsAffinityGroups | Host can only run the VM if doing so does not break hard VM-to-host affinity rules | None | All | No |
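As a concrete illustration of a filter, the Memory filter's check from the table can be modeled as below. The function and field names are hypothetical, and real scheduling memory accounting in the engine involves overcommit and reservations not shown here.

```python
def memory_filter(vm, host):
    """Toy model of the Memory filter: the host passes only if both its
    free physical memory and its free scheduling memory can satisfy the
    VM's memory requirement (all values in MiB)."""
    return (host["free_physical_mem"] >= vm["mem"]
            and host["free_scheduling_mem"] >= vm["mem"])

host = {"free_physical_mem": 32768, "free_scheduling_mem": 16384}
print(memory_filter({"mem": 8192}, host))   # True
print(memory_filter({"mem": 24576}, host))  # False: scheduling memory too low
```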

Weights Details

The available weights are described in the table below. Note that MaxSchedulerWeight, referenced in the table, is set to 1000 by default, and the lowest score is best.

| Weight | Description | Parameters | Scoring | Used by default policy [weight] |
| --- | --- | --- | --- | --- |
| CPU and NUMA pinning compatibility | If the VM has vNUMA and pinning, prefers hosts where the CPU pinning will not clash with the vNUMA pinning | None | 1 if it fits, MaxSchedulerWeight if it doesn't | All [5] |
| CPU for high performance VMs | Prefers hosts that have an equal or greater number of sockets, cores, and threads than the VM | None | 1 if it fits, MaxSchedulerWeight if not | All [4] |
| Fit VM to single host NUMA node | If the VM does not have vNUMA, prefers hosts that can fit the VM in a single physical NUMA node | None | 1 if it fits, MaxSchedulerWeight if not | - |
| HA | For the Hosted Engine VM, prefers hosts with a higher HA score | None | Normalizes the HA scores to a value between 1 and MaxSchedulerWeight, returning 1 for the highest HA score and MaxSchedulerWeight for the lowest | All [1] |
| InClusterUpgrade | Prefers migrating the VM to hosts with an OS version newer than or equal to that of the host running the VM, penalizing hosts with an older OS | None | 0 if the host OS is the same or more recent, 100,000 if at least the same major release, 1,000,000 if even older | unused (legacy policy) |
| OptimalForCpuEvenDistribution | Prefers hosts with lower CPU usage | None | 1 for the host(s) with the lowest CPU usage, increasing up to MaxSchedulerWeight for the highest CPU usage | none [2], cluster_maintenance [2], evenly_distributed [2] |
| OptimalForCpuPowerSaving | Prefers hosts with higher CPU usage, as long as it is below HighUtilization | HighUtilization | 1 for the host(s) with the highest CPU usage, increasing up to MaxSchedulerWeight for the lowest CPU usage, and MaxSchedulerWeight if CPU usage is above HighUtilization | power_saving [2] |
| OptimalForEvenGuestDistribution | Prefers hosts with fewer running VMs | SpmVmGrace | The number of VMs running on the host, offset by SpmVmGrace if the host is the SPM | vm_evenly_distributed [2] |
| OptimalForHaReservation | For clusters with HA reservation enabled and HA VMs, prefers hosts with fewer HA VMs | ScaleDown | 0 if the VM is not HA or cluster HA reservation is disabled; otherwise the count of HA VMs per host, normalized between 0 and 100 and divided by ScaleDown | All [1] |
| OptimalForMemoryEvenDistribution | Prefers hosts with more available memory | None | Divides the host's scheduling memory by the maximum scheduling memory of any host in the cluster, multiplies by MaxSchedulerWeight, and subtracts the result from MaxSchedulerWeight, so the host with the most available memory scores 0 and hosts score closer to MaxSchedulerWeight as available memory decreases | none [1], evenly_distributed [1], cluster_maintenance [1] |
| OptimalForMemoryPowerSaving | Prefers hosts with higher memory usage, within the under- and over-utilization limits | MaxFreeMemoryForOverUtilized, MinFreeMemoryForUnderUtilized | The scheduling memory normalized between 1 and MaxSchedulerWeight, but a high value if free memory is too low (under the limit) | power_saving [1] |
| PreferredHosts | Prefers hosts that the VM is pinned to | None | 0 if the VM is pinned to the host, 10,000 if not | All [99] |
| VmAffinityGroups | Prefers hosts that satisfy VM-to-VM soft affinity rules | None | Ranks hosts from 1 to N based on how many VM-to-VM soft affinity group rules would break, 1 being the host that breaks the fewest rules and N the most. Breaking one higher-priority rule outweighs breaking N lower-priority rules | All [1] |
| VmToHostsAffinityGroups | Prefers hosts that satisfy VM-to-host soft affinity rules | None | Ranks hosts from 1 to N based on how many VM-to-host soft affinity group rules would break, 1 being the host that breaks the fewest rules and N the most. Breaking one higher-priority rule outweighs breaking N lower-priority rules | All [20] |
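The OptimalForMemoryEvenDistribution scoring described in the table works out to a simple formula. The sketch below follows that description; the function name, the use of rounding, and the sample memory figures are assumptions for illustration.

```python
MAX_SCHEDULER_WEIGHT = 1000  # engine default mentioned above

def memory_even_distribution_score(host_sched_mem, max_sched_mem):
    """Score per the OptimalForMemoryEvenDistribution description:
    divide the host's scheduling memory by the cluster maximum,
    multiply by MaxSchedulerWeight, subtract from MaxSchedulerWeight.
    The host with the most available memory scores 0; scores grow
    toward MaxSchedulerWeight as available memory shrinks."""
    ratio = host_sched_mem / max_sched_mem
    return MAX_SCHEDULER_WEIGHT - round(ratio * MAX_SCHEDULER_WEIGHT)

mems = {"host2": 8192, "host3": 16384}  # MiB of scheduling memory
max_mem = max(mems.values())
scores = {h: memory_even_distribution_score(m, max_mem) for h, m in mems.items()}
print(scores)  # {'host2': 500, 'host3': 0} - host3 is preferred (lowest score)
```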

Load Balancer Details

Please see below how each Load Balancer uses these properties to schedule VMs.

| Load Balancer | Description and Parameters | Used by default policy |
| --- | --- | --- |
| None | After the VM is started, no further scheduling is done by the cluster policy, but the VM may still be migrated for other reasons such as host issues | none, cluster_maintenance |
| OptimalForEvenGuestDistribution | If a host has more than HighVmCount VMs running, or if the difference between the host with the most running VMs and the host with the fewest is greater than MigrationThreshold VMs, migration is started. For score calculation and scheduling on the SPM host, SpmVmGrace is added to the number of running VMs | vm_evenly_distributed |
| OptimalForEvenDistribution | If a host's CPU usage is higher than HighUtilization for more than CpuOverCommitDurationMinutes, migration starts | evenly_distributed |
| OptimalForPowerSaving | Concentrates load on a subset of hosts while keeping their CPU load below HighUtilization. For hosts whose CPU usage drops below LowUtilization, all remaining VMs are migrated away and the host can be shut down | power_saving |
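The OptimalForEvenGuestDistribution trigger conditions above can be expressed as a short predicate. The parameter names come from the table; the function itself and the example VM counts are illustrative, and the SpmVmGrace offset is omitted for brevity.

```python
def needs_balancing(vm_counts, high_vm_count, migration_threshold):
    """Toy model of the OptimalForEvenGuestDistribution trigger:
    migration starts if any host runs more than HighVmCount VMs, or if
    the spread between the busiest and the least busy host exceeds
    MigrationThreshold VMs."""
    counts = list(vm_counts.values())
    return (max(counts) > high_vm_count
            or max(counts) - min(counts) > migration_threshold)

# Triggers: host2 exceeds HighVmCount AND the spread (8) exceeds the threshold.
print(needs_balancing({"host2": 12, "host3": 4},
                      high_vm_count=10, migration_threshold=5))  # True
# Does not trigger: no host over 10 VMs, spread of 2 is within the threshold.
print(needs_balancing({"host2": 6, "host3": 4},
                      high_vm_count=10, migration_threshold=5))  # False
```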

Diagnostic Steps

To see or debug scheduling, scores, and ranking in RHV, enable DEBUG logging on the org.ovirt.engine.core.bll.scheduling category. Below is an example:

Scenario:

  • Cluster with just 2 hypervisors
  • Simplified scheduling policy, the only weight is VmAffinityGroups, with a factor of 10.
  • 2 VMs, with negative affinity.

1. Run the first VM, here is the debug output:

2021-03-08 21:21:48,365Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-386) [a1d49e42-ce62-4067-a68d-de7f9e9c66c9] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;1;1;1;1

The above is to be read as a table; the 84e6ddee row is the only weight configured in the simplified scheduling policy. That ID corresponds
to VmAffinityWeightPolicyUnit (named VmAffinityGroups). The 1b479109 and f79abb53 columns are the hypervisors:

[RHV-M]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT vds_id,vds_name from vds";
                vds_id                |              vds_name              
--------------------------------------+------------------------------------
 1b479109-b38e-465b-a8a5-439d79cf43e9 | host2.kvm.local
 f79abb53-4374-4d5a-8fb0-f1fde5030643 | host3.kvm.local

[SOURCE CODE]$ grep -rn 84e6ddee-ab0d-42dd-82f0-c297779db567
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/VmAffinityWeightPolicyUnit.java:34:        guid = "84e6ddee-ab0d-42dd-82f0-c297779db567",

Under each host there is "1;1", which reads as <score;rank>.

So the log above can be read as:

| Weight Module | Weight | host2 score | host2 rank | host3 score | host3 rank |
| --- | --- | --- | --- | --- | --- |
| VmAffinityGroups | 10 | 1 | 1 | 1 | 1 |

So in this case both hosts got the same rank, as rank ties are allowed.

2. This VM was started on host host2.kvm.local.

3. Now run the second VM, which has negative affinity with the first VM just started on host2.kvm.local.

Here is the debug output again:

2021-03-08 21:21:59,781Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-393) [0dc5eb6f-88cc-4d5f-8b15-001b5bbf0511] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1

Interpreting the same way as before, it means:

| Weight Module | Weight | host2 score | host2 rank | host3 score | host3 rank |
| --- | --- | --- | --- | --- | --- |
| VmAffinityGroups | 10 | 0 | 2 | 1 | 1 |

4. This time there is a clear preference: host3.kvm.local got rank 1 and host2.kvm.local got rank 2, so the VM started on host3.kvm.local. This was due to the negative affinity, keeping the VMs separate as configured.
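When reading many of these debug lines, a small helper can turn the semicolon-separated "Ranking selector" output into something structured. This parser assumes the two-line format seen in the examples above: a header row listing host IDs (each ID spanning a score column and a rank column), followed by one row per weight module.

```python
def parse_ranking_selector(header, row):
    """Parse one weight row of a 'Ranking selector' debug dump into
    {weight, factor, hosts: {host_id: {score, rank}}}."""
    fields = header.rstrip(";").split(";")
    hosts = [f for f in fields[2:] if f]  # skip '*', 'factor' and empty pads
    parts = row.rstrip(";").split(";")
    weight_id, factor = parts[0], int(parts[1])
    pairs = parts[2:]  # alternating score;rank per host
    result = {"weight": weight_id, "factor": factor, "hosts": {}}
    for i, host in enumerate(hosts):
        result["hosts"][host] = {"score": int(pairs[2 * i]),
                                 "rank": int(pairs[2 * i + 1])}
    return result

# The second run's output from the walkthrough above:
header = "*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;"
row = "84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1"
info = parse_ranking_selector(header, row)
print(info["hosts"]["1b479109-b38e-465b-a8a5-439d79cf43e9"])  # {'score': 0, 'rank': 2}
print(info["hosts"]["f79abb53-4374-4d5a-8fb0-f1fde5030643"])  # {'score': 1, 'rank': 1}
```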


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.