RHV: How does cluster scheduling policy work?
Environment
- Red Hat Virtualization 4.4
Issue
- How does cluster scheduling policy work?
Resolution
A cluster policy works by using a set of rules to determine how to schedule virtual machines amongst the hosts in the cluster.
A Virtual Machine is scheduled when it is started or migrated. The scheduling process determines a host for the VM to start on or migrate to. It works in 3 steps:
1. Filters - One or more enabled filter modules rule out hosts based on specific criteria.
2. Weights - One or more enabled weight modules calculate scores for the remaining hosts, from most suitable (lowest score) to least suitable (highest score).
3. Ranking - Based on the scores, the hosts are ranked from first to last for each weight module. The host with the lowest total accumulated rank is picked and receives the VM.
Finally, if a Load Balancer is in use by the policy, it can trigger automatic migrations of already running VMs in order to accomplish the objective of the Load Balancer.
The Load Balancer rules are checked every 1 minute by default, while the Filters, Weights and Ranking logic happens on demand when Virtual Machines need to be scheduled, either by power on or live migrations.
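The filter, weight and ranking steps above can be sketched in a few lines of Python. This is a simplified illustration, not oVirt code; the module behavior, the factor handling, and all example data are assumptions:

```python
# Illustrative sketch of the filter -> weight -> rank flow described above.
# Not oVirt code; the factor handling and example data are assumptions.

def schedule(vm, hosts, filters, weights):
    # Step 1: Filters rule hosts out entirely.
    candidates = [h for h in hosts if all(f(vm, h) for f in filters)]

    # Steps 2 and 3: each weight module scores the candidates (lower is
    # better), hosts are ranked per module, and the ranks (scaled here by
    # the module's factor, an assumption) are accumulated.
    totals = {h: 0 for h in candidates}
    for factor, score_fn in weights:
        scores = {h: score_fn(vm, h) for h in candidates}
        ranked = sorted(candidates, key=lambda h: scores[h])
        for rank, host in enumerate(ranked, start=1):
            totals[host] += rank * factor
    # The host with the lowest accumulated total receives the VM.
    return min(candidates, key=lambda h: totals[h], default=None)

# Hypothetical cluster: host3 fails a filter, host2 scores best on the
# single weight module, so host2 receives the VM.
hosts = ["host1", "host2", "host3"]
vm = {"memory_mb": 4096}
filters = [lambda vm, h: h != "host3"]  # pretend host3 lacks free memory
weights = [(1, lambda vm, h: {"host1": 5, "host2": 2, "host3": 9}[h])]
```

The `default=None` covers the case where the filters rule out every host, which in RHV surfaces as a scheduling failure rather than a silent no-op.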
So essentially, each policy has 3 sets of logic rules. Their names and descriptions are in the table below:
| Module | Description |
|---|---|
| Filters | These rules determine whether a certain host is considered when deciding where to run a VM. They filter hosts based on requirements such as minimum CPU or RAM. If a host does not meet the minimum requirements, it is not even considered for running the VM. |
| Weights | After some hosts are ruled out by the filters above, the remaining ones are scored and ranked, and the one with the lowest rank receives the VM. These weight rules specify which attributes are taken into account and how important (weight) they are when calculating the hosts' scores. For example, HA weights hosts according to their availability score. If the HA weight is higher than the others, it has a greater impact on the score. |
| Load Balancer | This logic determines which hosts are under- or over-utilized, and initiates migrations accordingly. For example, the OptimalForPowerSaving balancer is used to consolidate loads onto fewer hosts. Each balancer has a few properties to further define and customize its behavior. The Load Balancer will not start or stop VMs, only initiate migrations to satisfy its rules. |
The policies using the 3 criteria above can be customized and/or created in Admin Portal->Configure->Cluster Policies. However, RHV ships with 5 pre-defined policies that should cover most use cases: none, cluster_maintenance, evenly_distributed, vm_evenly_distributed and power_saving. Please see the explanations for each of the default policies in the RHV 4.4 Documentation - Technical Reference - LOAD BALANCING, SCHEDULING, AND MIGRATION
Filters Details
The available filters are described in the table below. Please note that Mandatory filters (last column in the table below) are always enabled and not available for user configuration.
| Filter | Description | Parameters | Used by default policy | Mandatory |
|---|---|---|---|---|
| ClusterInMaintenance | Host cannot receive the VM unless the VM is HA or being Migrated, prevents VMs from starting | None | cluster_maintenance | No |
| Compatibility-Version | Host can only receive the VM if it supports the Compatibility Version of the VM | None | All | Yes |
| CPU | Host can only receive the VM if it has equal or more CPU cores than the VM. The "Count Threads as Cores" cluster setting influences this calculation | None | All | No |
| CPU-Level | Host can only receive the VM if it provides all CPU flags required by the VM CPU model setting (if different from Cluster setting). Note: VMs with Pass-Through CPU will always validate on start, but on migration the destination host must have all the flags that the VM was started with on the first host | None | All | Yes |
| CPUOverloaded | Host is filtered out if CPU usage is above the HighUtilization threshold for at least CpuOverCommitDurationMinutes minutes | HighUtilization, CpuOverCommitDurationMinutes | All | No |
| CpuPinning | Host can only receive the VM if it can satisfy VM CPU pinning requirements | None | All | Yes |
| Emulated-Machine | Host can only receive the VM if it provides the required emulated machine type | None | All | No |
| HA | Host can receive the Hosted-Engine VM only if it has a Hosted-Engine HA score higher than zero and equal or higher than the score of the current host running it | None | All | No |
| HostDevice | Host can receive the VM only if it has passthrough enabled and can provide the host devices required by the VM | None | All | Yes |
| HostedEngineSpares | Host can receive the VM only if there will still be enough available memory in the cluster to provide fail-over capability for the Hosted-Engine VM in case the current host fails | HeSparesCount | All | No |
| HugePages | In case the VM is configured to use Huge Pages, the host can only receive the VM if it can provide the number of Huge Pages necessary for the VM memory | None | All | No |
| InClusterUpgrade | Host can only receive the VM if its software version is equal or higher than the current host running the VM | None | unused, legacy policy | No |
| MDevice | In case the VM is configured with mediated devices (mDev), host can only run the VM if it is able to provide the required mDev | None | All | Yes |
| Memory | Host can only run the VM if it has enough free physical and scheduling memory to satisfy the VM memory requirements | None | All | No |
| Migration | Prevents migration to the same host, in case of DNS error for example | None | All | No |
| Migration-Tsc-Frequency | If the VM is of type High Performance, the Host can receive the VM only if it has the same TSC frequency of the host currently running the VM | None | All | No |
| Network | Host can only receive the VM if it can provide all the networks required by the VMs NICs and Display Network | None | All | No |
| NUMA | If the VM is configured with vNUMA strict or interleave, the Host can only receive the VM if its NUMA nodes can accommodate the Virtual Machine's vNUMA nodes in terms of resources | None | All | No |
| PinToHost | Host can only receive the VM if the VM is pinned to that host | None | All | Yes |
| Swap | Host can only receive the VM if it is not swapping above the threshold | MaximumAllowedSwapUsage | All | No |
| VmAffinityGroups | Host can only run the Virtual Machine if it does not break hard affinity rules for VMs | None | All | No |
| VM leases ready | In case the VM is configured with storage leases, the host can only receive the VM if it is ready to support VM leases | None | All | Yes |
| VmToHostsAffinityGroups | Host can only run the Virtual Machine if it does not break hard affinity rules with Hosts | None | All | No |
Weights Details
The available weights are described in the table below. Note that MaxSchedulerWeight, referenced in the table below, is set to 1000 by default, and the lowest score is best.
| Weight | Description | Parameters | Scoring | Used by default policy[weight] |
|---|---|---|---|---|
| CPU and NUMA pinning compatibility | If the VM has vNUMA and pinning, prefers hosts where the CPU pinning will not clash with the vNUMA pinning | None | 1 if it fits or MaxSchedulerWeight if it doesn't | All[5] |
| CPU for high performance VMs | Prefers hosts that have more or equal number of sockets, cores and threads than the VM | None | 1 if it fits or MaxSchedulerWeight if not | All[4] |
| Fit VM to single host NUMA node | If the VM does not have vNUMA, prefers hosts that can fit the VM in a single physical NUMA node | None | 1 if it fits or MaxSchedulerWeight if not | - |
| HA | For the HostedEngine VM, prefers hosts that have higher HA Score | None | Normalizes the HA Scores to a value between 1 and MaxSchedulerWeight, returning score 1 to the highest HA score and MaxSchedulerWeight to the lowest HA score | All[1] |
| InClusterUpgrade | Prefers migrating VM to host with newer or equal OS versions than the host running the VM, penalizing hosts with older OS | None | 0 if the host OS is same or more recent or 100,000 if at least same major release or 1,000,000 if even older | unused, legacy policy |
| OptimalForCpuEvenDistribution | Prefers hosts with lower CPU usage | None | 1 for the host(s) with the lowest CPU usage, increasing up to MaxSchedulerWeight for the highest CPU usage | none[2], cluster_maintenance[2] and evenly_distributed[2] |
| OptimalForCpuPowerSaving | Prefers hosts with higher CPU usage, but below HighUtilization | HighUtilization | 1 for the host(s) with the highest CPU usage, increasing up to MaxSchedulerWeight for the lowest CPU usage, and MaxSchedulerWeight if CPU usage is above HighUtilization | power_saving[2] |
| OptimalForEvenGuestDistribution | Prefers host with less VMs running | SpmVmGrace | The number of VMs running on the host as the score, offset by SpmVmGrace if the Host is SPM | vm_evenly_distributed[2] |
| OptimalForHaReservation | For clusters with HA reservation enabled and HA VMs, prefers hosts with fewer HA VMs | ScaleDown | 0 if the VM is not HA or Cluster HA reservation is disabled, otherwise returns the count of HA VMs per host, normalized between 0 and 100 and divided by ScaleDown | All[1] |
| OptimalForMemoryEvenDistribution | Prefers hosts with more available memory | None | Divide the scheduling memory of the host by the maximum scheduling memory of any host in the cluster, then multiply by MaxSchedulerWeight and subtract from MaxSchedulerWeight. So that the host with highest memory gets score 0 and hosts with less available memory get scores towards MaxSchedulerWeight as the memory lowers | none[1], evenly_distributed[1] and cluster_maintenance[1] |
| OptimalForMemoryPowerSaving | Prefers hosts with higher memory usage, but within the under- and over-utilization limits | MaxFreeMemoryForOverUtilized, MinFreeMemoryForUnderUtilized | The scheduling memory normalized between 1 and MaxSchedulerWeight, but a high value if the memory is too low (under the limit) | power_saving[1] |
| PreferredHosts | Prefer hosts that the VM is pinned to | None | 0 if the VM is pinned to host or 10,000 if not | All[99] |
| VmAffinityGroups | Prefers hosts that satisfy VM to VM soft affinity rules | None | Rank of hosts from 1 to N, based on how many VM to VM soft affinity group rules would break. Score 1 is the host that breaks the fewest rules and N the most. For ranking, breaking one higher-priority rule is more important than breaking N lower-priority rules | All[1] |
| VmToHostsAffinityGroups | Prefers hosts that satisfy VM to Host soft affinity rules | None | Rank of hosts from 1 to N, based on how many VM to Host soft affinity group rules would break. Score 1 is the host that breaks the fewest rules and N the most. For ranking, breaking one higher-priority rule is more important than breaking N lower-priority rules | All[20] |
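As an illustration of how a weight score is derived, here is a hedged sketch of the OptimalForMemoryEvenDistribution calculation described in the table above. The host memory figures are hypothetical; only MaxSchedulerWeight = 1000 comes from this article:

```python
# Hedged sketch of the OptimalForMemoryEvenDistribution scoring described
# above. MAX_SCHEDULER_WEIGHT matches the default mentioned in this article;
# the host memory figures are hypothetical.
MAX_SCHEDULER_WEIGHT = 1000

def memory_even_score(host_sched_mem, all_sched_mem):
    # Divide the scheduling memory of the host by the maximum scheduling
    # memory of any host, multiply by MaxSchedulerWeight, and subtract
    # from MaxSchedulerWeight.
    max_mem = max(all_sched_mem)
    return MAX_SCHEDULER_WEIGHT - round(host_sched_mem / max_mem * MAX_SCHEDULER_WEIGHT)

mems = [64000, 32000, 16000]  # scheduling memory (MB) of three hosts
scores = [memory_even_score(m, mems) for m in mems]
# The host with the most scheduling memory scores 0 (best); scores rise
# toward MAX_SCHEDULER_WEIGHT as available memory shrinks.
```

With these numbers the 64 GB host scores 0, the 32 GB host 500, and the 16 GB host 750, so VMs land preferentially on the emptiest host.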
Load Balancer Details
Please see below how each Load Balancer uses these properties to schedule VMs.
| Load Balancer | Description and Parameters | Used by default policy |
|---|---|---|
| None | After the VM is started, no further scheduling is done by the cluster policy, but the VM may still be migrated for other reasons, such as host issues | none, cluster_maintenance |
| OptimalForEvenGuestDistribution | If a host has more than HighVmCount VMs running, or if the difference between the host with the most running VMs and the host with the fewest is greater than MigrationThreshold VMs, then migration is started. For score calculation and scheduling on the SPM host, SpmVmGrace is added to the number of running VMs | vm_evenly_distributed |
| OptimalForEvenDistribution | If a host's CPU usage is higher than HighUtilization for more than CpuOverCommitDurationMinutes, then migration starts | evenly_distributed |
| OptimalForPowerSaving | Concentrates load on a subset of hosts while keeping their CPU load below HighUtilization. For hosts whose CPU usage drops below LowUtilization, all remaining VMs are migrated away and the host can be shut down | power_saving |
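The trigger condition of the OptimalForEvenDistribution balancer can be sketched as follows. This is a simplified illustration; the sampling granularity and threshold values are assumptions, not engine code:

```python
# Hedged sketch of the OptimalForEvenDistribution trigger described above:
# a host whose CPU usage stays above HighUtilization for longer than
# CpuOverCommitDurationMinutes becomes a migration source. Parameter names
# match the table; the per-minute sampling here is illustrative only.
HIGH_UTILIZATION = 80          # percent (assumed value)
CPU_OVERCOMMIT_DURATION = 2    # minutes (assumed value)

def needs_migration(cpu_samples_per_minute):
    """cpu_samples_per_minute: one CPU-usage sample (%) per minute, newest last."""
    recent = cpu_samples_per_minute[-CPU_OVERCOMMIT_DURATION:]
    return (len(recent) >= CPU_OVERCOMMIT_DURATION
            and all(s > HIGH_UTILIZATION for s in recent))
```

For example, `needs_migration([50, 85, 90])` is true (over the threshold for the whole window), while `needs_migration([90, 85, 50])` is false because the latest sample dropped back below it.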
Diagnostic Steps
To see or debug scheduling decisions, scores and ranking in RHV, enable debug logging on org.ovirt.engine.core.bll.scheduling. Below is an example:
Scenario:
- Cluster with just 2 hypervisors
- Simplified scheduling policy, the only weight is VmAffinityGroups, with a factor of 10.
- 2 VMs, with negative affinity.
1. Run the first VM, here is the debug output:
2021-03-08 21:21:48,365Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-386) [a1d49e42-ce62-4067-a68d-de7f9e9c66c9] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;1;1;1;1
The above is to be read as a table; the 84e6ddee row is the only weight configured on the simplified scheduling policy. That ID corresponds
to VmAffinityWeightPolicyUnit (named VmAffinityGroups). The 1b479109 and f79abb53 columns are the hypervisors:
[RHV-M]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT vds_id,vds_name from vds";
vds_id | vds_name
--------------------------------------+------------------------------------
1b479109-b38e-465b-a8a5-439d79cf43e9 | host2.kvm.local
f79abb53-4374-4d5a-8fb0-f1fde5030643 | host3.kvm.local
[SOURCE CODE]$ grep -rn 84e6ddee-ab0d-42dd-82f0-c297779db567
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/VmAffinityWeightPolicyUnit.java:34: guid = "84e6ddee-ab0d-42dd-82f0-c297779db567",
And under each host we have "1;1". This is <score;rank>
So we can read the log above as:
| Weight Module | Weight | host2-score | host2-rank | host3-score | host3-rank |
|---|---|---|---|---|---|
| VmAffinityGroups | 10 | 1 | 1 | 1 | 1 |
So in this case both hosts got the same rank, as rank ties are allowed.
2. This VM was started on host host2.kvm.local.
3. Now run the second VM, which has negative affinity to the first VM on host2.kvm.local just started above.
Here is the debug output again:
2021-03-08 21:21:59,781Z DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.RankSelectorPolicyUnit] (EE-ManagedThreadFactory-engine-Thread-393) [0dc5eb6f-88cc-4d5f-8b15-001b5bbf0511] Ranking selector:
*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;
84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1
Interpreting the same way as before, it means:
| Weight Module | Weight | host2-score | host2-rank | host3-score | host3-rank |
|---|---|---|---|---|---|
| VmAffinityGroups | 10 | 0 | 2 | 1 | 1 |
4. This time there is a clear preference, as host3.kvm.local got rank 1 and host2.kvm.local got rank 2, so the VM started on host3.kvm.local. This was due to the negative affinity, so the VMs are kept separate as configured.
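The Ranking selector rows shown in the debug output above can be parsed mechanically. Below is a minimal sketch; the column layout (unit ID; factor; then score;rank per host) is inferred from the log samples in this article, so verify it against your own engine.log before relying on it:

```python
# Minimal parser for the "Ranking selector" debug rows shown above. The
# column layout is inferred from the log samples in this article.
def parse_rank_row(header, row):
    hosts = [c for c in header.split(";")[2:] if c]  # host UUID columns
    cells = row.split(";")
    unit_id, factor = cells[0], int(cells[1])
    ranks = {}
    for i, host in enumerate(hosts):
        score, rank = int(cells[2 + 2 * i]), int(cells[3 + 2 * i])
        ranks[host] = (score, rank)
    return unit_id, factor, ranks

# The second scheduling run from the article:
header = "*;factor;1b479109-b38e-465b-a8a5-439d79cf43e9;;f79abb53-4374-4d5a-8fb0-f1fde5030643;"
row = "84e6ddee-ab0d-42dd-82f0-c297779db567;10;0;2;1;1"
unit_id, factor, ranks = parse_rank_row(header, row)
# host3 (f79abb53...) gets score 1, rank 1, so it wins this scheduling run
```

This yields factor 10 and, per host, the same (score, rank) pairs shown in the table above.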
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.