OpenShift Virtualization - Tuning & Scaling Guide
Overview
Goal
OpenShift Container Platform defaults are designed to work well out of the box. Many pod and VM scenarios and recommendations are documented in Scalability and Performance. In addition, virtual machine (VM) templates, by default, include a High-performance workload type.
This tuning guide is a supplemental document. It focuses on fine-tuning cluster scalability and VM performance for several use cases and environments, for a variety of workloads and cluster sizes. This document provides guidance for scaling up and scaling out an environment, up to 100 OpenShift nodes based on this reference architecture, and for finely tuning each component of a VM definition. Additional OpenShift Virtualization cluster guidance can also be found in the Reference Implementation Guide.
Note
Determining scaling recommendations and limitations is heavily dependent on the exact use case. The impact of scaling each cluster dimension should be evaluated against workload performance needs. Large cluster sizes reaching beyond 100 - 150 nodes might be better managed as separate control planes using Red Hat Advanced Cluster Management.
Guide: Cluster configurations
In general, there are a few configuration details that should be considered for larger scale clusters, regardless of whether workloads run as pods or VMs. This section summarizes some key areas for general cluster performance.
Etcd
It is very important to use high-speed storage for master nodes running etcd, especially for larger clusters. We recommend at least SSD (preferably NVMe)-backed local storage.
Your hardware should be capable of providing storage performance such that etcd write-ahead-log fsync duration and network peer round trip time complete within the recommended thresholds. See the following guides on etcd performance:
- Recommended etcd practices
- Backend performance requirements for OpenShift etcd
- How to use 'fio' to check etcd disk performance in OCP
Infrastructure components
Restricting critical cluster components to “infra” nodes allows them to properly grow along with cluster/workload scalability. It also provides isolated resources to prevent disruption of worker node performance. See Infrastructure Nodes in OpenShift 4 for configuring components such as the router, registry, monitoring, and logging.
Monitoring
In addition to “infra” node isolation, consider configuring persistent storage for Prometheus and AlertManager.
This provides two advantages:
- Metric and Alert data will survive node/pod restarts, up to the configured retention time.
- PromDB will not fill up the RHCOS sysroot partition, which can cause a node to become unavailable.
See Configuring the monitoring stack in the OpenShift documentation for details.
Note that PromDB growth rates, especially in terms of memory usage and storage size, depend on many factors including pod churn rates, VM start/stop rates, number of VM migrations, number of PVCs, and configured retention time.
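A sketch of requesting persistent storage for Prometheus through the cluster-monitoring-config ConfigMap; the storage class name, size, and retention below are placeholders to adapt to the environment:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: <storage-class>  # placeholder, use an available class
          resources:
            requests:
              storage: 100Gi  # size against expected PromDB growth and retention
```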
Tested maximums
See the documented limits to understand the current tested maximums.
Worker MCP configuration
If you have a large number of nodes and want to reduce serial reboot times, consider increasing the maxUnavailable setting in the worker MachineConfigPool to speed up roll-outs of MachineConfig changes and cluster updates. This setting controls how many workers can be drained in parallel. Note that if multiple worker MCPs are created, all pools can drain in parallel by default; the draining behavior can instead be controlled by pausing or unpausing each pool.
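For example, maxUnavailable can be raised in the worker pool spec (the value here is illustrative; size it against your PodDisruptionBudgets and spare capacity):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  maxUnavailable: 10%  # illustrative; can also be an absolute node count
```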
Some considerations when increasing the number of unavailable nodes:
If any installed operators or deployments have set a PodDisruptionBudget, it is important to have enough available nodes to meet that budget. To list all cluster PDBs, run the following command:
$ oc get pdb -A
If nodes are providing OpenShift Data Foundation storage, MCPs can be created to match the data replica topology so that the PDB allows parallel node drains within a single pool. To do so, first label nodes with the three “rack” topology labels before the StorageCluster install:
$ oc label node {$nodeA} topology.rook.io/rack=rack0
$ oc label node {$nodeB} topology.rook.io/rack=rack1
$ oc label node {$nodeC} topology.rook.io/rack=rack2
Then create three node roles and pools to match the three rack topology labels.
If the cluster is running migratable VMs, the migration resulting from any node drain will require adequate memory (and CPU) request capacity to allow a new destination virt-launcher pod to be scheduled on a new worker node. Confirm that the combined available node spare request capacity is sufficient to account for the (total number of nodes draining in parallel) x (total number of VMs allowed to migrate in parallel, based on the cluster migration limits) x (VM mem/cpu request). See Configuring live migration limits and timeouts in the OpenShift documentation.
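As a rough back-of-envelope check, the required spare capacity is the product of the three factors above; the numbers below are purely illustrative:

```shell
# Hypothetical values: 2 nodes draining in parallel, a cluster limit of
# 5 parallel migrations, and an 8 GiB memory request per VM.
nodes_draining=2
parallel_migrations=5
vm_memory_gib=8

spare_gib=$(( nodes_draining * parallel_migrations * vm_memory_gib ))
echo "Required spare memory request capacity: ${spare_gib} GiB"
```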
Scheduling
By default, Kubelet reports a list of container images residing on each node, sorted from largest to smallest, up to the nodeStatusMaxImages count. This limit exists to ensure API objects do not become very large. The image list is used to determine a scheduling score for the imageLocality plugin, which OpenShift uses as part of the overall default pod scheduling score.
To ensure balanced node scheduling on clusters that can potentially have many container images, it is recommended to disable the nodeStatusMaxImages count so that the imageLocality scheduler plugin does not “override” the nodeResource scheduler scores when there are more than the default number of container images on a node (50). See the note in Managing nodes for details. To do so, apply a KubeletConfig that disables the nodeStatusMaxImages value.
Label worker nodes for the KubeletConfig change:
$ oc label machineconfigpool worker custom-kubelet=max-node-images
Apply the KubeletConfig change:
Note: this will reboot all workers to apply a newly generated MachineConfig.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: max-node-images
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: max-node-images
  kubeletConfig:
    nodeStatusMaxImages: -1
Max pod limit
When calculating total desired object density, consider that the default maxPod limit of 250 pods per node includes cluster and operator pods. It is possible to raise the maxPods value, but keep in mind increasing pod density can lead to additional control plane stress.
Also plan for the fact that VM migration will temporarily create an additional pod during the migration process; factor this into density planning based on how many total VMs will be allowed to migrate at once.
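A sketch of raising the limit with a KubeletConfig, following the same labeling pattern used elsewhere in this guide (the value is illustrative):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-max-pods
  kubeletConfig:
    maxPods: 500  # illustrative; validate control plane headroom before raising
```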
KubeAPI Burst/QPS
Increasing the default kubeAPI Burst and Queries Per Second (QPS) values can improve performance of bulk object creation at scale.
By default the values are set to 100 and 50 respectively to keep API server compute resource utilization reasonably low to accommodate small nodes. However, when scaling to high total pod counts, it can help to double the default values using a KubeletConfig.
Label worker nodes for the KubeletConfig change:
$ oc label machineconfigpool worker custom-kubelet=set-api-rates
Apply the KubeletConfig change:
Note: this will reboot all workers to apply a newly generated MachineConfig.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-api-rates
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-api-rates
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
System Resource Reservation
By default, a small amount of CPU and memory resources are reserved for host operations on nodes using the system-reserved setting. For nodes with very large total memory, or for high density cases where the total VM/pod memory usage may grow to utilize nearly all system memory, it is recommended to enable the autoSizingReserved setting so that the reserved memory is based on the total memory size.
Note that if the SystemMemoryExceedsReservation alert is firing, that is also a good indication that the auto sizing values could be helpful in preventing memory starvation on the host node for system activities. See the documentation for instructions on how to enable this auto sizing setting. Also, this solution article explains how the auto recommendation values are calculated.
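A minimal sketch of enabling the auto-sizing setting with a KubeletConfig (the pool selector label shown is the built-in worker pool label):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
```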
ODF pod limits
OpenShift Data Foundation, when configured in internal mode, provides performance profile options that control the requests and limits that ODF pods will utilize. In some cases with high-speed storage devices this can improve performance, especially based on the resources the OSD pods can utilize to perform I/O.
In some very high scale cases, certain OpenShift Data Foundation pod memory limits may need to be raised; see the documentation on changing resources as well as this KCS article for more details.
Guide: VM tuning
Virt control plane tuning
Some tuning options are available at the Virtualization control plane level, including increasing “burst” rates to allow creation of hundreds of VMs in a single batch and adjusting the migration settings based on the workload type.
Virt rate limiting
To compensate for large scale “burst” rates, scale up the QPS (Queries per Second) and Burst RateLimits for the virt client components (api, controller, handler, and webhook). This allows more client requests or API calls to be processed concurrently for each component, preventing slow VM creation times, as can be seen by checking the ratelimiter metric (rest_client_rate_limiter_duration_seconds_bucket).
To apply the scalable tuning parameters recommended for large scale "High Burst" scenarios (i.e. creating many VMs at once or in large batches), enable this profile which applies tested and safe QPS and burst values for all Virt components:
$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=json -p='[{"op": "add", "path": "/spec/tuningPolicy", "value": "highBurst"}]'
VM migration tuning
Live migration allows a running Virtual Machine Instance (VMI) to move to another node without interrupting the workload. Migration can be helpful for a smooth transition during cluster upgrades or any time a node needs to be drained for maintenance or configuration changes.
Live migration requires the use of a shared storage solution that provides ReadWriteMany (RWX) access mode, and that VM disks are backed by volumes defined as such. OpenShift Virtualization will check that a VMI is live migratable and if so the evictionStrategy will be set to LiveMigrate. See About live migration for details.
Starting in 4.19, by default VM live migrations utilize multiple (8) network threads, known as “multifd”, which can significantly increase the network bandwidth the migration is able to achieve (likely reducing the migration completion time). Note: if the VM configures a CPU limit (or dedicatedCpuPlacement), this multithreaded behavior will not be automatically configured for the VM migration, in order to honor the CPU resource limits. Note that postCopy must be off to enable multifd.
Migration limits
Cluster-wide live migration limit settings can be adjusted based on the type of workload and migration scenario to control how many VMs migrate in parallel, how much network bandwidth can be used by each migration, and how long the migration will be attempted.
Even when multiple VMs are migrating in parallel per node, typically it is best to let the VMs use the most available bandwidth to complete migration as quickly as possible. In some cases it may be desired to “reserve” a portion of the link bandwidth for other critical traffic, in that case adjusting the bandwidthPerMigration limit can be helpful to limit the rate available for migration transfers. Note that by default the value is “0”, or unlimited.
However, if you have a very large VM running a heavy workload (for example database processing), the memory dirty rates may require higher bandwidth to complete transfer. For these heavy load migration cases, consider the following:
- Configure a single VM per node in parallel by setting parallelOutboundMigrationsPerNode to 1, so that all available bandwidth is used by a single migration at a time, and ensure that the bandwidthPerMigration limit is using the default value (“0”, or unlimited) or is set as high as possible for the link speed while allowing for other critical network traffic bandwidth needs.
- Configure a dedicated migration network if possible, ideally with one or more high speed NICs
- Test and understand how fast the workload writes to memory (the dirty rate), to establish whether the network speed is adequate. See How to estimate the dirty rate of a Virtual Machine without triggering a Live Migration in OpenShift Virtualization?
- If the dirty rate is generally close to or higher than the network speed, some migrations are unlikely to converge. In this case consider the following options:
- If the migration is configured over network bonds (2+ NICs), ensure the hashing algorithm uses Layer 4 information (TCP ports). This can drastically improve the network bandwidth by spreading the TCP connections over all NICs of the bond when multiple connections are used (multifd).
- If the migration is configured over a single or more high speed NICs (40G+), multiple connections (multifd) are likely required to make the best use of it.
- Enabling post-copy mode: allowPostCopy: true and lowering completionTimeoutPerGiB to trigger post-copy mode sooner can be useful in situations where increasing the network speed is not possible.
- If the network speed is substantially higher than the VM's dirty rate, auto-converge can be enabled as a backup option in the unlikely cases where the migration needs to be forced; it is compatible with multifd.
Note: When post-copy mode kicks in during a migration (after the initial pre-copy phase does not complete during the configured timeouts), the VM CPUs are paused on the source host to transfer the minimum required memory pages, then the VM CPUs are activated on the destination host and the remaining memory pages are faulted into the destination node at runtime which may cause some performance impact during the transfer. Post-copy mode is not recommended for critical data or unstable networks.
Note: allowAutoConverge: true will gradually throttle the Virtual Machine vCPUs until the dirty rate is low enough to force the migration to converge. The throttling starts at 20% and increases by 10% at each migration iteration. This option is not generally recommended because the vCPU throttling can lead to significant VM workload performance impacts; however, it may be useful when forcing completion of the migration is more important than maintaining workload performance. It should not be used in clusters with a slow migration network, or in scenarios where the network speed is unknown or may fluctuate, as the Virtual Machine may face performance issues while throttled. Make sure the network speeds and the workload dirty rate are well understood before using this option.
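As a sketch, these limits are adjusted in the HyperConverged resource; the values below illustrate the heavy-workload approach described above and should be tuned to the environment:

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  liveMigrationConfig:
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 1  # one migration per node at a time
    bandwidthPerMigration: "0"            # unlimited (the default)
    allowPostCopy: true                   # note: disables multifd
    completionTimeoutPerGiB: 100          # lowered to trigger post-copy sooner
```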
Migration network
By default, all VM migration network transfers occur over the cluster pod network, which will incur some overhead costs as the traffic is encrypted using Transport Layer Security (TLS). When the environment and application allows, using a dedicated additional network for VM migration can significantly improve network and migration performance. See Configuring a dedicated network for live migration for details.
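A minimal sketch of such a dedicated network, assuming a macvlan attachment on a spare host NIC (the interface name and IP range are placeholders); it is then referenced from the HyperConverged resource as spec.liveMigrationConfig.network:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: migration-network
  namespace: openshift-cnv
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "migration-bridge",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.200.5.0/24"
      }
    }
```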
VM host configuration tuning
Pinning
There are multiple pinning-related operators available at the host level: CPU Manager, Topology Manager, and Memory Manager.
CPU Manager
Enabling CPU Manager with a “static” policy on nodes allows VMs that request dedicated CPUs to be automatically pinned using a cpuset. It also removes those CPUs from the shared pool, ensuring that they are not shared with other VMs/pods.
CPU Manager can be enabled through a KubeletConfig. See Using CPU Manager and Topology Manager for instructions.
reservedSystemCPUs
Optionally, when configuring CPU Manager, the reservedSystemCPUs setting can be added to reserve specific CPUs for system work (OS and Kubernetes daemons), which can also be used in conjunction with CPU partitioning style tuning if needed for very low latency requirements.
Example KubeletConfig to apply CPU Manager along with the reservedSystemCPUs setting:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: "0,1"
Topology Manager
Topology Manager allows for further control of the pinning behavior at the node level by using “hint providers” (examples: SRIOV, GPU operators) to provide preferred NUMA node information to CPU Manager when the cpuset is selected, based on the topology policy that is configured.
Topology Manager can also be configured using a KubeletConfig; see Using CPU Manager and Topology Manager for details.
Memory Manager
When VMs are small enough to fit within a NUMA node, Memory Manager can be used in conjunction with the Topology Manager policies of single-numa-node or restricted to ensure memory is also affinitized to a preferred NUMA node. See the NUMA-aware scheduling section for more information.
If a VM is large enough to span multiple NUMA nodes on a system, the memory affinitization can instead be controlled through a VM option called guestMappingPassthrough; see the VM Tuning section for more information.
Memory Manager can be enabled by configuring the “static” policy using a kubeletConfig similar to the CPU and Topology Managers. Configuring Memory Manager also requires setting a reservedMemory value for at least one NUMA node. This value should equal the sum of all reserved memory for a node (kube-reserved + system-reserved + eviction-threshold). This value can be calculated by comparing a node’s total reserved value: Memory Capacity - Memory Allocatable. Example kubeletConfig syntax is provided below:
kubeletConfig:
  cpuManagerPolicy: static
  cpuManagerReconcilePeriod: 5s
  topologyManagerPolicy: single-numa-node
  memoryManagerPolicy: Static
  reservedMemory:
    - numaNode: 0
      limits:
        memory: "1124Mi"
Node Tuning Operator
The Node Tuning Operator (NTO) helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. Learn more about how to tune the host for performance using the Node Tuning Operator.
VM configuration tuning
Customizing a VM template
VM templates for multiple different operating systems and flavors are provided by default, and are easily customizable. When creating a VM from a template, it is also possible to customize the VM by selecting a Workload Type, which includes a high-performance option that uses many of the tuning options mentioned in this section. The high-performance option trades the broader compatibility of the server and desktop workload types for better performance, at the cost of requiring CPU Manager on the host and virtio drivers in the guest.
High-performance workload type
The default VM templates provide a “high-performance” workload type option that can be used when certain conditions apply. The additional tuning options used when this workload type is selected are summarized as follows; each option is described in more detail in later sections:
For all OS types:
- bus: virtio
- dedicatedCpuPlacement: true [ CPUManager “static” policy is required ]
- isolateEmulatorThread: true [ CPUManager “static” policy is required ]
- networkInterfaceMultiqueue: true

For non-Windows OS types:
- ioThreadsPolicy: shared
- dedicatedIOThread: true
Windows guest tuning
Important: When using OpenShift Data Foundation (ODF) to provide storage for Windows VMs, to improve performance and prevent unnecessary data re-transmits due to "dummy" I/O page usage, see the Important note in the documentation about using a storage class with the "rxbounce" feature enabled.
This section will discuss tuning options available for Windows VMs.
Note: the default high-performance VM templates include some critical options for Windows performance, those options are highlighted here for informational purposes.
Hyper-V enlightenments
These Windows-specific settings instruct QEMU to use paravirtualized options that significantly improve performance, especially due to defining the best performing timer options.
The default Windows VM templates include all Hyper-V enlightenment features that are continuously tested and supported. See the example list below:
spec:
  domain:
    clock:
      timer:
        hpet:
          present: false
        hyperv: {}
        pit:
          tickPolicy: delay
        rtc:
          tickPolicy: catchup
      utc: {}
    # ...
    features:
      acpi: {}
      apic: {}
      hyperv:
        frequencies: {}
        ipi: {}
        reenlightenment: {}
        relaxed: {}
        reset: {}
        runtime: {}
        spinlocks:
          spinlocks: 8191
        synic: {}
        synictimer:
          direct: {}
        tlbflush: {}
        vapic: {}
        vpindex: {}
Virtualization Based Security
When enabling Virtualization Based Security (VBS), which is enabled by default in Windows VM templates, it is important to configure additional hyper-v enlightenment settings for best performance. On Intel CPUs, add the “evmcs” setting to the VM definition as shown below:
hyperv:
  [...]
  evmcs: {}
VirtIO drivers
For compatibility purposes, the default Windows templates define the VM disk on the sata bus and the VM network interface with the e1000e model. It is highly recommended that users install the VirtIO drivers for Windows.
When VirtIO drivers are available, using the virtio bus for the VM disk and the VM network interface are highly recommended for performance, example:
devices:
  disks:
    - disk:
        bus: virtio
  # ...
  interfaces:
    - masquerade: {}
      model: virtio
Note: bus: virtio requires the viostor driver; bus: scsi requires the vioscsi driver.
Compute and memory
vCPU topology
The vCPU topology determines the scheduler domain layout inside a VM, which can have performance impacts. The desired vCPU topology can be specified in the VM definition using the CPU section:
cpu:
  cores: 1
  sockets: 4
  threads: 1
By default, if no topology is specified, sockets will be used for best performance since in the case of Linux VMs, the scheduler domain used in that case includes a “sync” wakeup hint that can improve message passing performance. In general - as is the case with KVM - specifying a vCPU topology to match the host topology is typically only helpful if pinning is used. Otherwise, the default topology tends to perform well in most cases. When running Windows VMs, the sockets and cores topology may be adjusted to meet any license restrictions on total number of sockets since the “sync” scheduler hint is not present in the Windows guest OS.
CPU resources
To allow for CPU overcommit, which is set to 10:1 by default, each VM's virt-launcher pod defines a CPU request of “100m” per vCPU, which is 1/10th of a host CPU from a Kubernetes resource and scheduling perspective.
However this default 10:1 CPU overcommit ratio can be configured to the desired overcommit level by changing the cluster CPU Allocation Ratio. Changing this ratio will influence the amount of cpu requests each vCPU is allocated by default, which enforces a max level of CPU overcommit through Kubernetes requests scheduling. Note that resource assignments are made at virt-launcher pod scheduling time, so any VMs will need to be stopped and restarted to change CPU allocation behavior after a ratio change.
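As a sketch, the ratio is set through the HyperConverged resource; the value below illustrates a 5:1 overcommit (each vCPU would then request 200m):

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  resourceRequirements:
    vmiCPUAllocationRatio: 5  # illustrative; 1 disables CPU overcommit entirely
```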
For cases where CPU overcommit is not desired for particular VMs, it is recommended to set the CPU resource requests near or equal to the total vCPU count to ensure the proper amount of resources are reserved from a scheduling perspective -- if the cluster CPU Allocation Ratio is not already configured to "1". Below is an example requesting 4 full cpus for a 4 vCPU VM:
cpu:
  cores: 1
  sockets: 4
  threads: 1
# ...
resources:
  requests:
    cpu: 4
    memory: 8Gi
Note: when using dedicatedCpuPlacement, requests are automatically configured based on the vCPU count.
Huge pages
By default, the host kernel provides Transparent Huge Page backing for VMs and pods. For some workloads, explicitly reserving hugepages in the host and backing the guest memory with hugepages can decrease the chances of TLB (Translation Lookaside Buffer) miss and help improve performance.
See Using huge pages with virtual machines for instructions on how to configure the huge page reservation and pageSize backing for the VM.
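As a sketch, huge page backing is requested in the VM's memory section; the page size must match pages reserved on the host, and the values here are illustrative:

```yaml
spec:
  domain:
    memory:
      hugepages:
        pageSize: "1Gi"  # must match a page size reserved on the host
      guest: 8Gi
```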
Pinning
Pinning a VM can improve performance in some scenarios, by aligning the CPU and memory affinity and by providing dedicated CPUs. There are multiple levels of pinning to consider, based on workload needs and the size of the VM.
Dedicated CPU placement
Using the dedicatedCpuPlacement setting instructs the virt-launcher pod to automatically configure requests and limits (to qualify for a Guaranteed QoS) based on the total amount of guest CPUs and memory requests. Note that when the High-performance Workload Type is selected when creating a VM from the default templates, the dedicatedCpuPlacement setting is enabled by default.
This setting requires CPU Manager to be enabled on the node: first the virt-launcher pod will receive a cpuset of dedicated CPUs, then inside the pod libvirt will pin each guest vCPU to a host CPU in the cpuset.
Dedicated emulator thread
Additionally, the optional isolateEmulatorThread setting pins both the emulator thread and IOthread (when one is requested) to a single host CPU. Note that this option adds one additional CPU to the total requests and limits that the virt-launcher pod defines, and it relies on dedicatedCpuPlacement as well.
The High-performance Workload Type also enables isolateEmulatorThread by default when a VM is created from the provided templates.
NUMA pinning
For workloads that are sensitive to memory affinity, the VM can be pinned to a particular NUMA node using two different methods:
- If the VM fits within a host NUMA node, the Memory Manager policy configuration on the host can be used to ensure memory affinity aligns with CPU affinity.
- If the VM is large enough to span multiple host NUMA nodes, the guestMappingPassthrough option can be used to ensure NUMA-level pinning. Note that this option requires hugepage backing of the VM memory. When specified, the host NUMA topology is mapped to the guest using libvirt numatune definitions, based on the cpuset the virt-launcher pod is provided by CPU Manager.
Below is example syntax of all of the pinning options covered in the previous sections:
cpu:
  cores: 24
  sockets: 4
  threads: 1
  dedicatedCpuPlacement: true
  isolateEmulatorThread: true
  numa:
    guestMappingPassthrough: {}
Networking
networkInterfaceMultiqueue
For VMs with more than one vCPU, setting networkInterfaceMultiqueue to ‘true’ can improve VM network performance. This setting automatically adds queues equal to the number of vCPUs to the vhost-net device definition, allowing multiple guest CPUs to process softirq work. Note that when the High-performance Workload Type is selected when creating a VM from the default templates, networkInterfaceMultiqueue is enabled automatically.
Example syntax below:
devices:
  interfaces:
    - masquerade: {}
      model: virtio
      name: default
  networkInterfaceMultiqueue: true
Multus
The default OVN-Kubernetes cluster network provides feature-rich software-defined networking capabilities including network isolation policies, IPsec encryption, and egress firewall and router, among others. OVN configures an Open vSwitch per node and uses Geneve encapsulation to provide the cluster overlay network.
In some cases, applications may benefit from the use of a separate data plane network option which can provide significantly better performance, especially in terms of latency. OpenShift Container Platform ships with the Multus CNI plug-in, which allows additional networks to be configured for VMs and pods.
Linux bridge
To configure an additional network for VMs, a Linux bridge network can be created on the host and assigned, using multus, in the VM definition. See Connecting a virtual machine to a Linux bridge network for configuration details.
OVN-Kubernetes additional networks
OVN-Kubernetes also provides secondary network options, which can be especially useful when the host has a single network interface that cannot be used by additional networks directly. Note: otherwise, to connect a bridge to the default OVN interface, a VLAN must first be created on the host interface; see Attach the default NIC to a bridge while using OVN Kubernetes for more information.
Another network option OpenShift provides is UserDefinedNetworks for advanced network segmentation needs.
Consider connecting VMs to an OVN secondary network or a User Defined Network to benefit from their respective features.
Bonding
As the environment allows, consider whether a bond could be created on host interfaces to increase throughput capabilities for a secondary network provided to VMs; see the documentation for supported bonding modes and how to configure bonds by using a NodeNetworkConfigurationPolicy.
SR-IOV
For strict high-performance network requirements, Single Root I/O Virtualization (SR-IOV) devices can be configured and attached to VMs using both the SR-IOV Operator and Multus CNI, providing them Virtual Function(s) capable of near-native performance by bypassing the host networking stack.
See Connecting a virtual machine to an SR-IOV network for the configuration process.
Storage
Much of the storage-related tuning is applied automatically when safe. However, there are configurable tuning options as well.
Profiles
Storage profiles are available to configure the default provisioning behavior for the configured storage class. Note that the storage profile preferences are used when the spec.storage API is used, but not when using the spec.pvc API.
Note: Generally speaking, when using a storage provider that supports block volumes, block mode provides better performance than file system mode. OpenShift Virtualization should configure the correct defaults for known storage providers, however the default volumeMode can be configured in a storage profile for each storage class, as in the following example using the OpenShift Data Foundation RBD class:
apiVersion: cdi.kubevirt.io/v1beta1
kind: StorageProfile
metadata:
  name: ocs-storagecluster-ceph-rbd
spec: {}
status:
  claimPropertySets:
  - accessModes:
    - ReadWriteMany
    volumeMode: Block
  - accessModes:
    - ReadWriteOnce
    volumeMode: Block
  - accessModes:
    - ReadWriteOnce
    volumeMode: Filesystem
  provisioner: openshift-storage.rbd.csi.ceph.com
  storageClass: ocs-storagecluster-ceph-rbd
Cloning
For some storage providers, creating many clones of a "snapshot" image can provide a much better scaling experience than a 1-to-1 clone of a PVC source directly. This behavior depends on how the CSI driver and the underlying storage handle clones and image relationships.
For example, when using OpenShift Data Foundation, the default Storage Profile will configure the cloning strategy as csi-clone. However, there are limits on how many clones can be created per PVC before background flattening processes occur, which can drastically slow down clone creation performance at scale.
When scaling to hundreds of clones from a single source, and where supported or recommended by the storage provider, it is recommended to use a VolumeSnapshot cloning method instead of the default csi-clone to improve performance.
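As a sketch, the clone strategy can be overridden in the storage class's StorageProfile; snapshot here is one of the supported strategy values:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: StorageProfile
metadata:
  name: ocs-storagecluster-ceph-rbd
spec:
  cloneStrategy: snapshot  # instead of the default csi-clone for this provider
```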
To use this method, first create a VolumeSnapshot of the source image:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: golden-volumesnapshot
  namespace: golden-ns
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: golden-snap-source
And then reference that snapshot as the DataVolume clone source, instead of a direct pvc source:
spec:
  source:
    snapshot:
      name: golden-volumesnapshot
      namespace: golden-ns
Preallocation
For cases where write performance can be improved by preallocation or “thick” provisioning, Containerized Data Importer (CDI) provides a preallocation option when creating a data volume.
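A minimal sketch of a DataVolume requesting preallocation (the name, source, and size are illustrative):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: preallocated-datavolume  # hypothetical name
spec:
  preallocation: true
  source:
    blank: {}
  storage:
    resources:
      requests:
        storage: 10Gi
```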
Disk I/O modes
Using “native” I/O mode (which uses kernel asynchronous I/O) can often provide better performance than “threads” mode (which uses user space threads). This setting is controlled by the io option in the VM disk definition:
disk:
  bus: virtio
io: native
However, this option can cause issues if used with a sparse disk, since it can block the I/O event loop when a write happens to a not fully allocated area and filesystem metadata needs to be updated. Because of this, OpenShift Virtualization automatically adds io: native when a block device or preallocated disk is used, which prevents the user from having to explicitly define this mode.
IOThreads and Block multi-queue
IOThreads are an option to improve storage performance and scalability of block I/O requests, especially when used in conjunction with block multi-queue.
The ioThreadsPolicy controls the behavior per VM:
- The “shared” policy means all VM disks will share a single IOthread, unless dedicatedIOThread: true is specified for a particular disk. This can be useful in cases where a VM has multiple disks but only a specific disk is expected to have heavy I/O load. Example syntax below:
domain:
  ioThreadsPolicy: shared
  [...]
  devices:
    disks:
      - disk:
          bus: virtio
        name: datadisk
        dedicatedIOThread: true
- The “auto” policy can be used to allocate a pool of threads (up to twice the number of total vCPUs), automatically assigning a single dedicated IOthread to each disk device. This option can be useful when a VM has multiple disks often performing single-threaded I/O workloads.
- Starting in 4.19, the supplementalPool policy provides an additional optimization that allows VM disk I/O to be spread among multiple submission threads, which are also mapped to multiple disk queues inside the VM. This option can be used to allocate a pool of threads, shared among all disks in the VM, assigning multiple IOthreads to each disk device. This tuning applies to bus: virtio disks when using Block volumeMode (or Filesystem mode with preallocation, in order to enable io: native behavior), and requires the blockMultiQueue: true setting, which allows the submission threads to map to queues inside the guest. More details are provided in the feature introduction blog post. Example syntax below:
spec:
  domain:
    cpu:
      cores: 1
      sockets: 16
      threads: 1
    memory:
      guest: 16Gi
    ioThreadsPolicy: supplementalPool
    ioThreads:
      supplementalPoolThreadCount: 4
    devices:
      blockMultiQueue: true
      disks:
        - name: rootdisk
          disk:
            bus: virtio