OpenShift Virtualization migrations failing during workload upgrade or node eviction when approaching 400+ pending or undefined migrations

Solution Verified - Updated 9 Sept 2025

Environment

OpenShift Virtualization

Issue

As the combined number of pending and undefined VM migrations approaches 400, migrations may begin to fail. The risk of failure grows further once this number is exceeded.

Resolution

The fix is included in OpenShift Virtualization 4.19+ and 4.18.12+.
For earlier versions, a workaround procedure has been tested. A scripted version of the procedure is attached at the bottom of this article.

Manual Steps

For failure during workload updates

Record the current workloadUpdateMethods defined in the HyperConverged resource

$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -ojsonpath='{.spec.workloadUpdateStrategy.workloadUpdateMethods}'

Set the workloadUpdateMethods to an empty list: []

$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p '[{"op":"replace","path":"/spec/workloadUpdateStrategy/workloadUpdateMethods", "value":[]}]'

Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

$ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s

Scale down replicas to zero for the following Deployments in the openshift-cnv Namespace: hco-operator, virt-operator, virt-controller, hco-webhook

$ oc scale --replicas=0 deployment -n openshift-cnv hco-operator
$ oc scale --replicas=0 deployment -n openshift-cnv virt-operator
$ oc scale --replicas=0 deployment -n openshift-cnv virt-controller
$ oc scale --replicas=0 deployment -n openshift-cnv hco-webhook

Wait for virt-controller to scale to zero

$ while [[ $(oc get pod -n openshift-cnv -l kubevirt.io=virt-controller --no-headers| wc -l) -gt 0 ]]; do sleep 5; done

Wait for the number of virt-launcher pods to equal the number of VMs in a Running or Migrating status. This is to ensure that no VMs are migrating

$ VM_COUNT=$(oc get vm -A --no-headers | grep -E '(Running|Migrating)' | wc -l)
$ while [[ $(oc get pod -A -l kubevirt.io=virt-launcher --no-headers | grep Running | wc -l) -ne ${VM_COUNT} ]]; do sleep 5; done
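The two wait loops above poll indefinitely. As a minimal sketch, they could be wrapped in a helper that gives up after a timeout. The helper name wait_until and the check function no_controller_pods are hypothetical and are not part of the attached script.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success and 1 if the
# timeout elapses first.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Hypothetical check: succeeds once no virt-controller pods remain.
no_controller_pods() {
  [ "$(oc get pod -n openshift-cnv -l kubevirt.io=virt-controller --no-headers | wc -l)" -eq 0 ]
}

# Example call (not executed here):
# wait_until 300 5 no_controller_pods || echo "virt-controller pods still present" >&2
```

A timeout makes a stuck scale-down visible instead of leaving the shell waiting silently. The 300 second value is only an illustration.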

Delete all non-succeeded VirtualMachineInstanceMigrations. The finalizer specified in the metadata of each VMIM must be removed for these objects to be deleted successfully

$ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=false'
$ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc patch vmim -n {} --type=json --patch "[{ \"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [] }]"'
$ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=true'
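The commands above select VMIMs with grep -v Succeeded, which matches the string anywhere in the row. A possible variant, sketched here under the assumption that .status.phase holds the migration phase, filters on the phase field only. The function names select_unfinished and list_vmims are hypothetical.

```shell
# Read "<namespace> <name> <phase>" rows on stdin and print
# "<namespace> <name>" for every row whose phase is not exactly
# "Succeeded". Rows with an empty phase are also printed.
select_unfinished() {
  awk '$3 != "Succeeded" { print $1, $2 }'
}

# Hypothetical wrapper that produces the rows select_unfinished expects.
list_vmims() {
  oc get vmim -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.phase}{"\n"}{end}'
}

# Example pipeline (not executed here), mirroring the first delete command:
# list_vmims | select_unfinished | while read -r ns name; do
#   oc delete vmim -n "$ns" "$name" --wait=false
# done
```

The same pipeline shape could feed the finalizer patch and the final delete. The documented commands remain the tested procedure.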

Scale hco-operator Deployment back to a replica count of 1

$ oc scale --replicas=1 deployment -n openshift-cnv hco-operator

Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

$ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s

If workloadUpdateMethods was recorded and cleared at the start of this procedure because of failures during workload updates

Using the workloadUpdateMethods value recorded at the start of the procedure, update the HyperConverged resource

# Replace <WORKLOAD_UPDATE_METHODS> with the value recorded at the start of the procedure
$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p \
"[{\"op\":\"add\",\"path\":\"/spec/workloadUpdateStrategy/workloadUpdateMethods\", \"value\":<WORKLOAD_UPDATE_METHODS>}]"

Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

$ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
$ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
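Substituting the recorded value into the restore patch by hand is error-prone. A minimal sketch of building the patch from a saved value follows. It assumes the jsonpath query printed the list as a JSON array such as ["LiveMigrate"]. The names build_restore_patch and SAVED_METHODS are hypothetical.

```shell
# Build the JSON patch that restores workloadUpdateMethods.
# $1 is the JSON array recorded earlier, for example '["LiveMigrate"]'.
build_restore_patch() {
  printf '[{"op":"add","path":"/spec/workloadUpdateStrategy/workloadUpdateMethods","value":%s}]\n' "$1"
}

# Example usage (not executed here). SAVED_METHODS is an assumed variable
# name holding the output of the initial "oc get hyperconverged" query:
# SAVED_METHODS=$(oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
#   -ojsonpath='{.spec.workloadUpdateStrategy.workloadUpdateMethods}')
# oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json \
#   -p "$(build_restore_patch "$SAVED_METHODS")"
```

Printing the result of build_restore_patch before applying it is a simple way to confirm that the recorded value was captured as valid JSON.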

Repeat steps as necessary until all migrations are successful.

Root Cause

A bug has been reported in Jira.
This is a known issue in which the virt-controller queues are unable to handle a large number of migrations.

Diagnostic Steps

Check the current VirtualMachineInstanceMigration statuses for the following symptoms:
Number of "Succeeded" migrations has stopped increasing
Number of "Failed" migrations continues to increase
Combined number of "Pending" and "Undefined" migrations approaches or exceeds 400
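The checks above can be approximated with a per-phase tally of VMIM objects. This is a sketch only. It assumes that an "Undefined" migration is one whose .status.phase is empty, and the function names summarize_phases, backlog and list_vmim_phases are hypothetical.

```shell
# Tally migration phases read from stdin, one phase per line.
# An empty line is reported as "Undefined" (assumption: unset .status.phase).
summarize_phases() {
  awk '{ phase = ($0 == "" ? "Undefined" : $0); count[phase]++ }
       END { for (phase in count) print phase, count[phase] }' | sort
}

# Sum the Pending and Undefined counts from summarize_phases output.
backlog() {
  awk '$1 == "Pending" || $1 == "Undefined" { total += $2 }
       END { print total + 0 }'
}

# Hypothetical wrapper that prints one phase per VMIM across all namespaces.
list_vmim_phases() {
  oc get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}'
}

# Example calls (not executed here):
# list_vmim_phases | summarize_phases
# list_vmim_phases | summarize_phases | backlog
```

Running the tally a few minutes apart shows whether the Succeeded count has stalled while the Failed count keeps rising, and whether the backlog value is near 400.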

SBR

Virtualization

Product(s)

Red Hat OpenShift Container Platform

Category

Troubleshoot

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.