OpenShift Virtualization migrations failing during workload upgrade or node eviction when approaching 400+ pending or undefined migrations
Environment
- OpenShift Virtualization
Issue
As the combined number of pending and undefined VM migrations approaches 400, migrations may begin to fail. The risk of failure increases further once this number is exceeded.
Resolution
- OpenShift Virtualization 4.19+ and 4.18.12+ contain the fix for this.
- A workaround procedure has been tested. A scripted version of the procedure is attached at the bottom of this article.
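To decide whether the workaround is needed at all, the fixed versions listed above can be compared against the installed version. The following is a minimal sketch, not an official check: has_migration_fix is a hypothetical helper, it expects a version string in x.y.z form (for example, as displayed by oc get csv -n openshift-cnv), and it assumes that any major version above 4 also contains the fix.

```shell
# Hypothetical helper: succeeds if the given OpenShift Virtualization
# version (x.y.z, optional leading "v") is 4.19+ or 4.18.12+.
# Assumption: any major version above 4 also contains the fix.
has_migration_fix() {
  _v=${1#v}
  _major=${_v%%.*}
  _rest=${_v#*.}
  _minor=${_rest%%.*}
  case $_rest in
    *.*) _patch=${_rest#*.} ;;   # patch component present
    *)   _patch=0 ;;             # no patch component given
  esac
  _patch=${_patch%%[!0-9]*}      # drop suffixes such as "-rc1"
  if [ "$_major" -gt 4 ]; then return 0; fi
  if [ "$_major" -lt 4 ]; then return 1; fi
  if [ "$_minor" -ge 19 ]; then return 0; fi
  [ "$_minor" -eq 18 ] && [ "${_patch:-0}" -ge 12 ]
}

# Example:
#   has_migration_fix 4.18.12 && echo "fix included"
```

A version that fails this check is a candidate for the manual steps below.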
Manual Steps
1. For failure during workload updates:
   - Record the current workloadUpdateMethods defined in the HyperConverged resource:
       $ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -ojsonpath='{.spec.workloadUpdateStrategy.workloadUpdateMethods}'
   - Set the workloadUpdateMethods to an empty list ([]):
       $ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p '[{"op":"replace","path":"/spec/workloadUpdateStrategy/workloadUpdateMethods", "value":[]}]'
   - Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available:
       $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
2. Scale down replicas to zero for the following Deployments in the openshift-cnv Namespace: hco-operator, virt-operator, virt-controller, hco-webhook
       $ oc scale --replicas=0 deployment -n openshift-cnv hco-operator
       $ oc scale --replicas=0 deployment -n openshift-cnv virt-operator
       $ oc scale --replicas=0 deployment -n openshift-cnv virt-controller
       $ oc scale --replicas=0 deployment -n openshift-cnv hco-webhook
3. Wait for virt-controller to scale to zero:
       $ while [[ $(oc get pod -n openshift-cnv -l kubevirt.io=virt-controller --no-headers| wc -l) -gt 0 ]]; do sleep 5; done
4. Wait for the number of virt-launcher pods to equal the number of VMs in a Running or Migrating status. This ensures that no VMs are migrating:
       $ VM_COUNT=$(oc get vm -A --no-headers | grep -E '(Running|Migrating)' | wc -l)
       $ while [[ $(oc get pod -A -l kubevirt.io=virt-launcher --no-headers | grep Running | wc -l) -ne ${VM_COUNT} ]]; do sleep 5; done
5. Delete all non-succeeded VirtualMachineInstanceMigrations (VMIMs). The finalizer specified in the metadata of each VMIM must be removed before these objects can be deleted:
       $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=false'
       $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc patch vmim -n {} --type=json --patch "[{ \"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [] }]"'
       $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=true'
6. Scale the hco-operator Deployment back to a replica count of 1:
       $ oc scale --replicas=1 deployment -n openshift-cnv hco-operator
7. Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available:
       $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
8. If step 1 was performed due to failure during workload updates:
   - Using the workloadUpdateMethods captured at step 1, update the HyperConverged resource:
       # Replace <WORKLOAD_UPDATE_METHODS> with the data captured at step 1
       $ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p \
           "[{\"op\":\"add\",\"path\":\"/spec/workloadUpdateStrategy/workloadUpdateMethods\", \"value\":<WORKLOAD_UPDATE_METHODS>}]"
   - Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available:
       $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
       $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
9. Repeat steps as necessary until all migrations are successful.
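The polling loops in the manual steps wait indefinitely if the expected state is never reached. As a minimal sketch of a bounded alternative, the helper below polls a command until it succeeds or a timeout elapses. Both wait_until and no_virt_controller_pods are hypothetical names introduced for illustration, and the timeout and interval values in the example are arbitrary assumptions, not tested recommendations.

```shell
# Hypothetical helper: poll a command until it succeeds or a timeout elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  _wu_timeout=$1
  _wu_interval=$2
  shift 2
  _wu_deadline=$(( $(date +%s) + _wu_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$_wu_deadline" ]; then
      echo "timed out after ${_wu_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$_wu_interval"
  done
}

# Same condition as the virt-controller wait step: succeeds once no
# virt-controller pods are listed in the openshift-cnv namespace.
no_virt_controller_pods() {
  [ "$(oc get pod -n openshift-cnv -l kubevirt.io=virt-controller --no-headers | wc -l)" -eq 0 ]
}

# Example (requires cluster access, shown for illustration only):
#   wait_until 300 5 no_virt_controller_pods || exit 1
```

Returning a non-zero status on timeout lets a script stop before deleting migrations while virt-controller is still running.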
Root Cause
- Bug reported in Jira.
- Known issue: the virt-controller queues are unable to handle a large number of migrations.
Diagnostic Steps
- Check the current VirtualMachineInstanceMigration statuses:
  - The number of "Succeeded" migrations has stopped increasing
  - The number of "Failed" migrations continues to increase
  - The combined number of "Pending" and "Undefined" migrations approaches or exceeds 400
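The checks above can be sketched as two small shell helpers that read one migration phase per line. This is an illustration under stated assumptions: summarize_vmim_phases and pending_plus_undefined are hypothetical names, and a migration with no phase set (an empty line) is assumed to correspond to the "Undefined" status mentioned above.

```shell
# Hypothetical helpers: read one VMIM phase per line on stdin.
# Assumption: an empty line (no .status.phase set) is counted as "Undefined".

# Print "<phase> <count>" for every phase seen, sorted by phase name.
summarize_vmim_phases() {
  awk '{ phase = ($0 == "" ? "Undefined" : $0); count[phase]++ }
       END { for (p in count) printf "%s %d\n", p, count[p] }' | sort
}

# Print the combined number of Pending and Undefined migrations.
pending_plus_undefined() {
  awk '$0 == "" || $0 == "Pending" { n++ } END { print n + 0 }'
}

# Example (requires cluster access, shown for illustration only):
#   oc get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | summarize_vmim_phases
#   oc get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | pending_plus_undefined
```

Running the summary a few times in a row shows whether the "Succeeded" count has stalled while the "Failed" count keeps growing.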
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.