OpenShift Virtualization migrations failing during workload upgrade or node eviction when approaching 400+ pending or undefined migrations

Solution Verified - Updated

Environment

  • OpenShift Virtualization

Issue

As the combined number of pending and undefined VM migrations approaches 400, migrations may begin to fail. The risk of failure increases further once this number is exceeded.

Resolution

  • OpenShift Virtualization 4.19+ and 4.18.12+ contain the fix for this.
  • For versions without the fix, a workaround procedure has been tested. A scripted version of the procedure is attached at the bottom of this article.
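The version thresholds above can be checked mechanically. The following is a minimal sketch, not part of the attached script: `cnv_has_fix` is a hypothetical helper name, and how the installed version string is obtained (for example from the operator's ClusterServiceVersion) is left as an assumption for the reader to verify.

```shell
# Hypothetical helper: succeeds (exit 0) when the given OpenShift Virtualization
# version is 4.19 or later, or 4.18.12 or later within the 4.18 stream.
# Accepts strings such as "4.18.11", "v4.19.0" or "4.18.12-rc1".
# Variables are deliberately not declared 'local' to stay POSIX-compatible.
cnv_has_fix() {
  version="${1#v}"            # drop an optional leading "v"
  major="${version%%.*}"      # text before the first dot
  rest="${version#*.}"        # text after the first dot
  minor="${rest%%.*}"
  case "$rest" in
    *.*) patch="${rest#*.}" ;;  # a patch component is present
    *)   patch=0 ;;             # "4.19" is treated as "4.19.0"
  esac
  patch="${patch%%[!0-9]*}"   # strip suffixes such as "-rc1"
  patch="${patch:-0}"

  if [ "$major" -gt 4 ]; then return 0; fi
  if [ "$major" -eq 4 ] && [ "$minor" -ge 19 ]; then return 0; fi
  if [ "$major" -eq 4 ] && [ "$minor" -eq 18 ] && [ "$patch" -ge 12 ]; then return 0; fi
  return 1
}

# Example usage (the version string is illustrative):
#   if cnv_has_fix "4.18.11"; then echo "fix included"; else echo "workaround needed"; fi
```

If the check fails, the manual steps below apply.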

Manual Steps

  1. For failure during workload updates

    1. Record the current workloadUpdateMethods defined in the HyperConverged resource

      $ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -ojsonpath='{.spec.workloadUpdateStrategy.workloadUpdateMethods}'
      
    2. Set the workloadUpdateMethods to an empty list: []

      $ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p '[{"op":"replace","path":"/spec/workloadUpdateStrategy/workloadUpdateMethods", "value":[]}]'
      
    3. Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

      $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
      $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
      $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
      
  2. Scale down replicas to zero for the following Deployments in the openshift-cnv Namespace: hco-operator, virt-operator, virt-controller, hco-webhook

    $ oc scale --replicas=0 deployment -n openshift-cnv hco-operator
    $ oc scale --replicas=0 deployment -n openshift-cnv virt-operator
    $ oc scale --replicas=0 deployment -n openshift-cnv virt-controller
    $ oc scale --replicas=0 deployment -n openshift-cnv hco-webhook
    
  3. Wait for virt-controller to scale to zero

    $ while [[ $(oc get pod -n openshift-cnv -l kubevirt.io=virt-controller --no-headers| wc -l) -gt 0 ]]; do sleep 5; done
    
  4. Wait for the number of Running virt-launcher pods to equal the number of VMs in a Running or Migrating status. A migrating VM has both a source and a target virt-launcher pod, so equal counts indicate that no VMs are still migrating.

    $ VM_COUNT=$(oc get vm -A --no-headers | grep -E '(Running|Migrating)' | wc -l)
    $ while [[ $(oc get pod -A -l kubevirt.io=virt-launcher --no-headers | grep Running | wc -l) -ne ${VM_COUNT} ]]; do sleep 5; done
    
  5. Delete all non-succeeded VirtualMachineInstanceMigrations (VMIMs). The finalizers specified in the metadata of each VMIM must be removed before these objects can be deleted successfully.

    $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=false'
    $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc patch vmim -n {} --type=json --patch "[{ \"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [] }]"'
    $ oc get vmim -A --no-headers | grep -v Succeeded | awk '{ print $1 " " $2; }' | xargs -I{} bash -c 'oc delete vmim -n {} --wait=true'
    
  6. Scale hco-operator Deployment back to a replica count of 1

    $ oc scale --replicas=1 deployment -n openshift-cnv hco-operator
    
  7. Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

    $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
    $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
    $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
    
  8. If step 1 was performed due to failure during workload updates

    1. Restore the workloadUpdateMethods captured at step 1 in the HyperConverged resource

      # Replace <WORKLOAD_UPDATE_METHODS> with the data captured at step 1
      $ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p \
          "[{\"op\":\"add\",\"path\":\"/spec/workloadUpdateStrategy/workloadUpdateMethods\", \"value\":<WORKLOAD_UPDATE_METHODS>}]"
      
    2. Wait for the HyperConverged resource to become Upgradeable and the hco-webhook Deployment to become Available

      $ oc wait --for=condition=Upgradeable=False hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
      $ oc wait --for=condition=Upgradeable=True hyperconverged kubevirt-hyperconverged -n openshift-cnv --timeout 60s
      $ oc wait --for=condition=Available=True deployment hco-webhook -n openshift-cnv --timeout 60s
      
  9. Repeat steps as necessary until all migrations are successful.
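
Before running the deletion in step 5, it may be useful to preview which VMIMs would be selected. The sketch below is an illustration only and is not taken from the attached script: `non_succeeded_vmims` is a hypothetical helper that reads lines of the form "namespace name phase" and prints the namespace and name of every entry whose phase is not Succeeded. It assumes the input is produced with a jsonpath template over `.metadata.namespace`, `.metadata.name` and `.status.phase`, so that a VMIM with no phase set yields a line with only two fields.

```shell
# Hypothetical helper: filter "namespace name phase" lines from stdin and
# print "namespace name" for every VMIM that has not succeeded.
# A missing phase (only two fields on the line) counts as non-succeeded.
non_succeeded_vmims() {
  awk '$3 != "Succeeded" { print $1, $2 }'
}

# Example usage against a cluster (read-only, nothing is deleted):
#   oc get vmim -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
#     | non_succeeded_vmims
```

Reviewing this list first gives a chance to confirm that only stuck or failed migrations will be removed.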

Root Cause

Diagnostic Steps

  • Check the current VirtualMachineInstanceMigration statuses
  • Number of "Succeeded" migrations has stopped increasing
  • Number of "Failed" migrations continues to increase
  • Combined number of "Pending" and "Undefined" migrations approaches or exceeds 400
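
The checks above can be sketched as a small tally over migration phases. This is a minimal example under stated assumptions, not an official diagnostic: `tally_vmim_phases` and `backlog_count` are hypothetical helper names, the input is expected to be one `.status.phase` value per line, and an empty line (no phase set) is assumed to correspond to what this article calls "Undefined".

```shell
# Hypothetical helper: read one migration phase per line from stdin and
# print "<phase> <count>" pairs, sorted by phase name.
# Assumption: an empty line means the VMIM has no phase yet ("Undefined").
tally_vmim_phases() {
  awk '
    { phase = ($0 == "" ? "Undefined" : $0); count[phase]++ }
    END { for (p in count) print p, count[p] }
  ' | sort
}

# Hypothetical helper: print the combined number of Pending and Undefined
# migrations, the figure this article compares against 400.
backlog_count() {
  awk '$0 == "" || $0 == "Pending" { n++ } END { print n + 0 }'
}

# Example usage against a cluster (read-only):
#   oc get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | tally_vmim_phases
#   oc get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | backlog_count
```

Running the tally a few minutes apart shows whether the Succeeded count has stalled while the Failed count keeps growing.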

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.