[UPI vSphere] Node scale-up doesn't work as expected

Solution Unverified - Updated

Environment

  • Red Hat OpenShift (OCP) 4.x
  • vSphere 6.5, 6.7

Issue

  • The node scale-up process does not work as expected and gets stuck. Machines are created within the hypervisor but are never powered on, thus making manual intervention necessary.

Resolution

  • vSphere's latency sensitivity feature is a tunable provided by the hypervisor. Setting latency sensitivity to 'high' may affect the VM's memory management (the memory reservation, sched.mem.min, is then expected to equal the VM's configured memory size) and interfere with day-to-day administration tasks.
  • Please refer to the VMware Kbase article (kb.vmware.com) and apply the changes it describes to address the problem.

Root Cause

Disclaimer: Links contained herein to an external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

  • In this case, the events in the vSphere web client reveal an invalid memory setting in the .OVA template file, which prevents the VMs from powering on.
    • Error string: Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize
  • Users should use the vSphere web client to determine the status of the virtual machine. If it matches the error string above, apply the changes described in the VMware Kbase article (kb.vmware.com) to address the problem.
  • The vSphere Latency Sensitivity setting (documented on docs.vmware.com) could interfere with OCP tasks. Therefore, if you require latency sensitivity to be set to 'high', you might run into the issues described in this article.
  • If a node appears to be stuck in the 'Provisioning' state after scaling up a MachineSet, users should investigate the status of the virtual machine in the vSphere instance itself as well.

Diagnostic Steps

Let's start by scaling up a MachineSet to get the full picture:

# oc scale machineset ocp4-d9g6x-worker --replicas=1 -n openshift-machine-api
# oc get machines -n openshift-machine-api
NAME                      PHASE         TYPE   REGION   ZONE   AGE
ocp4-d9g6x-worker-ch485   Provisioned                          69s                       

The machine does not progress any further and never reaches the "Running" phase.
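To spot stuck machines without reading the table by eye, the listing can be filtered for anything that has not reached the "Running" phase. This is only a minimal sketch: it assumes the default column layout shown above (NAME first, PHASE second), and the function name not_running is hypothetical.

```shell
# Print the name and phase of every machine that is not in the Running phase.
# Reads `oc get machines` table output on stdin and skips the header line.
# Assumption: NAME is column 1 and PHASE is column 2, as in the listing above.
not_running() {
  awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}

# Usage (requires a logged-in oc session):
#   oc get machines -n openshift-machine-api | not_running
```

An empty result means every listed machine is running. Any line that is printed names a machine worth checking on the vSphere side.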
Looking into this further, we see the following events:

# oc get events --sort-by='{.lastTimestamp}' -n openshift-machine-api
LAST SEEN   TYPE      REASON            OBJECT                            MESSAGE
[...]
18s         Warning   FailedCreate      machine/ocp4-d9g6x-worker-ch485                 ocp4-d9g6x-worker-ch485: reconciler failed to Create machine: task task-4264 has not finished
18s         Normal    Create            machine/ocp4-d9g6x-worker-ch485                 Created Machine ocp4-d9g6x-worker-ch485
18s         Warning   FailedUpdate      machine/ocp4-d9g6x-worker-ch485                 ocp4-d9g6x-worker-ch485: reconciler failed to Update machine: task task-4264 has not finished
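On a busy cluster the event list can be long, so it may help to keep only the warnings for the machine in question. The following is a sketch under the assumption that the table layout matches the listing above (TYPE in column 2, OBJECT in column 4). The function name machine_warnings is hypothetical.

```shell
# Keep only Warning events whose OBJECT is the given machine.
# Reads `oc get events` table output on stdin.
# Assumption: TYPE is column 2 and OBJECT is column 4, as in the listing above.
machine_warnings() {
  awk -v obj="machine/$1" '$2 == "Warning" && $4 == obj'
}

# Usage (requires a logged-in oc session):
#   oc get events --sort-by='{.lastTimestamp}' -n openshift-machine-api \
#     | machine_warnings ocp4-d9g6x-worker-ch485
```

Repeated "task ... has not finished" warnings for the same machine suggest that the task on the vSphere side deserves a closer look.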

In vCenter, the VM appears as created but powered off, so let's check the VM's events for a hint, here using the govc command-line tool:

# curl -LO https://github.com/vmware/govmomi/releases/download/v0.24.0/govc_linux_amd64.gz
# gunzip govc_linux_amd64.gz
# chmod +x govc_linux_amd64
# cp govc_linux_amd64 /usr/bin/govc
# export GOVC_URL='vCenter IP OR FQDN'
# export GOVC_USERNAME='vCenter User'
# export GOVC_PASSWORD='vCenter Password'
# export GOVC_INSECURE=1 # If the host above uses a self-signed cert

# govc events /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
[...]
[Wed Jan 20 15:44:27 2021] [info] Clone of rhcos-vmware completed
[Wed Jan 20 15:44:27 2021] [info] ocp4-d9g6x-worker-ch485 on host 192.168.1.160 in Datacenter is starting
[Wed Jan 20 15:44:27 2021] [info] Virtual machine ocp4-d9g6x-worker-ch485 failed to power on after cloning on host 192.168.1.160 in datacenter Datacenter

If we then try to power on the VM manually, the failed task reveals the reason behind this behavior:

# govc tasks /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
Task                                     Target                         Initiator                         Queued   Started Completed Result
Datacenter.ExecuteVmPowerOnLRO           ocp4-d9g6x-worker-ch485        Administrator                   14:55:29  14:55:29  14:55:30 error   [Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]
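To confirm that a VM is hitting this specific error, and to see which memory size the reservation is expected to match, the task output can be searched for the error string. This is a sketch that assumes the message format shown above. The function name expected_mem_reservation is hypothetical.

```shell
# Print the memsize value quoted in the "Invalid memory setting" error, if present.
# Reads `govc tasks` output on stdin; prints nothing when the error is absent.
# Assumption: the error text has the form "... should be equal to memsize(<number>)".
expected_mem_reservation() {
  sed -n 's/.*should be equal to memsize(\([0-9][0-9]*\)).*/\1/p'
}

# Usage (requires the GOVC_* variables set as above):
#   govc tasks /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485 | expected_mem_reservation
```

For the task shown above this prints 8192, the memory size that the reservation (sched.mem.min) is expected to equal.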

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.