Use of VMware VMotion with OpenShift Container Storage / OpenShift Data Foundation

Solution Unverified

Environment

  • OpenShift Container Platform 4.x
  • Red Hat OpenShift Container Storage/OpenShift Data Foundation 4.x
  • VMware Distributed Resources Scheduler (DRS)
  • VMware vMotion

Issue

After a vMotion virtual machine migration, pods on the migrated node fail to mount any previously mounted OCS/ODF persistent volume.

Resolution

Ensure that VM UUIDs persist and do not change after a VM is moved. This can be achieved by setting the VMware property uuid.action = "keep" on the OpenShift VMs.

uuid.action = "keep" 

The process to do this is documented in the VMware knowledge base article on keeping a UUID for a moved virtual machine. This setting mitigates the issue of not being able to mount an existing PV. After implementing the above change on all the OpenShift VMs, the solution can be tested by performing a forced vMotion of an OCS/ODF node.
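As a sketch, the property could be applied to each OpenShift VM with `govc` (the vSphere CLI). The VM names and the `GOVC` dry-run variable are assumptions for illustration, not part of the documented procedure:

```shell
# Sketch: set uuid.action = "keep" on each OpenShift VM with govc.
# GOVC defaults to a dry run that only prints the commands; set
# GOVC=govc (with GOVC_URL etc. exported) to actually apply the change.
GOVC="${GOVC:-echo govc}"

set_uuid_keep() {
  # Add the advanced config key so the BIOS UUID survives a migration.
  $GOVC vm.change -vm "$1" -e "uuid.action=keep"
}

# Hypothetical VM names -- replace with your OpenShift node VMs.
for vm in ocp-worker-0 ocp-worker-1 ocp-worker-2; do
  set_uuid_keep "$vm"
done
```

Running the dry run first lets you review the exact `govc vm.change` invocations before touching vCenter.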

There may be a way to enforce this change globally rather than applying it to each VM manually; discuss this with a VMware SME.

The above fix is likely only relevant in environments where the VMs do not reside on the local storage of the ESXi hosts (which would require a Storage vMotion), but instead use external or shared storage, e.g. vSAN.

In addition to the VMX property modification that fixes the storage mounting, it is also recommended to follow VMware best practices and set up VM-VM anti-affinity rules for the nodes running OCS/ODF; see the VMware documentation on vSphere DRS Design for a Red Hat OpenShift Workload Domain. This is unrelated to this specific issue; however, in the bigger picture of running OCS/ODF on VMware with DRS/vMotion, you would not want multiple storage nodes to end up on the same ESXi hypervisor, and VMware anti-affinity rules prevent this.
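As a hedged sketch, such a VM-VM anti-affinity rule could be created with `govc cluster.rule.create`; the cluster inventory path, rule name, and VM names below are assumptions:

```shell
# Sketch: create a DRS VM-VM anti-affinity rule for the OCS/ODF node VMs.
# GOVC defaults to a dry run that only prints the command; set GOVC=govc
# to run it against vCenter.
GOVC="${GOVC:-echo govc}"

create_anti_affinity_rule() {
  cluster="$1"; name="$2"; shift 2
  # One rule covering all listed VMs: DRS keeps them on separate ESXi hosts.
  $GOVC cluster.rule.create -cluster "$cluster" -name "$name" \
    -enable -anti-affinity "$@"
}

# Hypothetical inventory path and VM names.
create_anti_affinity_rule /dc/host/ocp-cluster odf-anti-affinity \
  ocp-worker-0 ocp-worker-1 ocp-worker-2
```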

Note: At this time, vMotion/DRS is not tested with OpenShift Container Storage/OpenShift Data Foundation.

Root Cause

When using VMware vMotion/DRS with internal OCS/ODF and dynamic storage provisioning, pods will sometimes fail to mount any previously mounted OCS/ODF PV after a node is moved to another ESXi hypervisor. The behaviour observed after a vMotion is that the connection from the node/pods to the storage breaks, because the VM's UUID changes during the migration.

Diagnostic Steps

  • Examine pod logs to view the failure to mount persistent volumes
  • Examine the VMware logs from the time of the vMotion
  • Export the VMs' .vmx files before and after a vMotion to observe the change in the machine's UUID
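The UUID comparison in the last step can be scripted. This is a minimal sketch assuming the .vmx exports were saved as local files (file names hypothetical):

```shell
# Sketch: extract the BIOS UUID from an exported .vmx file so the values
# from before and after a vMotion can be compared.
bios_uuid() {
  sed -n 's/^uuid\.bios = "\(.*\)"$/\1/p' "$1"
}

# Hypothetical usage with exports taken before and after the migration:
#   if [ "$(bios_uuid before.vmx)" != "$(bios_uuid after.vmx)" ]; then
#     echo "BIOS UUID changed across vMotion"
#   fi
```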

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.