The control-plane-machine-set-operator pod is in CrashLoopBackOff state during RHOCP 4 cluster upgrade to version 4.15

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.15
  • Control Plane Machine Set
  • VMware vSphere

Issue

  • The control-plane-machine-set-operator pod is restarting multiple times with the following nil pointer runtime error while upgrading to OpenShift 4.15:
   "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="3799e713-b2da-4a67-83c1-2541c1c9a57e"
   panic: runtime error: invalid memory address or nil pointer dereference [recovered]
       panic: runtime error: invalid memory address or nil pointer dereference
   [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]

Resolution

This issue has been reported to Red Hat Engineering. It is tracked in OCPBUGS-31808 and is already fixed in OpenShift 4.15.11:

Target Minor Release    Bug             Fixed Version    Errata
4.16                    OCPBUGS-31808   N/A              N/A
4.15                    OCPBUGS-32414   4.15.11          RHSA-2024:2068

If the cluster is stuck upgrading to an earlier OpenShift 4.15 release that does not contain the fix, a workaround is required.
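To decide whether the workaround below is needed, the target release can be compared against 4.15.11, the first fixed 4.15.z release per the table above. A minimal sketch in Python, assuming a plain x.y.z version string (the function name and cutoff tuple are illustrative, not part of any OpenShift tooling):

```python
# Sketch: check whether a release already contains the fix, assuming the
# first fixed z-stream is 4.15.11 (per the table above)
def has_fix(version: str, first_fixed=(4, 15, 11)) -> bool:
    parts = tuple(int(p) for p in version.split("."))
    return parts >= first_fixed  # tuple comparison is lexicographic

print(has_fix("4.15.10"))  # False -> workaround below is required
print(has_fix("4.15.11"))  # True  -> contains the fix
```

Releases at or above the first fixed z-stream (including 4.16.z) compare greater under tuple ordering, so they report `True`.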

Workaround for older 4.15 releases

If this issue occurs during a cluster upgrade to an OpenShift 4.15 version without the fix, and the upgrade is stuck because of it, let the upgrade progress by taking a backup of the controlplanemachineset CR and recreating it with the failureDomains field removed from spec.template.machines_v1beta1_machine_openshift_io, as follows:

$ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api -o yaml > controlplanemachineset_backup.yaml
$ oc delete controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api

Now update the backed-up configuration by removing the following two fields, and recreate the controlplanemachineset CR from the updated file:

$ vi controlplanemachineset_backup.yaml
	    failureDomains:                 <======== Remove this field
	      platform: ""                  <======== Remove this field

$ oc create -f controlplanemachineset_backup.yaml
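If editing the backup by hand is error-prone, the two lines can also be stripped programmatically before recreating the CR. A minimal sketch in Python using only the standard library (a proper YAML parser such as PyYAML would be more robust); the CR excerpt below is illustrative, not a full backup:

```python
import re

# Illustrative excerpt of a backed-up ControlPlaneMachineSet (not a full CR)
cr_text = '''\
spec:
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: ""
      metadata: {}
'''

# Drop the empty failureDomains block: the field plus its single platform child
patched = re.sub(r'(?m)^[ \t]*failureDomains:\n[ \t]*platform: ""\n', '', cr_text)
print('failureDomains' in patched)  # False
```

The patched text can then be written back to controlplanemachineset_backup.yaml before running `oc create -f`.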

Note that with the above change, the control-plane-machine-set-operator pod will start and let the upgrade finish. However, some errors can still occur while reconciling the controlplanemachineset, appearing both in the operator pod logs, as below, and in the status of the controlplanemachineset CR:

"msg"="Reconciler error" "error"="error reconciling control plane machine set: failed to update control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: \"test123-rhcos\": template must be provided as the full path" "controller"="controlplanemachineset" "reconcileID"="617cd234-ce4d-4ac2-80f7-8a0463b40b6b"

Another bug, OCPBUGS-32357, was opened for this error and is also fixed in OpenShift 4.15.11.
If this last error happens in a 4.15 version without the fix, the workaround is to temporarily delete the validatingwebhookconfiguration controlplanemachineset.machine.openshift.io and recreate it after the controlplanemachineset is created:

$ oc get validatingwebhookconfigurations controlplanemachineset.machine.openshift.io -o yaml > validatingwebhook_controlplanemachineset.yaml
$ oc delete validatingwebhookconfigurations controlplanemachineset.machine.openshift.io
$ oc create -f <controlplanemachineset-file-name.yaml>
$ oc create -f validatingwebhook_controlplanemachineset.yaml

Root Cause

A controlplanemachineset CR with a nil/empty value for the failureDomain triggers a nil pointer error in the code.

As per the sample VMware vSphere failure domain configuration documentation, defining a failure domain for a control plane machine set on vSphere is a Technology Preview feature. Failure domains for the vSphere platform were introduced only in the OpenShift 4.15 release, and at the time of writing this Solution, the code does not handle the upgrade scenario where the failure domain field does not contain a valid value.
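The failure mode is the classic pattern of dereferencing a nested value without a guard. The Python sketch below is only an analogy of that pattern (the real code is Go, in ExtractFailureDomain as shown in the stack trace); the class and attribute names are illustrative:

```python
# Analogy only: a failureDomains entry with platform "" carries no nested
# provider-specific value, and dereferencing it without a nil check
# crashes the reconciler.
class FailureDomains:
    def __init__(self, platform, vsphere=None):
        self.platform = platform
        self.vsphere = vsphere  # remains None when platform is ""

def extract_failure_domain(fd):
    return fd.vsphere.region  # no guard: fails when vsphere is None

fd = FailureDomains(platform="")
try:
    extract_failure_domain(fd)
except AttributeError as exc:
    print("panic-equivalent:", exc)
```

In Go, the equivalent dereference of a nil pointer raises the SIGSEGV seen in the operator logs instead of a catchable exception.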

Diagnostic Steps

  1. Check whether the control-plane-machine-set-operator pod is in CrashLoopBackOff state:
   $ oc get pods -n openshift-machine-api | grep 'control-plane-machine-set'
   control-plane-machine-set-operator-xxxx               0/1     CrashLoopBackOff   9          1h
  2. Describe the control-plane-machine-set-operator pod and look for hints about the reason for the CrashLoopBackOff:
   $ oc describe pod <control-plane-machine-set-operator-pod-name> -n openshift-machine-api
   [..]
   containerStatuses:
   - containerID: cri-o://d6eefd961f71273d5a6558aab2f81a385da836f952f9431d8e485d35d6a78c64
     image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a1426b791d89d0bb2be1ef49d2a5b401f0a741fa7df9252d52f863f0ceabd04
     imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a1426b791d89d0bb2be1ef49d2a5b401f0a741fa7df9252d52f863f0ceabd04
     lastState:
       terminated:
         containerID: cri-o://d6eefd961f71273d5a6558aab2f81a385da836f952f9431d8e485d35d6a78c64
         exitCode: 2
         finishedAt: "2024-04-04T09:32:23Z"
         reason: Error
         startedAt: "2024-04-04T09:29:58Z"
     name: control-plane-machine-set-operator
     ready: false
     restartCount: 9
     started: false
     state:
       waiting:
         message: back-off 5m0s restarting failed container=control-plane-machine-set-operator
   	pod=control-plane-machine-set-operator-c674d9976-z6d6g_openshift-machine-api(1e54ff4b-ebec-46fd-9af0-04dc529e5cc5)
         reason: CrashLoopBackOff
   [..]
  3. Check the control-plane-machine-set-operator pod logs and see if they show a runtime error: invalid memory address or nil pointer dereference for the failure domain, as follows:
   2024-04-04T09:32:23.597509741Z I0404 09:32:23.597426       1 watch_filters.go:179] reconcile triggered by infrastructure change
   2024-04-04T09:32:23.606311553Z I0404 09:32:23.606243       1 controller.go:220]  "msg"="Starting workers" "controller"="controlplanemachineset" "worker count"=1
   2024-04-04T09:32:23.606360950Z I0404 09:32:23.606340       1 controller.go:169]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
   2024-04-04T09:32:23.609322467Z I0404 09:32:23.609217       1 panic.go:884]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
   2024-04-04T09:32:23.609322467Z I0404 09:32:23.609271       1 controller.go:115]  "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
   2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference [recovered]
   2024-04-04T09:32:23.612540681Z     panic: runtime error: invalid memory address or nil pointer dereference
   2024-04-04T09:32:23.612540681Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]
   2024-04-04T09:32:23.612540681Z 
   2024-04-04T09:32:23.612540681Z goroutine 255 [running]:
   2024-04-04T09:32:23.612540681Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
   2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
   2024-04-04T09:32:23.612571624Z panic({0x1c8ac60, 0x31c6ea0})
   2024-04-04T09:32:23.612571624Z     /usr/lib/golang/src/runtime/panic.go:884 +0x213
   2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.VSphereProviderConfig.ExtractFailureDomain(...)
   2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/vsphere.go:120
   2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.providerConfig.ExtractFailureDomain({{0x1f2a71a, 0x7}, {{{{...}, {...}}, {{...}, {...}, {...}, {...}, {...}, {...}, ...}, ...}}, ...})
   2024-04-04T09:32:23.612588145Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/providerconfig.go:212 +0x23c
  4. Check whether the controlplanemachineset CR has platform: "" under failureDomains:
   $ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api -oyaml
   apiVersion: machine.openshift.io/v1
   kind: ControlPlaneMachineSet
   [..]
   spec:
     replicas: 3
     selector:
       matchLabels:
         machine.openshift.io/cluster-api-cluster: test-xxxx
         machine.openshift.io/cluster-api-machine-role: master
         machine.openshift.io/cluster-api-machine-type: master
     state: Active
     strategy:
       type: RollingUpdate
     template:
       machineType: machines_v1beta1_machine_openshift_io
       machines_v1beta1_machine_openshift_io:
         failureDomains:                 <======== This field
           platform: ""                  <======== This field
         metadata:
           labels:
             machine.openshift.io/cluster-api-cluster: test-xxxx
             machine.openshift.io/cluster-api-machine-role: master
             machine.openshift.io/cluster-api-machine-type: master
         spec:
           lifecycleHooks: {}
           metadata: {}
           providerSpec:
             value:
               apiVersion: machine.openshift.io/v1beta1
               credentialsSecret:
                 name: vsphere-cloud-credentials
               diskGiB: 120
               kind: VSphereMachineProviderSpec
               memoryMiB: 32768
               metadata:
                 creationTimestamp: null
               network:
                 devices:
                 - networkName: ocp-xxxx
               numCPUs: 8
               numCoresPerSocket: 4
               snapshot: ""
               template: xxx-123
               userDataSecret:
                 name: master-user-data
               workspace:
                 datacenter: Datacenter
                 datastore: iscsi_test
                 folder: /Datacenter/vm/xxx-123
                 resourcePool: /Datacenter/host/OpenShift/Resources
             [..]
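The check in step 4 can be scripted against a saved copy of the CR, for example one dumped with `oc get controlplanemachineset.machine.openshift.io cluster -n openshift-machine-api -o yaml`. A minimal sketch in Python (regex-based for portability; a YAML parser would be more robust), with the helper name and sample strings being illustrative:

```python
import re

def has_empty_failure_domain(cr_yaml: str) -> bool:
    """True if the CR text contains a failureDomains block whose
    platform is the empty string (the trigger for the panic)."""
    return re.search(r'failureDomains:\s*\n\s*platform: ""', cr_yaml) is not None

affected = 'failureDomains:\n          platform: ""\n'
print(has_empty_failure_domain(affected))                                  # True
print(has_empty_failure_domain('failureDomains:\n  platform: "VSphere"'))  # False
```

A `True` result on an affected 4.15 version indicates the CR matches the condition described in the Root Cause above.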

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.