The control-plane-machine-set-operator pod is in CrashLoopBackOff state during RHOCP 4 cluster upgrade to version 4.15
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.15
- Control Plane Machine Set
- VMware vSphere
Issue
- The control-plane-machine-set-operator pod is restarting multiple times with the following nil pointer runtime error while upgrading to OpenShift 4.15:
"msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="3799e713-b2da-4a67-83c1-2541c1c9a57e"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]
Resolution
This issue has been reported to Red Hat Engineering. It is tracked in bug OCPBUGS-31808 and is already fixed in OpenShift 4.15.11:
| Target Minor Release | Bug | Fixed Version | Errata |
|---|---|---|---|
| 4.16 | OCPBUGS-31808 | N/A | N/A |
| 4.15 | OCPBUGS-32414 | 4.15.11 | RHSA-2024:2068 |
If the cluster is stuck upgrading to an OpenShift 4.15 release earlier than 4.15.11, a workaround is required.
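The version check can be made explicit with a small shell helper. This is a minimal sketch, assuming the target version is available as a plain string (for example from `oc get clusterversion version -o jsonpath='{.status.desired.version}'`). The function name has_cpms_fix is hypothetical, and the helper only covers 4.15.z versions, using the 4.15.11 threshold from the table above.

```shell
# Hypothetical helper, not part of oc.
# Exit status: 0 = 4.15.z with z >= 11 (fix included),
#              1 = 4.15.z with z < 11 (workaround may be needed),
#              2 = not a 4.15.z version (not covered by this sketch).
has_cpms_fix() {
  case "$1" in
    4.15.*) ;;
    *) return 2 ;;
  esac
  z=${1#4.15.}          # keep only the patch component
  z=${z%%[!0-9]*}       # drop any non-numeric suffix
  [ -n "$z" ] || return 2
  [ "$z" -ge 11 ]
}

# Example usage:
#   target=$(oc get clusterversion version -o jsonpath='{.status.desired.version}')
#   has_cpms_fix "$target" || echo "fix not confirmed for $target"
```

Treating any version outside 4.15.z as "unknown" rather than "fixed" is a deliberate choice here, because the table lists no fixed version for 4.16.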
Workaround for older 4.15 releases
If this issue occurs during a cluster upgrade to an OpenShift 4.15 version without the fix, and the upgrade is stuck because of it, back up the controlplanemachineset CR, delete it, and recreate it without the failureDomains field under spec.template.machines_v1beta1_machine_openshift_io, as follows:
$ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api -o yaml > controlplanemachineset_backup.yaml
$ oc delete controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api
Next, edit the backup file to remove the following two lines (the failureDomains field and its platform subfield), then recreate the controlplanemachineset CR from the edited file:
$ vi controlplanemachineset_backup.yaml
failureDomains: <======== Remove this field
platform: "" <======== Remove this field
$ oc create -f controlplanemachineset_backup.yaml
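The manual edit can also be done non-interactively. The following is a minimal sketch, assuming the backup file uses space indentation as produced by `oc get -o yaml`. The name strip_failure_domains is a hypothetical helper, not an oc feature. It drops the failureDomains key together with every more deeply indented line below it, and prints the result to standard output.

```shell
# Hypothetical helper: print the YAML file given as $1 without the
# failureDomains key and its nested fields (such as platform: "").
strip_failure_domains() {
  awk '
    # Number of leading spaces, or -1 for a blank line.
    function indent_of(s,   n) { n = match(s, /[^ ]/); return n ? n - 1 : -1 }

    # While skipping, drop blank lines and lines nested under failureDomains.
    skip {
      i = indent_of($0)
      if (i < 0 || i > fd_indent) next
      skip = 0
    }

    # Start skipping at a bare "failureDomains:" key.
    /^ *failureDomains:[ \t]*$/ { fd_indent = indent_of($0); skip = 1; next }

    { print }
  ' "$1"
}

# Example usage (file names are illustrative):
#   strip_failure_domains controlplanemachineset_backup.yaml > controlplanemachineset_edited.yaml
#   diff controlplanemachineset_backup.yaml controlplanemachineset_edited.yaml
```

Writing to a separate file keeps the original backup untouched, so the diff can be reviewed before the CR is recreated.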
Note that with the above change, the control-plane-machine-set-operator pod starts and lets the upgrade finish. However, errors such as the following can still appear while reconciling the controlplanemachineset, both in the operator pod logs and in the status of the controlplanemachineset CR:
"msg"="Reconciler error" "error"="error reconciling control plane machine set: failed to update control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: \"test123-rhcos\": template must be provided as the full path" "controller"="controlplanemachineset" "reconcileID"="617cd234-ce4d-4ac2-80f7-8a0463b40b6b"
Another bug, OCPBUGS-32357, was opened for this error and is also fixed in OpenShift 4.15.11.
If that last error happens in a 4.15 version without the fix, the workaround is to temporarily delete the validatingwebhookconfiguration controlplanemachineset.machine.openshift.io and recreate it after the controlplanemachineset is created:
$ oc get validatingwebhookconfigurations controlplanemachineset.machine.openshift.io -o yaml > validatingwebhook_controlplanemachineset.yaml
$ oc delete validatingwebhookconfigurations controlplanemachineset.machine.openshift.io
$ oc create -f <controlplanemachineset-file-name.yaml>
$ oc create -f validatingwebhook_controlplanemachineset.yaml
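Before removing the webhook, it may be worth confirming that the operator log really contains this denial. The sketch below assumes the log is piped in on standard input. The function name needs_webhook_workaround is hypothetical, and the match is only a fixed-string search for the message text quoted above.

```shell
# Hypothetical helper: succeed (exit 0) if the log on standard input
# contains the webhook denial about the template path.
needs_webhook_workaround() {
  grep -F -q 'template must be provided as the full path'
}

# Example usage (pod name is a placeholder):
#   oc logs -n openshift-machine-api <operator-pod-name> | needs_webhook_workaround \
#     && echo "webhook denial found in operator log"
```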
Root Cause
A controlplanemachineset CR with a nil or empty failureDomains value triggers a nil pointer dereference in the operator code.
As per the sample VMware vSphere failure domain configuration documentation, defining a failure domain for a control plane machine set on vSphere is a Technology Preview feature only. Failure domains for the vSphere platform were introduced in OpenShift 4.15, and at the time of writing this solution the code does not handle the upgrade scenario where the failure domain field does not contain a valid value.
Diagnostic Steps
- Check the control-plane-machine-set-operator pod state and see if it's in CrashLoopBackOff state:
$ oc get pods -n openshift-machine-api | grep 'control-plane-machine-set'
openshift-machine-api control-plane-machine-set-operator-xxxx 0/1 CrashLoopBackOff 9 1h
- Describe the control-plane-machine-set-operator pod and look for hints about the reason for the CrashLoopBackOff:
$ oc describe pods/<control-plane-machine-set-operator-pod-name> -n openshift-machine-api
[..]
containerStatuses:
- containerID: cri-o://d6eefd961f71273d5a6558aab2f81a385da836f952f9431d8e485d35d6a78c64
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a1426b791d89d0bb2be1ef49d2a5b401f0a741fa7df9252d52f863f0ceabd04
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a1426b791d89d0bb2be1ef49d2a5b401f0a741fa7df9252d52f863f0ceabd04
lastState:
terminated:
containerID: cri-o://d6eefd961f71273d5a6558aab2f81a385da836f952f9431d8e485d35d6a78c64
exitCode: 2
finishedAt: "2024-04-04T09:32:23Z"
reason: Error
startedAt: "2024-04-04T09:29:58Z"
name: control-plane-machine-set-operator
ready: false
restartCount: 9
started: false
state:
waiting:
message: back-off 5m0s restarting failed container=control-plane-machine-set-operator
pod=control-plane-machine-set-operator-c674d9976-z6d6g_openshift-machine-api(1e54ff4b-ebec-46fd-9af0-04dc529e5cc5)
reason: CrashLoopBackOff
[..]
- Check the control-plane-machine-set-operator pod logs and see if it's throwing a runtime error: invalid memory address or nil pointer dereference for the failure domain, as follows:
2024-04-04T09:32:23.597509741Z I0404 09:32:23.597426 1 watch_filters.go:179] reconcile triggered by infrastructure change
2024-04-04T09:32:23.606311553Z I0404 09:32:23.606243 1 controller.go:220] "msg"="Starting workers" "controller"="controlplanemachineset" "worker count"=1
2024-04-04T09:32:23.606360950Z I0404 09:32:23.606340 1 controller.go:169] "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.609322467Z I0404 09:32:23.609217 1 panic.go:884] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.609322467Z I0404 09:32:23.609271 1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference [recovered]
2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference
2024-04-04T09:32:23.612540681Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]
2024-04-04T09:32:23.612540681Z
2024-04-04T09:32:23.612540681Z goroutine 255 [running]:
2024-04-04T09:32:23.612540681Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
2024-04-04T09:32:23.612571624Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
2024-04-04T09:32:23.612571624Z panic({0x1c8ac60, 0x31c6ea0})
2024-04-04T09:32:23.612571624Z /usr/lib/golang/src/runtime/panic.go:884 +0x213
2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.VSphereProviderConfig.ExtractFailureDomain(...)
2024-04-04T09:32:23.612571624Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/vsphere.go:120
2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.providerConfig.ExtractFailureDomain({{0x1f2a71a, 0x7}, {{{{...}, {...}}, {{...}, {...}, {...}, {...}, {...}, {...}, ...}, ...}}, ...})
2024-04-04T09:32:23.612588145Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/providerconfig.go:212 +0x23c
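A saved log can be checked for this specific signature with a filter like the one below, instead of reading the trace by eye. It is a sketch only: is_failure_domain_panic is a hypothetical name, and it looks only for the two strings visible in the excerpt above (the nil pointer message and the VSphereProviderConfig.ExtractFailureDomain frame).

```shell
# Hypothetical helper: succeed (exit 0) only if the log on standard input
# contains both the nil pointer panic and the vSphere failure domain frame.
is_failure_domain_panic() {
  awk '
    /nil pointer dereference/                      { panic = 1 }
    /VSphereProviderConfig\.ExtractFailureDomain/  { frame = 1 }
    END { exit((panic && frame) ? 0 : 1) }
  '
}

# Example usage (pod name is a placeholder; --previous reads the log of the
# last terminated container, which is useful while the pod is crash looping):
#   oc logs -n openshift-machine-api <operator-pod-name> --previous | is_failure_domain_panic \
#     && echo "log matches the failure domain panic"
```

Requiring both strings avoids flagging unrelated nil pointer panics from other code paths.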
- Check whether the controlplanemachineset CR has platform: "" under failureDomains:
$ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api -oyaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
[..]
spec:
replicas: 3
selector:
matchLabels:
machine.openshift.io/cluster-api-cluster: test-xxxx
machine.openshift.io/cluster-api-machine-role: master
machine.openshift.io/cluster-api-machine-type: master
state: Active
strategy:
type: RollingUpdate
template:
machineType: machines_v1beta1_machine_openshift_io
machines_v1beta1_machine_openshift_io:
failureDomains: <======== This field
platform: "" <======== This field
metadata:
labels:
machine.openshift.io/cluster-api-cluster: test-xxxx
machine.openshift.io/cluster-api-machine-role: master
machine.openshift.io/cluster-api-machine-type: master
spec:
lifecycleHooks: {}
metadata: {}
providerSpec:
value:
apiVersion: machine.openshift.io/v1beta1
credentialsSecret:
name: vsphere-cloud-credentials
diskGiB: 120
kind: VSphereMachineProviderSpec
memoryMiB: 32768
metadata:
creationTimestamp: null
network:
devices:
- networkName: ocp-xxxx
numCPUs: 8
numCoresPerSocket: 4
snapshot: ""
template: xxx-123
userDataSecret:
name: master-user-data
workspace:
datacenter: Datacenter
datastore: iscsi_test
folder: /Datacenter/vm/xxx-123
resourcePool: /Datacenter/host/OpenShift/Resources
[..]
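The last check can be scripted against the YAML output. The sketch below assumes the CR has been piped in or saved as YAML with space indentation. The name has_empty_failure_domain_platform is hypothetical, and it only detects the exact pattern shown above: a bare failureDomains key immediately followed by platform: "".

```shell
# Hypothetical helper: succeed (exit 0) if the YAML on standard input has
# a bare "failureDomains:" line directly followed by 'platform: ""'.
has_empty_failure_domain_platform() {
  awk '
    prev_fd && /^ *platform:[ \t]*""[ \t]*$/ { found = 1 }
    { prev_fd = ($0 ~ /^ *failureDomains:[ \t]*$/) }
    END { exit(found ? 0 : 1) }
  '
}

# Example usage:
#   oc get controlplanemachineset.machine.openshift.io cluster \
#     --namespace openshift-machine-api -o yaml | has_empty_failure_domain_platform \
#     && echo "empty failureDomains.platform found"
```

A text match on the YAML is used here because it distinguishes an explicit empty string from a missing field.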