Upgrade process gets stuck in the etcd operator. Etcd pod crashes with the error "Init:CrashLoopBackOff"
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4.12.
- Virtualized environments.
- Baremetal hosts.
Issue
- The error manifests once the cluster starts the upgrade from 4.12.z to 4.13.z.
- Cluster upgrade process doesn't move forward.
- Etcd pods keep crashing with the status Init:CrashLoopBackOff.
- Restarting the pod or the node does not fix the issue.
Resolution
The issue is most likely caused either by the underlying virtualization technology being configured to hide certain CPU features (one reason for doing this is to enable live migration between hypervisors), or by bare-metal hosts with old CPU models that do not support newer CPU architecture levels.
The following solution explains the problem in detail and provides advice, since the fix depends on the virtualization technology that hosts the OpenShift cluster.
To fix the issue, the CPU compatibility setting needs to be raised to a level that supports x86-64-v2.
Root Cause
The OpenShift 4.13 CoreOS and container images are based on Red Hat Enterprise Linux (RHEL) 9.2, which has higher CPU requirements than earlier versions: it requires a CPU compatible with the x86-64-v2 instruction set level or higher.
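The x86-64-v2 level adds a fixed set of CPU features on top of the x86-64 baseline, so a host or guest can be checked by looking for the matching flags in /proc/cpuinfo. The sketch below is only an illustration and is not part of the original solution: the function name missing_v2_flags is hypothetical, and the flag names are the ones the Linux kernel uses for those features.

```shell
# Print the x86-64-v2 CPU flags that are absent from a flag list.
# Argument: the space-separated flags, as found on the "flags" line of /proc/cpuinfo.
# (Hypothetical helper, not an OpenShift or RHEL tool.)
missing_v2_flags() {
    cpu_flags=" $1 "
    missing=""
    # Kernel names for the features that x86-64-v2 adds to the baseline:
    # CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3 (pni), SSE4.1, SSE4.2, SSSE3
    for flag in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do
        case "$cpu_flags" in
            *" $flag "*) ;;                      # flag is present
            *) missing="$missing $flag" ;;       # flag is hidden or unsupported
        esac
    done
    # Drop the leading space before printing
    echo "${missing# }"
}
```

For example, running missing_v2_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)" on a node prints an empty line when every required flag is exposed to the operating system, and otherwise prints the flags that the hypervisor CPU model hides or the physical CPU lacks.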
Diagnostic Steps
- The upgrade process is stuck in the etcd operator:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.30 True True 3h15m Unable to apply 4.13.27: wait has exceeded 40 minutes for these operators: etcd
- The etcd operator is listed as DEGRADED=True:
$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.12.30 True True True 307d
- Checking the etcd pods shows that the member with the newest revision is in Init:CrashLoopBackOff:
$ oc get pods -n openshift-etcd --show-labels
NAME READY STATUS RESTARTS AGE LABELS
etcd-master0.example.com 0/4 Init:CrashLoopBackOff 36 3h k8s-app=etcd,revision=12,app=etcd,etcd=true
etcd-master1.example.com 4/4 Running 4 132d revision=11,app=etcd,etcd=true,k8s-app=etcd
etcd-master2.example.com 4/4 Running 4 132d revision=11,app=etcd,etcd=true,k8s-app=etcd
installer-12-master0.example.com 0/1 Completed 0 3h app=installer
revision-pruner-11-master0.example.com 0/1 Completed 0 132d app=pruner
revision-pruner-11-master1.example.com 0/1 Completed 0 132d app=pruner
revision-pruner-11-master2.example.com 0/1 Completed 0 132d app=pruner
revision-pruner-12-master0.example.com 0/1 Completed 0 3h app=pruner
revision-pruner-12-master1.example.com 0/1 Completed 0 3h app=pruner
revision-pruner-12-master2.example.com 0/1 Completed 0 3h app=pruner
- Investigating the failed pod shows a problem with one of the init containers:
$ oc get pod -n openshift-etcd etcd-master0.example.com -o json | jq .status.initContainerStatuses
[
{
"containerID": "cri-o://5fe68e9e0fe4107b9b687f9cf2f7e7bd83e065913aaeedfc9f15e2b916a470e3",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
"lastState": {
"terminated": {
"containerID": "cri-o://5fe68e9e0fe4107b9b687f9cf2f7e7bd83e065913aaeedfc9f15e2b916a470e3",
"exitCode": 127,
"finishedAt": "2024-01-17T09:06:28Z",
"message": "Fatal glibc error: CPU does not support x86-64-v2\n",
"reason": "Error",
"startedAt": "2024-01-17T09:06:28Z"
}
},
"name": "setup",
"ready": false,
"restartCount": 36,
"state": {
"waiting": {
"message": "back-off 5m0s restarting failed container=setup pod=etcd-master0.example.com_openshift-etcd(ea39f48300a27626ad2f9c9d038d911a)",
"reason": "CrashLoopBackOff"
}
}
},
{
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
"imageID": "",
"lastState": {},
"name": "etcd-ensure-env-vars",
"ready": false,
"restartCount": 0,
"state": {
"waiting": {
"reason": "PodInitializing"
}
}
},
{
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
"imageID": "",
"lastState": {},
"name": "etcd-resources-copy",
"ready": false,
"restartCount": 0,
"state": {
"waiting": {
"reason": "PodInitializing"
}
}
}
]
- The failing container is an init container called setup, so its logs can be obtained using oc logs:
$ oc logs -n openshift-etcd etcd-master0.example.com -c setup
2024-01-17T09:06:28.544503155Z Fatal glibc error: CPU does not support x86-64-v2
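Since an etcd pod can crash-loop for unrelated reasons, it can help to confirm that the glibc message is really present before changing any CPU settings. The following is a minimal sketch under stated assumptions: both function names are hypothetical, a logged-in oc session is assumed, and the app=etcd label is the one visible in the pod listing above.

```shell
# Exit status 0 when the log text on stdin contains the glibc CPU-level error.
has_cpu_level_error() {
    grep -q 'Fatal glibc error: CPU does not support x86-64-v[0-9]'
}

# Print the etcd pods whose "setup" init container logged that error.
# Pods without readable setup logs are silently skipped.
list_affected_etcd_pods() {
    for pod in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
        if oc logs -n openshift-etcd "$pod" -c setup 2>/dev/null | has_cpu_level_error; then
            echo "$pod"
        fi
    done
}
```

Calling list_affected_etcd_pods prints one pod name per line for each member that hit the error. An empty result suggests that the crash loop has a different cause.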
One quick way to check whether a node supports the x86-64-v2 instruction set level is to run the following commands in the guest operating system, for example from a debug shell on the node:
$ oc debug node/master0.example.com
$ chroot /host
$ /lib64/ld-linux-x86-64.so.2 --help |grep "Subdirectories of glibc-hwcaps directories" -A5
Output when x86-64-v2 is supported:
Subdirectories of glibc-hwcaps directories, in priority order:
x86-64-v4
x86-64-v3 (supported, searched)
x86-64-v2 (supported, searched)
When x86-64-v2 is not supported, the output lists only the architecture levels, without the word "supported":
Subdirectories of glibc-hwcaps directories, in priority order:
x86-64-v4
x86-64-v3
x86-64-v2
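The manual check above can be repeated for every control plane node with a small loop. This is a sketch rather than a supported procedure: the function names are hypothetical, a logged-in oc session with permission to run oc debug is assumed, and the nodes are selected with the node-role.kubernetes.io/master label.

```shell
# Exit status 0 when the dynamic loader help text on stdin marks x86-64-v2 as supported.
supports_x86_64_v2() {
    grep -Eq '^[[:space:]]*x86-64-v2 \(supported'
}

# Run the loader check on each control plane node and report the result.
# Note: a debug session that fails to start is also reported as NOT supported.
check_control_plane_nodes() {
    for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
        if oc debug "$node" -- chroot /host /lib64/ld-linux-x86-64.so.2 --help 2>/dev/null | supports_x86_64_v2; then
            echo "$node: x86-64-v2 supported"
        else
            echo "$node: x86-64-v2 NOT supported"
        fi
    done
}
```

Calling check_control_plane_nodes prints one line per node. Any node reported as NOT supported should be verified by hand with the commands above before the CPU compatibility setting is changed.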