Upgrade process gets stuck in the etcd operator. Etcd pod crashes with the error "Init:CrashLoopBackOff"

Solution Verified - Updated 17 May 2024

Environment

Red Hat OpenShift Container Platform (RHOCP) 4.12.
Virtualized environments.
Baremetal hosts.

Issue

The error manifests once the cluster starts the upgrade from 4.12.z to 4.13.z.
The cluster upgrade process does not move forward.
Etcd pods keep crashing with the status Init:CrashLoopBackOff.
Restarting the pod or the node does not fix the issue.

Resolution

The issue is most likely caused by the underlying virtualization technology being configured to hide certain CPU features, typically to enable live migration between hypervisors. It can also occur on baremetal hosts with old CPU models that do not support newer CPU architecture levels.
The following solution explains the problem in detail and provides advice, because the fix depends on the virtualization technology that hosts the OpenShift cluster.
To fix the issue, the CPU compatibility setting needs to be raised to a level that supports x86-64-v2.

Root Cause

OpenShift 4.13 CoreOS and container images are based on Red Hat Enterprise Linux (RHEL) 9.2, which has a higher CPU requirement than older versions: it requires a CPU compatible with the x86-64-v2 instruction set or higher.
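
To illustrate what this requirement means in practice: the x86-64-v2 level adds the CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSSE3, SSE4.1 and SSE4.2 features on top of the baseline x86-64 instruction set. The following is a minimal sketch that checks a CPU flags string for the corresponding /proc/cpuinfo flag names. The helper name missing_v2_flags is hypothetical, and the sketch assumes the baseline x86-64 features are already present.

```shell
#!/bin/sh
# /proc/cpuinfo names for the features that x86-64-v2 adds to the baseline
# (pni is how Linux reports SSE3).
V2_FLAGS="cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3"

# missing_v2_flags: hypothetical helper, not part of any existing tool.
# $1: space-separated CPU flags, as found on a "flags" line of /proc/cpuinfo.
# Prints the missing x86-64-v2 flags separated by spaces (empty if none).
missing_v2_flags() {
    missing=""
    for flag in $V2_FLAGS; do
        case " $1 " in
            *" $flag "*) ;;                                # flag is present
            *) missing="${missing:+$missing }$flag" ;;     # record missing flag
        esac
    done
    printf '%s\n' "$missing"
}

# Usage on a node (for example from a debug shell):
#   missing_v2_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

An empty result suggests that the CPU exposed to the operating system provides the x86-64-v2 additions. Any flag printed is one that the hypervisor hides or the hardware lacks.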

Diagnostic Steps

The upgrade process is stuck in the etcd operator:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.30   True        True          3h15m   Unable to apply 4.13.27: wait has exceeded 40 minutes for these operators: etcd

The etcd operator is listed as degraded (DEGRADED=True):

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.12.30   True        True          True       307d

Checking the etcd pods shows that the member with the newest revision is crash looping:

$ oc get pods -n openshift-etcd --show-labels

NAME                                     READY   STATUS                  RESTARTS   AGE    LABELS
etcd-master0.example.com                 0/4     Init:CrashLoopBackOff   36         3h     k8s-app=etcd,revision=12,app=etcd,etcd=true
etcd-master1.example.com                 4/4     Running                 4          132d   revision=11,app=etcd,etcd=true,k8s-app=etcd
etcd-master2.example.com                 4/4     Running                 4          132d   revision=11,app=etcd,etcd=true,k8s-app=etcd
installer-12-master0.example.com         0/1     Completed               0          3h     app=installer
revision-pruner-11-master0.example.com   0/1     Completed               0          132d   app=pruner
revision-pruner-11-master1.example.com   0/1     Completed               0          132d   app=pruner
revision-pruner-11-master2.example.com   0/1     Completed               0          132d   app=pruner
revision-pruner-12-master0.example.com   0/1     Completed               0          3h     app=pruner
revision-pruner-12-master1.example.com   0/1     Completed               0          3h     app=pruner
revision-pruner-12-master2.example.com   0/1     Completed               0          3h     app=pruner
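
To pick out the affected members without reading the whole table, the pod list can be filtered by status. This is a minimal sketch: crashlooping_pods is a hypothetical helper name, and it assumes the default column order of oc get pods (NAME, READY, STATUS).

```shell
#!/bin/sh
# crashlooping_pods: hypothetical helper, not part of the oc client.
# Reads the table printed by `oc get pods` on stdin and prints the names
# of pods whose STATUS column contains CrashLoopBackOff.
crashlooping_pods() {
    awk '$3 ~ /CrashLoopBackOff/ { print $1 }'
}

# Usage (needs access to the cluster):
#   oc get pods -n openshift-etcd | crashlooping_pods
```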

Investigating the failed pod shows a problem with one of the init containers:

$ oc get pod etcd-master0.example.com -n openshift-etcd -o json | jq .status.initContainerStatuses

[
  {
    "containerID": "cri-o://5fe68e9e0fe4107b9b687f9cf2f7e7bd83e065913aaeedfc9f15e2b916a470e3",
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
    "lastState": {
      "terminated": {
        "containerID": "cri-o://5fe68e9e0fe4107b9b687f9cf2f7e7bd83e065913aaeedfc9f15e2b916a470e3",
        "exitCode": 127,
        "finishedAt": "2024-01-17T09:06:28Z",
        "message": "Fatal glibc error: CPU does not support x86-64-v2\n",
        "reason": "Error",
        "startedAt": "2024-01-17T09:06:28Z"
      }
    },
    "name": "setup",
    "ready": false,
    "restartCount": 36,
    "state": {
      "waiting": {
        "message": "back-off 5m0s restarting failed container=setup pod=etcd-master0.example.com_openshift-etcd(ea39f48300a27626ad2f9c9d038d911a)",
        "reason": "CrashLoopBackOff"
      }
    }
  },
  {
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
    "imageID": "",
    "lastState": {},
    "name": "etcd-ensure-env-vars",
    "ready": false,
    "restartCount": 0,
    "state": {
      "waiting": {
        "reason": "PodInitializing"
      }
    }
  },
  {
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:977a89aa4b6d846c909990ed5a6b1f0f2be2e8d45c595ac4daae332ab914927e",
    "imageID": "",
    "lastState": {},
    "name": "etcd-resources-copy",
    "ready": false,
    "restartCount": 0,
    "state": {
      "waiting": {
        "reason": "PodInitializing"
      }
    }
  }
]

The failing container is an init container called setup, so its logs can be obtained using oc logs with the -c option:

$ oc logs etcd-master0.example.com -n openshift-etcd -c setup
2024-01-17T09:06:28.544503155Z Fatal glibc error: CPU does not support x86-64-v2

One quick way to check whether a node supports the x86-64-v2 instruction set is to run the following commands in the guest operating system:

$ oc debug node/master0.example.com
$ chroot /host
$ /lib64/ld-linux-x86-64.so.2 --help | grep "Subdirectories of glibc-hwcaps directories" -A5

Output when x86-64-v2 is supported:

Subdirectories of glibc-hwcaps directories, in priority order:
  x86-64-v4
  x86-64-v3 (supported, searched) 
  x86-64-v2 (supported, searched)

When x86-64-v2 is not supported, the output lists only the architecture levels, without the word "supported":

Subdirectories of glibc-hwcaps directories, in priority order:
  x86-64-v4
  x86-64-v3 
  x86-64-v2
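
The check above can be wrapped in a small helper to test several nodes before starting an upgrade. This is a sketch under stated assumptions: supports_v2 is a hypothetical helper name, it relies on the output format shown above, and the commented loop assumes a user allowed to run oc debug against nodes.

```shell
#!/bin/sh
# supports_v2: hypothetical helper, not part of any existing tool.
# Reads the dynamic loader's --help text on stdin and returns exit status 0
# only if the x86-64-v2 line is marked as "(supported ...".
supports_v2() {
    grep -Eq '^[[:space:]]*x86-64-v2[[:space:]]+\(supported'
}

# Example loop over all nodes (commented out because it needs a cluster):
# for node in $(oc get nodes -o name); do
#     if oc debug "$node" -- chroot /host /lib64/ld-linux-x86-64.so.2 --help 2>/dev/null | supports_v2; then
#         echo "$node: x86-64-v2 supported"
#     else
#         echo "$node: x86-64-v2 NOT supported"
#     fi
# done
```

Any node reported as not supported would be expected to hit the same glibc error once it runs RHEL 9.2 based images.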

SBR

Shift

Product(s)

Red Hat OpenShift Container Platform

Components

etcd
Node

Category

Upgrade
