OpenShift's etcd cluster version was updated to 3.3

Solution Verified - Updated 14 Jun 2024

Environment

OpenShift Container Platform
- 3.9
- 3.10
- 3.11

Issue

After the etcd static pod restarts, the member reports version 3.3.11 instead of the expected 3.2.x
The image tag shows 3.2.22, but the version reported by the etcdctl command is 3.3.11
etcd error: etcdserver/membership: cluster cannot be downgraded (current version: 3.2.22 is lower than determined cluster version: 3.3)

Resolution

OpenShift 3.x is not verified to work with etcd-3.3. etcd must be downgraded from 3.3 to 3.2, and for the downgrade to succeed etcd must be restored from a snapshot.

Downgrading the cluster to 3.2 requires a window of downtime for the cluster while etcd is restored.

If you see this problem, open a new support case in the Customer Portal.

To confirm whether your etcd cluster has been upgraded, check the etcd logs for messages showing that 3.3 capabilities have been enabled.

# docker logs `docker ps -q -l --filter "label=io.kubernetes.container.name=etcd"`  2>&1 
...
etcdserver/api: enabled capabilities for version 3.3
etcdserver/membership: set the cluster version to 3.3 from store
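
The log check above can be scripted. The following is a minimal sketch, not part of the original procedure: the function name cluster_version_from_logs is hypothetical, and it relies only on the "set the cluster version to" message shown above.

```shell
# Read etcd log text on stdin and print the cluster version taken from the
# "set the cluster version to X" message. Prints nothing when the message
# is absent. The function name is hypothetical.
cluster_version_from_logs() {
  sed -n 's/.*set the cluster version to \([0-9][0-9.]*\).*/\1/p' | tail -n 1
}

# Example (run on an etcd host):
#   docker logs `docker ps -q -l --filter "label=io.kubernetes.container.name=etcd"` 2>&1 \
#     | cluster_version_from_logs
```

An output of 3.3 indicates that the cluster version has already been raised.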

If only some of the etcd hosts were upgraded, a member is still running 3.2, and the cluster version still shows as 3.2, a restore from snapshot is not needed and the members can be downgraded without downtime. See the following KCS: How can I downgrade an etcd member when the cluster version is lower than the etcd member version.
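
The choice between the two paths depends on the cluster version, not on the version of any single member. As a rough sketch (downgrade_path is a hypothetical helper, and the cluster version passed to it is assumed to come from the etcd logs):

```shell
# Map the cluster version reported by etcd to the recovery path described
# in this article. The function name and output labels are hypothetical.
downgrade_path() {
  case "$1" in
    3.3|3.3.*) echo "restore-from-snapshot" ;;   # cluster version already raised
    3.2|3.2.*) echo "member-downgrade-only" ;;   # cluster version still 3.2
    *)         echo "unknown" ;;
  esac
}

# Example:
#   downgrade_path 3.3
```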

Downgrade and Restore ETCD steps

Follow these steps to downgrade etcd and restore it from a snapshot. A step-by-step worked example follows the steps.

# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table

Capture a snapshot:

# etcdctl3 snapshot save /var/lib/etcd/snapshot.db

- On the host, copy the snapshot to a new location. The command above runs etcdctl in a container and saves to a host-mounted volume, and /var/lib/etcd is deleted in a later step.
# cp /var/lib/etcd/snapshot.db /tmp/snapshot.db

Back up existing db:

# cp /var/lib/etcd/member/snap/db /tmp/db
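
Because /var/lib/etcd is removed later, it can be worth confirming that the copy of the snapshot is intact before continuing. A minimal sketch, assuming a hypothetical helper named backup_and_verify:

```shell
# Copy a file and confirm that the copy is byte-identical to the source.
# Returns non-zero if the copy fails or the files differ.
# The function name is hypothetical.
backup_and_verify() {
  cp "$1" "$2" && cmp -s "$1" "$2"
}

# Example:
#   backup_and_verify /var/lib/etcd/snapshot.db /tmp/snapshot.db \
#     || echo "snapshot copy failed or differs"
```

This is intended for snapshot.db. The live db file under /var/lib/etcd/member/snap can change while etcd is running, so a mismatch there does not necessarily mean the copy failed.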

Copy the snapshot to all etcd hosts.

# scp /tmp/snapshot.db etcd_host:/tmp/snapshot.db

On all etcd members:

- Ensure the correct image is pulled and referenced
- Stop the docker and atomic-openshift-node services
- Remove /var/lib/etcd

- Confirm etcd image tag in static pod manifest is using tag 3.2.22
# awk '/image/ {print $2}' /etc/origin/node/pods/etcd.yaml 
registry.redhat.io/rhel7/etcd:3.2.22
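
The manifest check can also be expressed as a pass/fail test. The sketch below is an illustration only: manifest_image_tags and manifest_uses_tag are hypothetical names, and the parsing assumes each image reference ends in ":tag" rather than a digest.

```shell
# Print the tag of every image referenced in a static pod manifest.
# Assumes lines of the form "image: registry/path/name:tag".
manifest_image_tags() {
  awk '/image:/ {print $NF}' "$1" | sed 's/.*://'
}

# Succeed only if the manifest references at least one image and every
# referenced image uses the tag given as the second argument.
manifest_uses_tag() {
  tags=$(manifest_image_tags "$1")
  [ -n "$tags" ] || return 1
  for t in $tags; do
    [ "$t" = "$2" ] || return 1
  done
}

# Example:
#   manifest_uses_tag /etc/origin/node/pods/etcd.yaml 3.2.22 || echo "unexpected tag"
```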

- Confirm that the pulled etcd:3.2.22 image contains the 3.2.22 rpm version.
# docker pull registry.redhat.io/rhel7/etcd:3.2.22
# docker run --rm -it --entrypoint rpm registry.redhat.io/rhel7/etcd:3.2.22 -qa etcd
etcd-3.2.22-1.el7.x86_64

# systemctl stop docker atomic-openshift-node etcd
# rm -rf /var/lib/etcd

Before restoring from a snapshot, etcd must be installed on every host so that the etcdctl CLI can be used.

- Make sure the etcd rpm version is 3.2.22 
# rpm -qa etcd 

- If etcd is not installed, install etcd-3.2.22
# yum install etcd-3.2.22

- If etcd is installed but the version is etcd-3.3.x, downgrade it
# yum downgrade etcd-3.2.22

- If etcd was installed or downgraded, /var/lib/etcd needs to be removed again.
# rm -rf /var/lib/etcd
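
The three rpm cases above can be condensed into one decision. This is a sketch under the assumption that the output of rpm -qa etcd is passed in as a single string; etcd_rpm_action is a hypothetical name.

```shell
# Map the output of `rpm -qa etcd` to the action described above.
#   empty output       -> install
#   etcd-3.2.22-...    -> ok
#   etcd-3.3.x-...     -> downgrade
# The function name and labels are hypothetical.
etcd_rpm_action() {
  case "$1" in
    "")            echo "install" ;;
    etcd-3.2.22-*) echo "ok" ;;
    etcd-3.3.*)    echo "downgrade" ;;
    *)             echo "unknown" ;;
  esac
}

# Example:
#   etcd_rpm_action "$(rpm -qa etcd)"
```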

With etcd stopped and /var/lib/etcd removed we can now restore from our snapshot.

It is very important that after each restore the cluster ID is the same on every restored etcd host.
Do not start etcd until the restore has been completed on every etcd host.
The values of the --initial-cluster-token and --initial-cluster options need to be the same on all restored hosts.
If restoring from the copied backup of /var/lib/etcd/member/snap/db, the option --skip-hash-check=true is needed. It is not needed if a snapshot was taken and is being used for the restore.

# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3

- Confirm that ETCD_INITIAL_CLUSTER lists every etcd host in the form hostname=https://IP:2380
# echo -e "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
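
The format check on ETCD_INITIAL_CLUSTER can be automated. The following is a minimal sketch: validate_initial_cluster is a hypothetical name, and it assumes IPv4 addresses or hostnames with peer port 2380, as in this article.

```shell
# Check that every comma-separated entry of the given value has the form
# name=https://host:2380. Prints each bad entry and returns non-zero if
# any entry does not match. The function name is hypothetical.
validate_initial_cluster() {
  printf '%s\n' "$1" | tr ',' '\n' | awk '
    BEGIN { bad = 0 }
    $0 !~ "^[^=]+=https://[^:/]+:2380$" { print "bad entry: " $0; bad = 1 }
    END { exit bad }'
}

# Example (after sourcing /etc/etcd/etcd.conf):
#   validate_initial_cluster "$ETCD_INITIAL_CLUSTER"
```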

- If restoring from snapshot.db, run the following:
# etcdctl snapshot restore /tmp/snapshot.db \
  --name $ETCD_NAME \
  --initial-cluster $ETCD_INITIAL_CLUSTER \
  --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --data-dir /var/lib/etcd
  
- If restoring from the copied backup of /var/lib/etcd/member/snap/db, run the following:
# etcdctl snapshot restore /tmp/db  \
  --name $ETCD_NAME \
  --data-dir /var/lib/etcd \
  --initial-cluster $ETCD_INITIAL_CLUSTER \
  --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --skip-hash-check=true 
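
The restore prints "added member ... to cluster <id>" lines, which can be used to confirm that the cluster ID matches on every host before anything is started. A sketch, assuming the restore output from each host was saved to a file (the file names and both function names are hypothetical):

```shell
# Print the distinct cluster IDs found in etcdctl snapshot restore output
# read from stdin.
restore_cluster_ids() {
  sed -n 's/.*added member .* to cluster \([0-9a-f][0-9a-f]*\).*/\1/p' | sort -u
}

# Succeed only if exactly one distinct cluster ID appears in the input.
cluster_ids_consistent() {
  restore_cluster_ids | awk 'END { exit (NR == 1 ? 0 : 1) }'
}

# Example, with the restore output of each host saved to a file:
#   cat restore-master1.log restore-master2.log restore-master3.log \
#     | cluster_ids_consistent || echo "cluster IDs differ, do not start etcd"
```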

- If running etcd as a systemd unit (3.9 and below) change the ownership of /var/lib/etcd
# chown etcd:etcd -R /var/lib/etcd

Restore the SELinux context of /var/lib/etcd

# restorecon -Rv /var/lib/etcd

Once restored start docker and atomic-openshift-node. If running etcd as a systemd unit (3.9 and below only), start the etcd service as well.

# systemctl start docker atomic-openshift-node

Confirm the health of etcd and the cluster

# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 

# oc get nodes,pods -n  kube-system

Example run-through with 3 etcd hosts.

ETCD Hosts:

master1.etcd.com
master2.etcd.com
master3.etcd.com

# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|           ENDPOINT                |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|     https://master1.etcd.com:2379 | d91b1c20df818655 |  3.3.11 |   17 MB |      true |         6 |       7863 |
|           https://10.0.88.33:2379 |  d35cfd2fedc078f |  3.3.11 |   17 MB |     false |         6 |       7863 |
|           https://10.0.88.22:2379 | c9624828ed10ae36 |  3.3.11 |   17 MB |     false |         6 |       7863 |
|           https://10.0.88.11:2379 | d91b1c20df818655 |  3.3.11 |   17 MB |      true |         6 |       7863 |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+


# etcdctl3 snapshot save /var/lib/etcd/snapshot.db

# cp /var/lib/etcd/snapshot.db /tmp/snapshot.db
# cp /var/lib/etcd/member/snap/db /tmp/db

# scp /tmp/snapshot.db master2.etcd.com:/tmp/snapshot.db
# scp /tmp/snapshot.db master3.etcd.com:/tmp/snapshot.db

# systemctl stop docker atomic-openshift-node etcd 
# docker pull registry.redhat.io/rhel7/etcd:3.2.22
# rm -rf /var/lib/etcd

# ssh master2.etcd.com
# systemctl stop docker atomic-openshift-node etcd 
# docker pull registry.redhat.io/rhel7/etcd:3.2.22
# rm -rf /var/lib/etcd


# ssh master3.etcd.com
# systemctl stop docker atomic-openshift-node etcd 
# docker pull registry.redhat.io/rhel7/etcd:3.2.22
# rm -rf /var/lib/etcd


# ssh master1.etcd.com 
# rpm -qa etcd 
etcd-3.2.22-1.el7.x86_64
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
# echo -e  "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
  master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380  
  etcd-cluster-1

# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master1.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.11:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:49:04.103233 I | mvcc: restore compact to 2361744
2019-02-05 12:49:04.135995 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136161 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136267 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# restorecon -Rv /var/lib/etcd

# ssh master2.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master2.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.22:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:51:25.179801 I | mvcc: restore compact to 2356950
2019-02-05 12:51:25.193709 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193745 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193759 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# restorecon -Rv /var/lib/etcd


# ssh master3.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master3.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.33:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:53:06.612149 I | mvcc: restore compact to 2356950
2019-02-05 12:53:06.634761 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.634905 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.635001 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# restorecon -Rv /var/lib/etcd

# ssh master1.etcd.com
# systemctl start docker atomic-openshift-node

# ssh master2.etcd.com
# systemctl start docker atomic-openshift-node

# ssh master3.etcd.com
# systemctl start docker atomic-openshift-node

# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|           ENDPOINT                |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|     https://master1.etcd.com:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
|           https://10.0.88.33:2379 |  d35cfd2fedc078f |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.22:2379 | c9624828ed10ae36 |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.11:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+

Root Cause

This is a bug in the image: the image contains a newer etcd binary than 3.2.x.
The issue is being investigated in bug 1672344.

The latest etcd:3.2.22 image "registry.redhat.io/rhel7/etcd:3.2.22" contains etcd-3.3.11

# docker run --rm -it --entrypoint /bin/rpm  registry.redhat.io/rhel7/etcd@sha256:d9ee301257dd54a62219ba6090aebb298aae07941939ee9a555e02522bab82be -qa etcd
etcd-3.3.11-2.el7.x86_64

A downgrade to the etcd:3.2.22-18 image "registry.redhat.io/rhel7/etcd:3.2.22-18", which contains etcd-3.2.22, is needed.

# docker run -it --entrypoint /bin/rpm  registry.redhat.io/rhel7/etcd@sha256:6f5b73f472277b9b3f66148bf20247e33f04121236ad25715c1c272af29e620c -qa etcd
etcd-3.2.22-1.el7.x86_64
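
The mismatch between an image tag and the rpm inside the image can be detected by comparing the two version strings. This is a minimal sketch; tag_matches_rpm is a hypothetical helper and assumes the rpm name has the form etcd-VERSION-RELEASE.ARCH as shown above.

```shell
# Succeed if the rpm name (for example etcd-3.2.22-1.el7.x86_64) carries the
# same etcd version as the image tag (for example 3.2.22 or 3.2.22-18).
# The function name is hypothetical.
tag_matches_rpm() {
  tag_version=${1%%-*}             # drop a "-18" style suffix from the tag
  rpm_version=${2#etcd-}           # drop the package name
  rpm_version=${rpm_version%%-*}   # drop the release and architecture
  [ "$tag_version" = "$rpm_version" ]
}

# Example (omit -t when capturing the output so no terminal characters are added):
#   rpm_name=$(docker run --rm --entrypoint /bin/rpm registry.redhat.io/rhel7/etcd:3.2.22 -qa etcd)
#   tag_matches_rpm 3.2.22 "$rpm_name" || echo "image tag and rpm version differ"
```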

Diagnostic Steps

Run the following:

# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table
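
The table output can be reduced to a per-member version list and checked in one step. The sketch below assumes the column layout of the endpoint status table shown in this article; endpoint_versions and all_members_at are hypothetical names.

```shell
# Print "endpoint version" for every data row of an endpoint status table
# read from stdin. Border and header rows are skipped.
endpoint_versions() {
  awk -F'|' 'NF >= 8 && $2 !~ /ENDPOINT/ {
    gsub(/ /, "", $2); gsub(/ /, "", $4); print $2, $4
  }'
}

# Succeed only if at least one member is listed and every member reports a
# version that starts with the prefix given as the first argument.
all_members_at() {
  endpoint_versions | awk -v want="$1" '
    { n++; if (index($2, want) != 1) bad = 1 }
    END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# Example:
#   etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS endpoint status --write-out=table \
#     | all_members_at 3.2. || echo "at least one member is not running 3.2.x"
```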

SBR

Shift

Product(s)

Red Hat OpenShift Container Platform

Components

etcd

Category

Troubleshoot
