How do I restore from an etcd backup in OpenShift 3.9 and older?

Solution Verified - Updated

Environment

Issue

  • How do I restore etcd from a backup?

Resolution

Backup ETCD

The following needs only to be performed on one etcd host.

Capture a snapshot:

# etcdctl3 snapshot save /var/lib/etcd/snapshot.db
  • On the host, copy the snapshot to a new location. The command above runs etcdctl in a container and saves to a host-mounted volume, and the /var/lib/etcd path will be deleted in later steps.
# cp /var/lib/etcd/snapshot.db /tmp/snapshot.db

Back up existing db:

# cp /var/lib/etcd/member/snap/db /tmp/db
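
Because /var/lib/etcd is removed later in the procedure, it can be worth confirming that each copy in /tmp is byte-identical to its source before continuing. The following is a minimal sketch and not part of the original procedure: the function name copy_verified is hypothetical, and it assumes sha256sum (coreutils) is installed.

```shell
# copy_verified SRC DST: copy a file and confirm that the checksums match.
# Hypothetical helper; assumes sha256sum (coreutils) is available.
copy_verified() (
  src=$1
  dst=$2
  cp -- "$src" "$dst" || exit 1
  src_sum=$(sha256sum < "$src" | awk '{print $1}')
  dst_sum=$(sha256sum < "$dst" | awk '{print $1}')
  if [ "$src_sum" = "$dst_sum" ]; then
    echo "OK $dst"
  else
    echo "MISMATCH $dst" >&2
    exit 1
  fi
)
```

It could be used in place of the plain cp commands, for example: copy_verified /var/lib/etcd/snapshot.db /tmp/snapshot.db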

Restore ETCD v3

Please follow these steps to restore etcd from a snapshot or an existing db file. A step-by-step worked example follows these steps.

On all etcd members stop the etcd, master-api and master-controllers services.

# systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers etcd

#  systemctl status atomic-openshift-master-api atomic-openshift-master-controllers etcd

Please note that master-api and master-controllers must be stopped as well. During the restore, no service must write to the etcd database.
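
Since nothing may write to etcd during the restore, a scripted check that all three units are really stopped can complement reading the systemctl status output. This is a sketch only: all_stopped is a hypothetical helper name, and it relies on systemctl is-active returning success only for active units.

```shell
# all_stopped UNIT...: succeed only if none of the given units is active.
# Hypothetical helper built on "systemctl is-active".
all_stopped() (
  rc=0
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "still running: $unit" >&2
      rc=1
    fi
  done
  exit "$rc"
)
```

For example: all_stopped atomic-openshift-master-api atomic-openshift-master-controllers etcd && echo "safe to restore"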

Install the etcd rpm package if it is not already present, as it provides the etcdctl binary.

# yum install etcd

Remove /var/lib/etcd, or first move the data to a different location to keep a backup.

# mv /var/lib/etcd/member /tmp/etcd-backup-$(date +%d-%m-%y)

# rm -rf /var/lib/etcd

With etcd stopped and /var/lib/etcd removed we can now restore from our snapshot.

Run only if /var/lib/etcd is a dedicated volume
Removing /var/lib/etcd might not be possible because the directory is the mount point of a dedicated volume. In this case, remove the directory contents instead, restore etcd to a temporary directory, and move the restored files to /var/lib/etcd like so:

# rm -rf /var/lib/etcd/*
# etcdctl snapshot restore /tmp/snapshot.db --data-dir /var/lib/etcd/restore # <---- to a temporary directory; use the same --name and --initial-* options as in the restore step
# cd /var/lib/etcd
# mv restore/* .
# rm -rf restore/
  • It is very important that the cluster ID is the same on every restored etcd host.
  • Do not start etcd until the restore has been performed on every etcd host.
  • The values of the --initial-cluster-token and --initial-cluster options must be the same on all restored hosts.
  • If restoring from the copied backup of /var/lib/etcd/member/snap/db, the option --skip-hash-check=true is needed. It is not needed if a snapshot was taken and is being used for the restore.
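
The cluster ID requirement can be checked from the restore output itself, which prints lines ending in "to cluster <id>" (see the example run-through under Diagnostic Steps). The sketch below assumes the output of each host's restore command was saved to a file; the helper names and the file names in the usage note are hypothetical.

```shell
# cluster_ids FILE...: print the distinct cluster IDs found in saved
# "etcdctl snapshot restore" output (lines such as
# "... added member <id> [<url>] to cluster <cluster-id>").
cluster_ids() {
  awk '/added member/ && /to cluster/ {print $NF}' "$@" | sort -u
}

# same_cluster FILE...: succeed only if exactly one distinct ID was found.
same_cluster() {
  count=$(cluster_ids "$@" | wc -l)
  [ "$((count))" -eq 1 ]
}
```

For example, after collecting the three logs on one host: same_cluster /tmp/restore-master1.log /tmp/restore-master2.log /tmp/restore-master3.log || echo "cluster IDs differ, do not start etcd"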

Before restoring, please read the article fully to understand the commands and their purpose. Note that etcd.conf will be sourced, meaning all of its variables will be set in your current session. Those variables are used during the restore. In some cases ETCD_INITIAL_CLUSTER is empty or contains only one member. This value should contain all the members in the following format: master1.example.com=https://10.0.0.1:2380,master2.example.com=https://10.0.0.2:2380,....

Source the variables from etcd config file

# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
  • Confirm that the value lists all etcd hosts in the form hostname=https://IP:2380.
  • If the hosts are not correct, change ETCD_INITIAL_CLUSTER in /etc/etcd/etcd.conf and source the file again.
# echo -e "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
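
The format check described above can also be scripted. The following is a minimal sketch, assuming every member entry uses https and peer port 2380 as in this article; check_initial_cluster is a hypothetical helper name, and it validates the format only, not whether the hosts are reachable.

```shell
# check_initial_cluster VALUE: verify that every comma-separated entry looks
# like name=https://host:2380 and print the number of members found.
# Hypothetical helper; it checks the format only.
check_initial_cluster() (
  set -f            # no filename expansion while splitting the value
  count=0
  IFS=,
  for entry in $1; do
    case $entry in
      ?*=https://?*:2380) count=$((count + 1)) ;;
      *) echo "malformed entry: '$entry'" >&2; exit 1 ;;
    esac
  done
  [ "$count" -gt 0 ] || { echo "no members found" >&2; exit 1; }
  echo "$count"
)
```

Running check_initial_cluster "$ETCD_INITIAL_CLUSTER" after sourcing etcd.conf should print the number of etcd hosts in the cluster; an empty value or a malformed entry makes it fail.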

Choose A or B depending on whether the restore is from snapshot.db or from the db file.

A. If restoring from the snapshot.db run the following:

# etcdctl snapshot restore /tmp/snapshot.db \
  --name $ETCD_NAME \
  --initial-cluster $ETCD_INITIAL_CLUSTER \
  --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --data-dir /var/lib/etcd

B. If restoring from the copied backup of /var/lib/etcd/member/snap/db, run the following:

# etcdctl snapshot restore /tmp/db  \
  --name $ETCD_NAME \
  --data-dir /var/lib/etcd \
  --initial-cluster $ETCD_INITIAL_CLUSTER \
  --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --skip-hash-check=true 
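
Because the restore options must match on every host apart from the name and peer URL, it can help to print the fully expanded command and review it before running it. The sketch below only prints the command and never executes it; restore_cmd is a hypothetical helper that reads the ETCD_* variables sourced from etcd.conf and adds --skip-hash-check=true only for a copied db file.

```shell
# restore_cmd FILE KIND: print (do not run) the restore command for this host.
# KIND is "snapshot" for a file written by "snapshot save", or "db" for a
# copy of member/snap/db, which needs --skip-hash-check=true.
# Hypothetical helper; reads the ETCD_* variables sourced from etcd.conf.
restore_cmd() (
  file=$1
  kind=$2
  : "${ETCD_NAME:?not set}" "${ETCD_INITIAL_CLUSTER:?not set}"
  : "${ETCD_INITIAL_CLUSTER_TOKEN:?not set}"
  : "${ETCD_INITIAL_ADVERTISE_PEER_URLS:?not set}"
  extra=""
  case $kind in
    snapshot) ;;
    db) extra=" --skip-hash-check=true" ;;
    *) echo "KIND must be snapshot or db" >&2; exit 1 ;;
  esac
  printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name %s --initial-cluster %s --initial-cluster-token %s --initial-advertise-peer-urls %s --data-dir /var/lib/etcd%s\n' \
    "$file" "$ETCD_NAME" "$ETCD_INITIAL_CLUSTER" "$ETCD_INITIAL_CLUSTER_TOKEN" \
    "$ETCD_INITIAL_ADVERTISE_PEER_URLS" "$extra"
)
```

For example, restore_cmd /tmp/snapshot.db snapshot prints the command for option A and restore_cmd /tmp/db db prints the command for option B.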

Change the ownership of /var/lib/etcd

# chown etcd:etcd -R /var/lib/etcd

Restore the SELinux context of /var/lib/etcd

# restorecon -Rv /var/lib/etcd
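
Before starting etcd, the ownership change can be double-checked by listing anything under the data directory that is not owned by the expected user. This is a small sketch; not_owned_by is a hypothetical helper name built on find.

```shell
# not_owned_by OWNER DIR: list every path under DIR whose owner is not OWNER.
# Hypothetical helper; empty output means the ownership change was complete.
not_owned_by() {
  find "$2" ! -user "$1" -print
}
```

After the chown above, not_owned_by etcd /var/lib/etcd should print nothing.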

Once the restore has completed on every etcd host, start etcd on each host

# systemctl start etcd

# systemctl status etcd

Confirm health of etcd

# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 
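
One thing to look for in the resulting table is that exactly one member ID reports IS LEADER as true. The sketch below reads the table on stdin and prints the distinct leader IDs; it assumes the column layout shown in the example under Diagnostic Steps (ID in the second column, IS LEADER in the fifth), and leader_ids is a hypothetical helper name.

```shell
# leader_ids: read "endpoint status --write-out=table" output on stdin and
# print the distinct member IDs that report IS LEADER = true.
# Assumes the column order ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | ...
leader_ids() {
  awk -F'|' '$2 ~ /https?:/ {
    id = $3; lead = $6
    gsub(/ /, "", id); gsub(/ /, "", lead)
    if (lead == "true") print id
  }' | sort -u
}
```

For example, piping the endpoint status command above into leader_ids should print a single ID on a healthy cluster, even if the same member is listed under both its hostname and its IP address.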

Start the atomic-openshift-master-api and atomic-openshift-master-controllers services on each master.

# systemctl start atomic-openshift-master-api atomic-openshift-master-controllers

Check the health of the cluster

# oc get nodes,pods -n  kube-system

Diagnostic Steps

Example run-through with 3 etcd hosts.

ETCD Hosts:

  • master1.etcd.com
  • master2.etcd.com
  • master3.etcd.com
# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|           ENDPOINT                |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|     https://master1.etcd.com:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
|           https://10.0.88.33:2379 |  d35cfd2fedc078f |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.22:2379 | c9624828ed10ae36 |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.11:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+


# etcdctl3 snapshot save /var/lib/etcd/snapshot.db

# cp /var/lib/etcd/snapshot.db /tmp/snapshot.db
# cp /var/lib/etcd/member/snap/db /tmp/db

# scp /tmp/snapshot.db master2.etcd.com:/tmp/snapshot.db
# scp /tmp/snapshot.db master3.etcd.com:/tmp/snapshot.db

# systemctl stop etcd atomic-openshift-master-api atomic-openshift-master-controllers
# mv /var/lib/etcd/member /tmp/etcd-backup-$(date +%d-%m-%y)
# rm -rf /var/lib/etcd

# ssh master2.etcd.com
# systemctl stop etcd atomic-openshift-master-api atomic-openshift-master-controllers
# mv /var/lib/etcd/member /tmp/etcd-backup-$(date +%d-%m-%y)
# rm -rf /var/lib/etcd

# ssh master3.etcd.com
# systemctl stop etcd atomic-openshift-master-api atomic-openshift-master-controllers
# mv /var/lib/etcd/member /tmp/etcd-backup-$(date +%d-%m-%y)
# rm -rf /var/lib/etcd

# ssh master1.etcd.com 
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
# echo -e  "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
  master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380  
  etcd-cluster-1

# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master1.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.11:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:49:04.103233 I | mvcc: restore compact to 2361744
2019-02-05 12:49:04.135995 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136161 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136267 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# chown etcd:etcd -R /var/lib/etcd

# restorecon -Rv /var/lib/etcd

# ssh master2.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master2.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.22:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:51:25.179801 I | mvcc: restore compact to 2356950
2019-02-05 12:51:25.193709 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193745 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193759 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# chown etcd:etcd -R /var/lib/etcd

# restorecon -Rv /var/lib/etcd

# ssh master3.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name master3.etcd.com \
  --initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.88.33:2380 \
  --data-dir /var/lib/etcd 
2019-02-05 12:53:06.612149 I | mvcc: restore compact to 2356950
2019-02-05 12:53:06.634761 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.634905 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.635001 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59

# chown etcd:etcd -R /var/lib/etcd

# restorecon -Rv /var/lib/etcd

# ssh master1.etcd.com
# systemctl start etcd
# systemctl status etcd

# ssh master2.etcd.com
# systemctl start etcd
# systemctl status etcd

# ssh master3.etcd.com
# systemctl start etcd
# systemctl status etcd

# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS  endpoint status  --write-out=table 
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|           ENDPOINT                |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
|     https://master1.etcd.com:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
|           https://10.0.88.33:2379 |  d35cfd2fedc078f |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.22:2379 | c9624828ed10ae36 |  3.2.22 |   17 MB |     false |         6 |       42   |
|           https://10.0.88.11:2379 | d91b1c20df818655 |  3.2.22 |   17 MB |      true |         6 |       42   |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+

# ssh master1.etcd.com
# systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
# systemctl status etcd atomic-openshift-master-api atomic-openshift-master-controllers

# ssh master2.etcd.com
# systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
# systemctl status etcd atomic-openshift-master-api atomic-openshift-master-controllers

# ssh master3.etcd.com
# systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
# systemctl status etcd atomic-openshift-master-api atomic-openshift-master-controllers

# oc get nodes,pods -n  kube-system

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.