How do I restore from an etcd backup in OpenShift 3.11?
Environment
- Red Hat OpenShift Container Platform (RHOCP)
  - 3.10
  - 3.11
- For previous versions, please use the article below:
  How do I restore from an etcd backup in OpenShift 3.9 and older?
Issue
- How do I restore etcd from a backup?
- How to recover the etcd database if the backup/snapshot is not available?
Resolution
Preparation
- Before starting the restore, make sure to have the etcd snapshot or etcd db file prepared. The snapshot must be taken from a healthy etcd cluster. In case you don't have a healthy etcd cluster, prepare the db file per Option B in the guide below:
  How to create a backup/snapshot of etcd in OpenShift 3.11
- Make sure that you are not running into space issues on any of the masters.
Restore etcd v3
Please follow these steps to restore etcd from a snapshot or an existing db file. A step-by-step example follows the instructions below, in the Diagnostic Steps section.
1) Stopping etcd pods from starting unintentionally
On all etcd members stop the static pods.
IMPORTANT: The following command will cause the entire OpenShift control plane to become inoperative until these services are restored.
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped
NOTE: The above command will stop the master-api and master-controllers pods along with the etcd pod, which is required.
WARNING: During the restore, no service must write to the etcd database.
Before proceeding, confirm the etcd pod is stopped using the following command on each etcd host:
# docker ps | grep master-etcd
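The "is etcd stopped" check can also be scripted. The sketch below is illustrative only: SAMPLE stands in for real `docker ps` output (container ID and image name are made up), and on a live master you would pipe `docker ps` itself instead.

```shell
# Stand-in for real `docker ps` output on a host where etcd is stopped;
# on a live master you would pipe `docker ps` itself.
SAMPLE='CONTAINER ID  IMAGE                       NAMES
abc123def456  registry.redhat.io/ose-pod  k8s_POD_sync-xyz'

if echo "$SAMPLE" | grep -q master-etcd; then
    ETCD_STOPPED=no    # an etcd container is still running
else
    ETCD_STOPPED=yes   # safe to proceed with the restore
fi
echo "etcd stopped: $ETCD_STOPPED"
```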
2) Install the etcdctl binary with the etcd package
Install the etcd rpm package as it provides the etcdctl binary.
# yum install etcd
Mask the etcd systemd service since the installation of the etcd package also creates the etcd systemd service, which is not used.
# systemctl mask etcd
3) Clear the /var/lib/etcd directory.
Move /var/lib/etcd/member data and create a backup.
# mv /var/lib/etcd/member /var/lib/etcd/etcd-backup-$(date +%d-%m-%y)
The /var/lib/etcd directory should NOT contain the member directory. Confirm!
# ls -l /var/lib/etcd
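The move-aside pattern from step 3 can be rehearsed safely before touching the real /var/lib/etcd; everything in this sketch runs inside a throwaway directory created by mktemp.

```shell
# Rehearsal of step 3 in a scratch directory (NOT the real /var/lib/etcd).
ETCD_DIR=$(mktemp -d)
mkdir -p "$ETCD_DIR/member/snap"
touch "$ETCD_DIR/member/snap/db"

# Same mv/backup pattern as in step 3.
BACKUP="$ETCD_DIR/etcd-backup-$(date +%d-%m-%y)"
mv "$ETCD_DIR/member" "$BACKUP"

# The directory must no longer contain 'member'.
ls -l "$ETCD_DIR"
```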
With etcd stopped we can now restore from our snapshot.
4) Running Restore.
Important note:
Before restoring, please read the article fully to understand the commands and their purpose. Note that /etc/etcd/etcd.conf will be sourced, meaning all of its variables will be set in your current session. Those variables are used during the restore. In some cases the ETCD_INITIAL_CLUSTER variable has an empty value or contains only one member. This value should contain all the members in the following format: master1.example.com=https://10.0.0.1:2380,master2.example.com=https://10.0.0.2:2380,....
Make sure to:
- Run the steps simultaneously on each etcd member.
- The --initial-cluster-token and --initial-cluster values need to be the same on all etcd members. Make sure that the value of ETCD_INITIAL_CLUSTER is identical on each etcd member. Fix the value in the /etc/etcd/etcd.conf file.
- If restoring from the copied backup /var/lib/etcd/member/snap/db, the option --skip-hash-check=true needs to be passed. It is not needed if a snapshot was taken and is being used for the restore.
- The result of running the restore command must be identical on each etcd member. The cluster-id and member-id must match.
- Do not start etcd until the restore has been successfully executed on each etcd member.
- If the restore is being done from the snapshot or db file, the file must be the same on each host. If the restore is done from the db file, make sure to copy the db file from one member to the other members into the /tmp directory.
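One way to check the "same file on each host" requirement is to compare checksums. In the sketch below, two local copies stand in for /tmp/db on two different masters; on real hosts you would run sha256sum /tmp/db on each member and compare the printed hashes.

```shell
# Two local copies stand in for /tmp/db on two different masters; on real
# hosts you would run `sha256sum /tmp/db` on each and compare the output.
WORK=$(mktemp -d)
printf 'fake etcd db payload' > "$WORK/db-master1"
cp "$WORK/db-master1" "$WORK/db-master2"

SUM1=$(sha256sum "$WORK/db-master1" | awk '{print $1}')
SUM2=$(sha256sum "$WORK/db-master2" | awk '{print $1}')
[ "$SUM1" = "$SUM2" ] && echo "db files are identical"
```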
4.1) Source the variables from /etc/etcd/etcd.conf config file.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
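To see what sourcing does, here is a tiny, hypothetical etcd.conf written to a temp file (the real /etc/etcd/etcd.conf carries many more variables); sourcing it makes all of its assignments available in the current shell.

```shell
# Hypothetical minimal etcd.conf; the real file has many more variables.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
ETCD_NAME=master1.example.com
ETCD_INITIAL_CLUSTER=master1.example.com=https://10.0.0.1:2380,master2.example.com=https://10.0.0.2:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
EOF

. "$CONF"   # same effect as `source /etc/etcd/etcd.conf`
echo "$ETCD_NAME / $ETCD_INITIAL_CLUSTER_TOKEN"
```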
4.2) Run the following command to confirm that variable ETCD_INITIAL_CLUSTER contains all etcd members in the format hostname=https://IP:2380.
# echo -e "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
4.3) If the hostnames or IP addresses are not correct, change the value of ETCD_INITIAL_CLUSTER in the /etc/etcd/etcd.conf file. Example below:
ETCD_INITIAL_CLUSTER=master1.example.com=https://10.0.88.11:2380,master2.example.com=https://10.0.88.22:2380,master3.example.com=https://10.0.88.33:2380
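A quick shape check of ETCD_INITIAL_CLUSTER (member count, and that every entry looks like name=https://host:2380) can be scripted. The value below is the example from this article, not read from a real etcd.conf.

```shell
# Example value from this article; in practice it comes from /etc/etcd/etcd.conf.
ETCD_INITIAL_CLUSTER=master1.example.com=https://10.0.88.11:2380,master2.example.com=https://10.0.88.22:2380,master3.example.com=https://10.0.88.33:2380

MEMBERS=$(echo "$ETCD_INITIAL_CLUSTER" | tr ',' '\n')
COUNT=$(echo "$MEMBERS" | wc -l)
# Count entries that do NOT look like name=https://host:2380.
BAD=$(echo "$MEMBERS" | grep -vcE '^[^=]+=https://[^:]+:2380$' || true)
echo "members: $COUNT, malformed: $BAD"
```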
4.4) Running the restore command.
Choose Option A or Option B depending on whether the restore is from a snapshot.db file or from an actual db file.
- Option A. Only run if restoring from the snapshot.db file!

  # etcdctl snapshot restore /tmp/snapshot.db \
      --name $ETCD_NAME \
      --initial-cluster $ETCD_INITIAL_CLUSTER \
      --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
      --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
      --data-dir /var/lib/etcd/restore

- Option B. Run only if restoring from the copied db file in /tmp/db!

  # etcdctl snapshot restore /tmp/db \
      --name $ETCD_NAME \
      --data-dir /var/lib/etcd/restore \
      --initial-cluster $ETCD_INITIAL_CLUSTER \
      --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
      --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
      --skip-hash-check=true
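The A/B decision can be captured in a small hypothetical helper that only assembles the command line as a string (it never executes etcdctl), appending --skip-hash-check=true only when the source is a copied db file.

```shell
# Hypothetical helper: only assembles the restore command, never runs etcdctl.
build_restore_cmd() {
    src=$1
    cmd="etcdctl snapshot restore $src --name $ETCD_NAME --initial-cluster $ETCD_INITIAL_CLUSTER --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS --data-dir /var/lib/etcd/restore"
    case "$src" in
        */snapshot.db) ;;                         # Option A: keep the hash check
        *) cmd="$cmd --skip-hash-check=true" ;;   # Option B: copied db file
    esac
    printf '%s\n' "$cmd"
}

# Example values (normally sourced from /etc/etcd/etcd.conf).
ETCD_NAME=master1.example.com
ETCD_INITIAL_CLUSTER=master1.example.com=https://10.0.88.11:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.88.11:2380

CMD_A=$(build_restore_cmd /tmp/snapshot.db)
CMD_B=$(build_restore_cmd /tmp/db)
echo "$CMD_B"
```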
4.5) Move the restored member data from /var/lib/etcd/restore to /var/lib/etcd
# mv /var/lib/etcd/restore/member /var/lib/etcd
# rm -r /var/lib/etcd/restore
4.6) Restore the SELinux context of /var/lib/etcd.
# restorecon -Rv /var/lib/etcd
NOTE: Follow the above steps (1 to 4.6) on every master. Then proceed with step 5 and later.
5) Restart the services
5.1) Once restored, start the etcd static pod on each member one by one. Do not start the master-api and master-controllers services yet.
Do this step quickly on every master:
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
5.2) Check if the etcd containers are started - it should show 2 containers. It can take some time to start.
# docker ps | grep master-etcd
5.3) If the containers did not start after several minutes, try restarting the atomic-openshift-node service. If the etcd containers still don't start some minutes later, look at the atomic-openshift-node logs to troubleshoot why.
5.4) Confirm health of etcd.
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS endpoint status --write-out=table
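The awk one-liner above just joins every ClientURL field with commas. Here it runs against shortened, illustrative `--write-out=fields` output (real `member list` output carries many more fields per member).

```shell
# Shortened, illustrative `etcdctl3 --write-out=fields member list` output;
# real output carries many more fields per member.
SAMPLE='"ClientURL" : "https://10.0.88.11:2379"
"ClientURL" : "https://10.0.88.22:2379"
"ClientURL" : "https://10.0.88.33:2379"'

# Same awk as in step 5.4: join every ClientURL value with commas.
ETCD_ALL_ENDPOINTS=$(echo "$SAMPLE" | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
echo "$ETCD_ALL_ENDPOINTS"
```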
5.5) Start the master-api and master-controllers static pods on each etcd member.
# mv /etc/origin/node/pods-stopped/* /etc/origin/node/pods/
5.6) Check the health of the cluster.
# oc get nodes,pods -n kube-system
Root Cause
Running the restore as described in this guide should end in success, as all etcd members are restored to the same replica of the data.
If the cluster is not healthy for some reason, verify that the commands were executed successfully on each member. If the cluster won't start, contact Red Hat Support.
Diagnostic Steps
Example: Run through with 3 etcd hosts.
master1.etcd.com
master2.etcd.com
master3.etcd.com
# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS endpoint status --write-out=table
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master1.etcd.com:2379 | d91b1c20df818655 | 3.2.22 | 17 MB | true | 6 | 42 |
| https://10.0.88.33:2379 | d35cfd2fedc078f | 3.2.22 | 17 MB | false | 6 | 42 |
| https://10.0.88.22:2379 | c9624828ed10ae36 | 3.2.22 | 17 MB | false | 6 | 42 |
| https://10.0.88.11:2379 | d91b1c20df818655 | 3.2.22 | 17 MB | true | 6 | 42 |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
# mkdir -p /etc/origin/node/pods-stopped/
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
# mv /var/lib/etcd/member /var/lib/etcd/etcd-backup-$(date +%d-%m-%y)
# ssh master1.etcd.com
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
# echo -e "$ETCD_INITIAL_CLUSTER \n$ETCD_INITIAL_CLUSTER_TOKEN"
master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380
etcd-cluster-1
# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
--name master1.etcd.com \
--initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls https://10.0.88.11:2380 \
--data-dir /var/lib/etcd/restore
2019-02-05 12:49:04.103233 I | mvcc: restore compact to 2361744
2019-02-05 12:49:04.135995 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136161 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:49:04.136267 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59
# mv /var/lib/etcd/restore/member /var/lib/etcd
# restorecon -Rv /var/lib/etcd
# ssh master2.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--name master2.etcd.com \
--initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls https://10.0.88.22:2380 \
--data-dir /var/lib/etcd/restore
2019-02-05 12:51:25.179801 I | mvcc: restore compact to 2356950
2019-02-05 12:51:25.193709 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193745 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:51:25.193759 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59
# mv /var/lib/etcd/restore/member /var/lib/etcd
# restorecon -Rv /var/lib/etcd
# ssh master3.etcd.com
# ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--name master3.etcd.com \
--initial-cluster master1.etcd.com=https://10.0.88.11:2380,master2.etcd.com=https://10.0.88.22:2380,master3.etcd.com=https://10.0.88.33:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls https://10.0.88.33:2380 \
--data-dir /var/lib/etcd/restore
2019-02-05 12:53:06.612149 I | mvcc: restore compact to 2356950
2019-02-05 12:53:06.634761 I | etcdserver/membership: added member d35cfd2fedc078f [https://10.0.88.33:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.634905 I | etcdserver/membership: added member c9624828ed10ae36 [https://10.0.88.22:2380] to cluster 1a196dd3442fbe59
2019-02-05 12:53:06.635001 I | etcdserver/membership: added member d91b1c20df818655 [https://10.0.88.11:2380] to cluster 1a196dd3442fbe59
# mv /var/lib/etcd/restore/member /var/lib/etcd
# restorecon -Rv /var/lib/etcd
# ssh master1.etcd.com
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
# ssh master2.etcd.com
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
# ssh master3.etcd.com
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
# ssh master1.etcd.com
# ETCD_ALL_ENDPOINTS=` etcdctl3 --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS endpoint status --write-out=table
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master1.etcd.com:2379 | d91b1c20df818655 | 3.2.22 | 17 MB | true | 6 | 42 |
| https://10.0.88.33:2379 | d35cfd2fedc078f | 3.2.22 | 17 MB | false | 6 | 42 |
| https://10.0.88.22:2379 | c9624828ed10ae36 | 3.2.22 | 17 MB | false | 6 | 42 |
| https://10.0.88.11:2379 | d91b1c20df818655 | 3.2.22 | 17 MB | true | 6 | 42 |
+-----------------------------------+------------------+---------+---------+-----------+-----------+------------+
# ssh master1.etcd.com
# mv /etc/origin/node/pods-stopped/* /etc/origin/node/pods/
# ssh master2.etcd.com
# mv /etc/origin/node/pods-stopped/* /etc/origin/node/pods/
# ssh master3.etcd.com
# mv /etc/origin/node/pods-stopped/* /etc/origin/node/pods/
# oc get nodes,pods -n kube-system
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.