How do I remove and add back an existing etcd member for the OpenShift cluster?

Solution Verified - Updated

Environment

OpenShift Enterprise Container Platform 3

  • 3.7
  • 3.9
  • 3.10
  • 3.11

Issue

  • One member is failing with cluster ID mismatch.
  • The data in /var/lib/etcd on one etcd member was corrupted or lost.

Resolution

If the cluster is not healthy, a restore of the whole cluster from a backup might be needed; see the OpenShift documentation on restoring etcd for those steps.

If the cluster is still healthy but only one member is failing due to ID or data issues, the easiest approach is to remove that member and add it back to the cluster after wiping its etcd storage data.

From an etcd host that is healthy, source the configurations for etcd:

# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3

Next, set the ETCD_ALL_ENDPOINTS variable. This is done in one of two ways, depending on whether you are using containerized etcd (OpenShift Container Platform 3.10 and 3.11) or RPM-based etcd:

  • 3.10 and 3.11
# ETCD_ALL_ENDPOINTS=$(etcdctl3 --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
  • 3.9 and earlier
# ETCD_ALL_ENDPOINTS=$(etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
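The awk pipeline above joins the ClientURL lines of the `--write-out=fields` output into a single comma-separated endpoint list. A minimal sketch against sample output (the URLs below are hypothetical) shows the transformation:

```shell
# Hypothetical '--write-out=fields member list' output; only the
# ClientURL lines matter to the pipeline.
sample='"ClientURL" : "https://10.0.0.11:2379"
"ClientURL" : "https://10.0.0.12:2379"
"ClientURL" : "https://10.0.0.13:2379"'

# Same awk program as above: print the third whitespace-separated field
# of each ClientURL line, inserting a comma before every URL but the first.
ETCD_ALL_ENDPOINTS=$(echo "$sample" | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
echo "$ETCD_ALL_ENDPOINTS"
```

Note that the surrounding double quotes from the fields output are carried into each URL.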

Confirm cluster health and member health:

  • 3.11 and earlier with etcd v3 protocol
# etcdctl3 --write-out=table member list

Get the ID of the member that is unhealthy (commands are for the etcd v3 protocol):

# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS --write-out=table endpoint status
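The ID needed by the later `member remove` step is the first column of each `member list` row. As a sketch (the ID, hostname, and URLs below are made up), it can be extracted from the simple output format like this:

```shell
# One hypothetical row of 'etcdctl3 member list' (simple output format):
line='8e9e05c52164694d, started, master-1.example.com, https://10.0.0.11:2380, https://10.0.0.11:2379'

# The first comma-separated field is the member ID that 'member remove' expects.
member_id=$(echo "$line" | awk -F', ' '{print $1}')
echo "$member_id"
```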

Remove the etcd member from the cluster:

# etcdctl3 member remove <memberID>

Connect to the removed etcd member, stop the etcd service, and wipe the etcd data:

  • 3.10 and 3.11: stop the etcd static pod
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped/
  • 3.7 to 3.9: stop the etcd service
# systemctl stop etcd
  • On either version, wipe the etcd data
# rm -Rf /var/lib/etcd/*
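Because `rm -Rf` on the wrong path is destructive, a slightly safer variant parameterizes the data directory and uses the `${VAR:?}` expansion, which aborts the command if the variable is empty or unset. A sketch against a throwaway directory (the path below is hypothetical; the real data directory is /var/lib/etcd):

```shell
ETCD_DATA_DIR=/tmp/etcd-data-example      # hypothetical path for illustration
mkdir -p "$ETCD_DATA_DIR/member/snap" "$ETCD_DATA_DIR/member/wal"

# ${ETCD_DATA_DIR:?} makes the shell abort (instead of expanding to "")
# if the variable is unset, so this can never turn into 'rm -rf /*'.
rm -rf "${ETCD_DATA_DIR:?}"/*

# The directory itself remains, but its contents are gone.
ls -A "$ETCD_DATA_DIR"
```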

Connect to a healthy etcd member and run the following command to add the removed member back to the cluster, where:
- memberName is the hostname of the removed member
- memberIP is the IP address of the removed member

# etcdctl3 member add <memberName> --peer-urls="https://<memberIP>:2380"

After the member add command runs, three lines are printed to stdout. On the newly added member, make sure that the values in /etc/etcd/etcd.conf match this output; if they do not, edit /etc/etcd/etcd.conf so that they do.

ETCD_NAME="<membername>"
ETCD_INITIAL_CLUSTER="<membername>=https://<ip>:2380, . . ."
ETCD_INITIAL_CLUSTER_STATE="existing"
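The most common mistake at this step is leaving ETCD_INITIAL_CLUSTER_STATE set to "new", which makes the member try to bootstrap a fresh cluster instead of joining the existing one. A sketch of checking the value with grep, using a hypothetical copy of the file (on the real host this is /etc/etcd/etcd.conf, and the hostnames/IPs below are made up):

```shell
# Hypothetical stand-in for /etc/etcd/etcd.conf after editing:
conf=$(mktemp)
cat > "$conf" <<'EOF'
ETCD_NAME="master-3.example.com"
ETCD_INITIAL_CLUSTER="master-1.example.com=https://10.0.0.11:2380,master-3.example.com=https://10.0.0.13:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
EOF

# The state must be "existing" for the member to join the current cluster.
grep -q '^ETCD_INITIAL_CLUSTER_STATE="existing"$' "$conf" && echo "cluster state OK"
```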

On the newly added member start the etcd service again:

  • 3.10 and 3.11: start the etcd static pod
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
  • 3.7 to 3.9: start the etcd service
# systemctl start etcd

Confirm the health of the cluster and members:

# etcdctl3 --write-out=table member list
# ETCD_ALL_ENDPOINTS=$(etcdctl3 --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
# etcdctl3 --endpoints=$ETCD_ALL_ENDPOINTS endpoint health
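Every member should report healthy. As a sketch, a health check could be scripted by counting the "is healthy" lines in the output; the sample below is hypothetical `endpoint health` output for a three-member cluster (URLs and timings made up):

```shell
# Hypothetical 'endpoint health' output, one line per member:
health='https://10.0.0.11:2379 is healthy: successfully committed proposal: took = 1.887ms
https://10.0.0.12:2379 is healthy: successfully committed proposal: took = 2.102ms
https://10.0.0.13:2379 is healthy: successfully committed proposal: took = 1.954ms'

# Count the members reporting healthy; for a three-member cluster this must be 3.
healthy=$(echo "$health" | grep -c 'is healthy')
echo "$healthy of 3 members healthy"
```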

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.