Recover ETCD quorum guard pod after a failing OpenShift 4 update


Environment

  • Red Hat OpenShift Container Platform (OCP)
    • 4.6 and newer

Issue

  • The "master" MachineConfigPool is stuck in the "Updating" phase
  • All ETCD cluster members are up and running, but one of the ETCD quorum guard pods does not pass the health check

Resolution

  • Verify that all 3 ETCD cluster members are healthy; if they are, continue with the next step:
$ for pod in $(oc -n openshift-etcd get pod -o name -l app=etcd); do echo "${pod}:"; oc -n openshift-etcd rsh -c etcdctl ${pod} bash -c "etcdctl endpoint health; etcdctl endpoint status;etcdctl alarm list"; echo; done
pod/etcd-master-0.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 12.042072ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 12.665521ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 17.806308ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573347, 573347, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573347, 573347, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573348, 573348, 

pod/etcd-master-1.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 20.093043ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 22.037383ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 31.635692ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573354, 573354, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573354, 573354, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573354, 573354, 

pod/etcd-master-2.example.net:
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 14.754869ms
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 14.584083ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 15.782309ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573356, 573356, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573356, 573356, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573356, 573356,
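Each comma-separated line of `etcdctl endpoint status` output reads: endpoint, member ID, version, DB size, leader flag, learner flag, raft term, raft index, raft applied index. A healthy cluster has exactly one leader. As a minimal sketch, the leader count can be extracted from the sample output above:

```shell
# Sample `etcdctl endpoint status` lines copied from the output above.
status='https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573347, 573347,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573347, 573347,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573348, 573348,'

# Field 5 is the leader flag; count how many members claim leadership.
leaders=$(printf '%s\n' "$status" | awk -F', ' '$5 == "true" {n++} END {print n+0}')
echo "leaders: ${leaders}"
```

Exactly one `true` in the leader column, together with an empty alarm list, is what you want to see before moving on.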
  • Delete all 3 ETCD quorum guard pods so that they are rescheduled:
$ oc project openshift-etcd
$ oc get pod -l name=etcd-quorum-guard -o wide
$ oc delete pod -l name=etcd-quorum-guard
  • Observe whether the ETCD quorum guard pods recover with accurate liveness reporting:
$ oc get pod -l name=etcd-quorum-guard -o wide    # check that all ETCD quorum guard pods are up and running
$ oc logs <one quorum guard pod>
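After the delete, the three replacement pods should all come back as 1/1 Running. A minimal sketch of that check, run here against sample output (the pod name suffixes are hypothetical placeholders, not pods from this cluster):

```shell
# Sample `oc get pod -l name=etcd-quorum-guard` output; the pod name
# suffixes are hypothetical placeholders.
pods='etcd-quorum-guard-6d9d749bd6-aaaaa   1/1   Running   0   2m
etcd-quorum-guard-6d9d749bd6-bbbbb   1/1   Running   0   2m
etcd-quorum-guard-6d9d749bd6-ccccc   1/1   Running   0   2m'

# Count pods that are not fully ready or not in the Running state.
not_ready=$(printf '%s\n' "$pods" | awk '$2 != "1/1" || $3 != "Running" {n++} END {print n+0}')
echo "pods not ready: ${not_ready}"
```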

If none of these steps resolves the issue, open a support case with Red Hat.

Diagnostic Steps

  • The MachineConfigPool for "master" nodes is stuck in the "Updating" phase:
$ oc get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
master  rendered-master-1789bade89c7972eb0d38560937a486d  False    True      False     3             0                  0                    0                     9d
...
  • One of the master nodes is in the SchedulingDisabled state due to the stuck upgrade:
$ oc get node -l node-role.kubernetes.io/master
NAME                  STATUS                    ROLES   AGE  VERSION
master-0.example.net  Ready                     master  25h  v1.19.16+8203b20
master-1.example.net  Ready,SchedulingDisabled  master  26h  v1.19.16+8203b20
master-2.example.net  Ready                     master  25h  v1.19.16+8203b20
...
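The cordoned node can be picked out of the `oc get node` output programmatically; this sketch runs against the sample output above:

```shell
# Sample `oc get node` output copied from above (header row omitted).
nodes='master-0.example.net  Ready                     master  25h  v1.19.16+8203b20
master-1.example.net  Ready,SchedulingDisabled  master  26h  v1.19.16+8203b20
master-2.example.net  Ready                     master  25h  v1.19.16+8203b20'

# The STATUS column carries the SchedulingDisabled flag for cordoned nodes.
cordoned=$(printf '%s\n' "$nodes" | awk '$2 ~ /SchedulingDisabled/ {print $1}')
echo "cordoned: ${cordoned}"
```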
  • ETCD cluster looks healthy:
$ for pod in $(oc -n openshift-etcd get pod -o name -l app=etcd); do echo "${pod}:"; oc -n openshift-etcd rsh -c etcdctl ${pod} bash -c "etcdctl endpoint health; etcdctl endpoint status;etcdctl alarm list"; echo; done
pod/etcd-master-0.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 12.042072ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 12.665521ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 17.806308ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573347, 573347, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573347, 573347, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573348, 573348, 

pod/etcd-master-1.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 20.093043ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 22.037383ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 31.635692ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573354, 573354, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573354, 573354, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573354, 573354, 

pod/etcd-master-2.example.net:
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 14.754869ms
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 14.584083ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 15.782309ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573356, 573356, 
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573356, 573356, 
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573356, 573356,

but one of the ETCD quorum guard pods is NotReady:

$ oc get deployment etcd-quorum-guard  -n openshift-etcd
NAME               READY  UP-TO-DATE  AVAILABLE  AGE
etcd-quorum-guard  2/3    3           2          9d
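The deployment's READY column (2/3 here) already pinpoints the problem: one replica short of the desired count. A minimal sketch that parses the sample line above:

```shell
# Sample deployment line copied from the output above.
line='etcd-quorum-guard  2/3    3           2          9d'

# READY is reported as current/desired.
ready=$(printf '%s\n' "$line" | awk '{print $2}')
current=${ready%/*}
desired=${ready#*/}
if [ "$current" -lt "$desired" ]; then
  echo "quorum guard not fully ready: ${current}/${desired}"
fi
```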

The machine-config-daemon pod for the master node in the "SchedulingDisabled" state shows:

$ oc logs <machine-config-daemon pod> -n openshift-machine-config-operator
...
2020-06-04T09:41:29.902463113Z I0604 09:41:29.902411 2103012 update.go:89] error when evicting pod "etcd-quorum-guard-6d9d749bd6-ll9pm" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
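The eviction error names the pod that is blocking the node drain (and thereby the update). A small sketch that pulls the pod name out of the sample log line above:

```shell
# Sample machine-config-daemon log message copied from above.
logline="error when evicting pod \"etcd-quorum-guard-6d9d749bd6-ll9pm\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."

# Extract the quoted pod name from the eviction error.
pod=$(printf '%s' "$logline" | sed -n 's/.*evicting pod "\([^"]*\)".*/\1/p')
echo "blocked pod: ${pod}"
```

Because the quorum guard deployment ships with a PodDisruptionBudget, a quorum guard pod that never becomes Ready blocks eviction indefinitely, which is why deleting the pods (rather than waiting for the drain) unblocks the update.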

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.