Recover ETCD quorum guard pod after a failing OpenShift 4 update
Environment
- Red Hat OpenShift Container Platform (OCP)
- 4.6 and newer
Issue
- The "master" MachineConfigPool is stuck in the "Updating" phase.
- All ETCD cluster members are up and running, but one of the ETCD quorum guard pods does not pass the health check.
Resolution
- Verify that all 3 ETCD cluster members are running. If they are, go to the next step.
$ for pod in $(oc -n openshift-etcd get pod -o name -l app=etcd); do echo "${pod}:"; oc -n openshift-etcd rsh -c etcdctl ${pod} bash -c "etcdctl endpoint health; etcdctl endpoint status;etcdctl alarm list"; echo; done
pod/etcd-master-0.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 12.042072ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 12.665521ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 17.806308ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573347, 573347,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573347, 573347,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573348, 573348,
pod/etcd-master-1.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 20.093043ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 22.037383ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 31.635692ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573354, 573354,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573354, 573354,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573354, 573354,
pod/etcd-master-2.example.net:
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 14.754869ms
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 14.584083ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 15.782309ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573356, 573356,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573356, 573356,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573356, 573356,
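As a rough aid for this check, the health lines can be counted instead of read by eye. The sketch below assumes the output format shown above; the function name count_healthy_endpoints is hypothetical and is not part of oc or etcdctl.

```shell
# Hypothetical helper: count the endpoints reported as healthy.
# Reads `etcdctl endpoint health` output on stdin and prints the number
# of lines that contain "is healthy".
count_healthy_endpoints() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that
  # from aborting a script that runs under "set -e".
  grep -c 'is healthy' || true
}
```

For example, piping the output of "etcdctl endpoint health" from one of the etcd pods into the function (with 2>&1 added, in case the etcdctl version in use writes its report to stderr) should print 3 when all members are healthy.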
- Delete all 3 ETCD quorum guard pods so that they are rescheduled:
$ oc project openshift-etcd
$ oc get pod -l name=etcd-quorum-guard -o wide
$ oc delete pod -l name=etcd-quorum-guard
- Observe whether the ETCD quorum guard pods recover and report their liveness accurately:
$ oc get pod -l name=etcd-quorum-guard -o wide    # check that all ETCD quorum guard pods are up and running
$ oc logs <one quorum guard pod>
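The delete-and-observe steps above can be sketched as a single function. This is a minimal sketch, not a supported procedure: the function name restart_quorum_guard, the OC override variable, and the 5 minute timeout are assumptions made for illustration.

```shell
# Sketch: restart the quorum guard pods and wait for the deployment to settle.
# OC defaults to the real client; set OC=echo to print the commands instead
# of running them.
OC="${OC:-oc}"
NS="openshift-etcd"

restart_quorum_guard() {
  # Delete the pods; the etcd-quorum-guard deployment recreates them.
  "$OC" delete pod -n "$NS" -l name=etcd-quorum-guard || return 1
  # Wait until the deployment reports its replicas as available again,
  # or give up after 5 minutes (an arbitrary value).
  "$OC" rollout status deployment/etcd-quorum-guard -n "$NS" --timeout=5m
}
```

Running "OC=echo restart_quorum_guard" first shows the exact commands without touching the cluster. A non-zero exit status from the function means the pods did not become available within the timeout.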
If none of these steps helps, please open a support case with Red Hat.
Diagnostic Steps
- The MachineConfigPool for "master" nodes is stuck in the Updating phase:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-1789bade89c7972eb0d38560937a486d   False     True       False      3              0                   0                     0                      9d
...
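To pick out the affected pool without reading the table, the UPDATING column can be filtered. The sketch below assumes the column order shown above (UPDATING is the fourth column); updating_pools is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of pools whose UPDATING column is True.
# Reads `oc get mcp` output (including the header line) on stdin.
updating_pools() {
  # NR > 1 skips the header; $4 is the UPDATING column, $1 the pool name.
  awk 'NR > 1 && $4 == "True" { print $1 }'
}
```

Usage would be "oc get mcp | updating_pools", which prints "master" for the output above.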
- One of the master nodes is in the SchedulingDisabled state due to the stuck upgrade:
$ oc get node -l node-role.kubernetes.io/master
NAME                   STATUS                     ROLES    AGE   VERSION
master-0.example.net   Ready                      master   25h   v1.19.16+8203b20
master-1.example.net   Ready,SchedulingDisabled   master   26h   v1.19.16+8203b20
master-2.example.net   Ready                      master   25h   v1.19.16+8203b20
...
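The cordoned node can also be extracted from this output with a short filter. This is a sketch that assumes the column layout shown above; cordoned_nodes is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains "SchedulingDisabled".
# Reads `oc get node` output (including the header line) on stdin.
cordoned_nodes() {
  # NR > 1 skips the header; $2 is the STATUS column, $1 the node name.
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}
```

Usage would be "oc get node -l node-role.kubernetes.io/master | cordoned_nodes", which prints "master-1.example.net" for the output above.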
- The ETCD cluster looks healthy:
$ for pod in $(oc -n openshift-etcd get pod -o name -l app=etcd); do echo "${pod}:"; oc -n openshift-etcd rsh -c etcdctl ${pod} bash -c "etcdctl endpoint health; etcdctl endpoint status;etcdctl alarm list"; echo; done
pod/etcd-master-0.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 12.042072ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 12.665521ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 17.806308ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573347, 573347,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573347, 573347,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573348, 573348,
pod/etcd-master-1.example.net:
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 20.093043ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 22.037383ms
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 31.635692ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573354, 573354,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573354, 573354,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573354, 573354,
pod/etcd-master-2.example.net:
https://10.0.0.171:2379 is healthy: successfully committed proposal: took = 14.754869ms
https://10.0.0.223:2379 is healthy: successfully committed proposal: took = 14.584083ms
https://10.0.0.83:2379 is healthy: successfully committed proposal: took = 15.782309ms
https://10.0.0.83:2379, f018a26a8c828e3b, 3.4.9, 82 MB, false, false, 14, 573356, 573356,
https://10.0.0.223:2379, d4df83c01e2229ce, 3.4.9, 83 MB, true, false, 14, 573356, 573356,
https://10.0.0.171:2379, 2b9c3abf8789cf11, 3.4.9, 82 MB, false, false, 14, 573356, 573356,
- However, one of the ETCD quorum guard pods is NotReady:
$ oc get deployment etcd-quorum-guard -n openshift-etcd
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
etcd-quorum-guard   2/3     3            2           9d
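For scripted checks, the READY value can be turned into an exit status. The sketch below assumes a value of the form "ready/desired" as shown in the READY column above; all_replicas_ready is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if a READY value such as "3/3"
# shows that every desired replica is ready.
all_replicas_ready() {
  ready="${1%%/*}"    # part before the slash, e.g. 2
  desired="${1##*/}"  # part after the slash, e.g. 3
  [ "$ready" -eq "$desired" ]
}
```

With the output above, "all_replicas_ready 2/3" returns a non-zero status. The argument could be taken from the second column of "oc get deployment etcd-quorum-guard -n openshift-etcd --no-headers".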
- The machine-config-daemon pod for the master node in the "SchedulingDisabled" state logs the following:
$ oc logs <machine-config-daemon pod> -n openshift-machine-config-operator
...
2020-06-04T09:41:29.902463113Z I0604 09:41:29.902411 2103012 update.go:89] error when evicting pod "etcd-quorum-guard-6d9d749bd6-ll9pm" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
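To see which pod is blocking the drain, the pod name can be pulled out of these eviction errors. The sketch below assumes the log message format shown above; blocked_pods is a hypothetical helper name.

```shell
# Hypothetical helper: list the pods that the machine-config-daemon
# failed to evict.
# Reads machine-config-daemon log lines on stdin and prints each pod name
# once.
blocked_pods() {
  # Capture the quoted pod name that follows "error when evicting pod",
  # print only matching lines, then drop duplicates caused by retries.
  sed -n 's/.*error when evicting pod "\([^"]*\)".*/\1/p' | sort -u
}
```

Usage would be "oc logs <machine-config-daemon pod> -n openshift-machine-config-operator | blocked_pods". For the log line above it prints "etcd-quorum-guard-6d9d749bd6-ll9pm".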