etcd pod is failing to start after updating to OpenShift Container Platform 4.9.28 or 4.10.9
Environment
- Red Hat OpenShift Container Platform (OCP)
  - 4.9.28 or later
  - 4.10.9 or later
Issue
- After updating to OCP 4.9.28 or 4.10.9 (or later), one etcd pod fails to start and the etcd operator is in a degraded state.
- The etcd pod fails to start and reports `found data inconsistency with peers`.
Resolution
- The `--experimental-initial-corrupt-check=true` flag on the etcd pod may prevent the etcd pod from starting in order to avoid etcd data inconsistency, as described in Potential etcd data inconsistency issue in OCP 4.9 and 4.10.
- To recover the faulty etcd member, replace it by following the Replacing an unhealthy etcd member documentation.
  - In the exceptional case where etcd loses quorum because of the problem or procedure described above, follow Restoring to a previous cluster state to restore the cluster to the previously known working state.
- If you encounter issues with these procedures, please contact Red Hat Technical Support and/or open a new Support Case in the Red Hat Customer Portal.
Root Cause
- With Red Hat OpenShift Container Platform 4.9.28 and 4.10.9, Red Hat introduced the `--experimental-initial-corrupt-check=true` flag for etcd to detect etcd members that may have a corrupt etcd database.
- The `--experimental-initial-corrupt-check=true` flag may prevent problematic etcd members from starting and causes `found data inconsistency with peers` messages to be reported in their logs. It also sets the etcd Cluster Operator to a degraded state because of the faulty etcd member.
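To confirm that a member is affected by this check, it can help to search a saved copy of that member's etcd log for the message quoted above. The following is a minimal sketch, assuming the log has already been saved as plain text. The function names and the sample log lines are hypothetical illustrations and are not real etcd output; only the message text and the flag come from this article.

```python
from typing import List

# Message text and flag as quoted in this article.
INCONSISTENCY_MESSAGE = "found data inconsistency with peers"
CORRUPT_CHECK_FLAG = "--experimental-initial-corrupt-check=true"


def find_inconsistency_lines(log_text: str) -> List[str]:
    """Return every log line that contains the inconsistency message."""
    return [
        line for line in log_text.splitlines()
        if INCONSISTENCY_MESSAGE in line
    ]


def has_corrupt_check_flag(args: List[str]) -> bool:
    """Return True if the argument list enables the initial corrupt check."""
    return CORRUPT_CHECK_FLAG in args


if __name__ == "__main__":
    # Hypothetical sample text; real etcd log lines are formatted differently.
    sample_log = (
        "member starting\n"
        "example-member found data inconsistency with peers\n"
        "member stopped\n"
    )
    for line in find_inconsistency_lines(sample_log):
        print(line)
```

A plain substring match is used rather than a parser because the exact log line format is not described here. A non-empty result only suggests that the member is the one to replace through the documented procedure.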