Potential etcd data inconsistency issue in OCP 4.9 and 4.10
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.9.0 through 4.9.32
- 4.10.0 through 4.10.12
Issue
- Red Hat has become aware of a potential data inconsistency issue present in etcd 3.5.0 through 3.5.2.
- OpenShift Container Platform (OCP) versions 4.9.0 through 4.9.32 and 4.10.0 through 4.10.12 utilize the affected versions of etcd.
- The following questions may arise as a result:
  - What should I do until a fix has been delivered?
  - We have upgraded from 4.8 to 4.9. Are we at risk of any etcd problems?
  - Does my cluster have inconsistent etcd data?
  - Are 4.9.z, 4.9 to 4.10, or 4.10.z updates impacted by this?
Resolution
- Starting with Red Hat OpenShift Container Platform 4.9.33 and 4.10.13, the fix for the root cause of the inconsistency is available. It is recommended to update to Red Hat OpenShift Container Platform 4.9.33 or 4.10.13 or later to prevent this known condition in earlier versions of etcd 3.5.
- With Red Hat OpenShift Container Platform 4.9.28 and 4.10.9, etcd is now configured to use the --experimental-initial-corrupt-check=true flag, which may prevent etcd members with inconsistent data from joining the etcd cluster. Please review "etcd pod is failing to start after updating to OpenShift Container Platform 4.9.28 or 4.10.9" for more information on what to do when a problematic etcd member is identified.
  - Note: While the above mitigation restores Red Hat OpenShift Container Platform operational capabilities (through detection), it does not address the root cause of the issue. As the root cause is only addressed in a later version of Red Hat OpenShift Container Platform 4, it is strongly recommended to update to at least Red Hat OpenShift Container Platform 4.9.28 or 4.10.9 to apply the currently available mitigation.
- If you encounter issues with these procedures, please contact Red Hat Technical Support or open a new Support Case in the Red Hat Customer Portal.
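To confirm that the mitigation is active, the corruption-check flag should appear on the etcd container's command line. The sketch below is illustrative only: the etcd command line is hard-coded as sample data, since on a live cluster you would obtain it with oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{..command}'.

```shell
# Sample etcd command line; on a real cluster, populate this from the
# output of: oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{..command}'
etcd_cmd='etcd --logger=zap --experimental-initial-corrupt-check=true'

# Grep for the mitigation flag introduced in OCP 4.9.28 / 4.10.9.
if printf '%s\n' "$etcd_cmd" | grep -q -- '--experimental-initial-corrupt-check=true'; then
  echo "corrupt check enabled"
else
  echo "corrupt check missing - consider updating to 4.9.28 / 4.10.9 or later"
fi
```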
Root Cause
- The issue occurs when the etcd process is shut down in an uncontrolled manner while operating under high load. During normal operations, clusters are at limited risk of encountering this issue.
- The kill -9 command and Out Of Memory (OOM) kills are canonical examples, as are OpenShift Container Platform control plane nodes running out of available memory or disk space.
  - Note: Controlled shutdowns do not result in data inconsistency.
- Red Hat recommends ensuring that you have sufficient memory / storage on control plane nodes to avoid this issue until other mitigation is available.
- Recommended control plane node size information can be found in the Control plane node sizing documentation.
- If necessary, scale up control plane hosts following procedures outlined in Solution 5597381.
- For more technical details, please refer to the upstream issue and the following bug: RHBZ#2068601.
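Since memory and disk pressure on control plane nodes are the main triggers, a simple headroom check can help catch the risk early. The sketch below is a hypothetical helper, not from the article: the usage value and the 80% threshold are assumptions for illustration, and on a real node the value would come from something like df --output=pcent /var/lib/etcd.

```shell
# Sample disk usage percentage; on a real control plane node, derive this
# from e.g.: df --output=pcent /var/lib/etcd | tr -dc '0-9'
used_pct=87

# Assumed warning threshold (not an official Red Hat value).
threshold=80

if [ "$used_pct" -gt "$threshold" ]; then
  echo "WARNING: etcd disk usage ${used_pct}% exceeds ${threshold}%"
else
  echo "disk usage OK"
fi
```

The same pattern can be applied to available memory to reduce the chance of OOM kills hitting the etcd process.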
Diagnostic Steps
- Check etcdctl endpoint status for the revision information and see if it differs between members.
  - One indicator of possible corruption is a large difference in revision number. It is normal for some etcd members to lag behind the leader by a few revisions (fewer than 10); if one gets behind, it should catch up to the leader. If a member's revision differs greatly from the leader and the other members, it could be a sign that divergence is starting.

      $ oc get pods -n openshift-etcd | grep etcd
      $ oc rsh -n openshift-etcd [etcd-pod]
      sh-4.2# etcdctl endpoint status -w fields --cluster

    The fields (or JSON) output format shows the revision.
- Another sign can be continually failing leader elections of operator pods, such as in the example below.

      error retrieving resource lock xxxxxx/fc7f2af9.xxx.com: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/xxxxx/leases/fc7f2af9.xxx.com": context deadline exceeded
      failed to renew lease xxxxx/fc7f2af9.xxx.com: timed out waiting for the condition

- Note: Please check the events on ConfigMaps and/or Leases (look for "reason": "LeaderElection"), as these can provide indicators that etcd isn't behaving properly.
- Note: Leader elections are expected with operators due to how control leases work. However, this shouldn't be happening constantly with a large number of operators at the same time.
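To spot operators that are constantly re-electing, the LeaderElection events can be counted per namespace. The sketch below works on fabricated sample data; on a cluster, the input would come from something like oc get events -A -o custom-columns=NS:.metadata.namespace,REASON:.reason --no-headers. The namespaces and counts shown are illustrative assumptions.

```shell
# Fabricated sample event lines: "<namespace> <reason>".
events='
openshift-ingress-operator LeaderElection
openshift-dns-operator LeaderElection
openshift-dns-operator ScalingReplicaSet
openshift-ingress-operator LeaderElection
'

# Count LeaderElection events per namespace; a single election per operator
# is normal, but steadily growing counts across many operators are not.
printf '%s\n' "$events" | awk '$2 == "LeaderElection" { count[$1]++ }
END { for (ns in count) print ns, count[ns] }'
```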
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.