Potential etcd data inconsistency issue in OCP 4.9 and 4.10
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.9.0 through 4.9.32
- 4.10.0 through 4.10.12
Issue
- Red Hat has become aware of a potential data inconsistency issue present in etcd 3.5.0 through 3.5.2.
- OpenShift Container Platform (OCP) versions 4.9.0 through 4.9.32 and 4.10.0 through 4.10.12 utilize the affected versions of etcd.
- The following questions may arise as a result:
  - What should I do until a fix has been delivered?
  - We have upgraded from 4.8 to 4.9. Are we at risk of any etcd problems?
  - Does my cluster have inconsistent etcd data?
  - Are 4.9.z, 4.9 to 4.10, or 4.10.z updates impacted by this?
Resolution
- Starting with Red Hat OpenShift Container Platform 4.9.33 and 4.10.13, the fix for the root cause of the inconsistency is available. It is recommended to update to Red Hat OpenShift Container Platform 4.9.33 or 4.10.13 or later to prevent this known condition in earlier versions of etcd 3.5.
- With Red Hat OpenShift Container Platform 4.9.28 and 4.10.9, etcd is now configured to use the --experimental-initial-corrupt-check=true flag, which may prevent etcd members with inconsistent data from joining the etcd cluster. Please review "etcd pod is failing to start after updating to OpenShift Container Platform 4.9.28 or 4.10.9" for more information on what to do when a problematic etcd member is identified.
  - Note: While the above mitigation restores Red Hat OpenShift Container Platform operational capabilities (through detection), it does not address the root cause of the issue. As the root cause is only addressed in a later version of Red Hat OpenShift Container Platform 4, it is strongly recommended to update to at least Red Hat OpenShift Container Platform 4.9.28 or 4.10.9 to apply the currently available mitigation.
- If you encounter issues with these procedures, please contact Red Hat Technical Support or open a new Support Case in the Red Hat Customer Portal.
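To confirm that the mitigation is active, the corruption-check flag should appear on the etcd container's command line. The sketch below is illustrative only: the etcd command line is hard-coded as sample data, since on a live cluster you would obtain it with oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{..command}'.

```shell
# Sample etcd command line; on a real cluster, populate this from the
# output of: oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{..command}'
etcd_cmd='etcd --logger=zap --experimental-initial-corrupt-check=true'

# Grep for the mitigation flag introduced in OCP 4.9.28 / 4.10.9.
if printf '%s\n' "$etcd_cmd" | grep -q -- '--experimental-initial-corrupt-check=true'; then
  echo "corrupt check enabled"
else
  echo "corrupt check missing - consider updating to 4.9.28 / 4.10.9 or later"
fi
```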
Root Cause
- The issue occurs when the etcd process is shut down in an uncontrolled manner while operating under high load. During normal operations, clusters are at limited risk of encountering this issue.
- The kill -9 command and Out Of Memory (OOM) kills are canonical examples, as are OpenShift Container Platform control plane nodes running out of available memory or disk space.
  - Note: Controlled shutdowns do not result in data inconsistency.
- Red Hat recommends ensuring that you have sufficient memory / storage on control plane nodes to avoid this issue until other mitigation is available.
- Recommended control plane node size information can be found in the Control plane node sizing documentation.
- If necessary, scale up control plane hosts following procedures outlined in Solution 5597381.
- For more technical details, please refer to the upstream issue and the following bug: RHBZ#2068601.
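Since memory and disk pressure on control plane nodes are the main triggers, a simple headroom check can help catch the risk early. The sketch below is a hypothetical helper, not from the article: the usage value and the 80% threshold are assumptions for illustration, and on a real node the value would come from something like df --output=pcent /var/lib/etcd.

```shell
# Sample disk usage percentage; on a real control plane node, derive this
# from e.g.: df --output=pcent /var/lib/etcd | tr -dc '0-9'
used_pct=87

# Assumed warning threshold (not an official Red Hat value).
threshold=80

if [ "$used_pct" -gt "$threshold" ]; then
  echo "WARNING: etcd disk usage ${used_pct}% exceeds ${threshold}%"
else
  echo "disk usage OK"
fi
```

The same pattern can be applied to available memory to reduce the chance of OOM kills hitting the etcd process.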
Diagnostic Steps
- Check etcdctl endpoint status for the revision information and see if it differs between members.
  - One indicator of possible corruption is a large difference in revision number. It is normal for some etcd members to lag behind the leader by a few revisions (fewer than 10); if one gets behind, it should catch up to the leader. If a member's revision differs greatly from the leader and the other members, it could be a sign that divergence is starting.

      $ oc get pods -n openshift-etcd | grep etcd
      $ oc rsh -n openshift-etcd [etcd-pod]
      sh-4.2# etcdctl endpoint status -w fields --cluster

    The fields (or JSON) output format shows the revision.
- Another sign can be continually failing leader elections of operator pods, such as in the example below.

      error retrieving resource lock xxxxxx/fc7f2af9.xxx.com: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/xxxxx/leases/fc7f2af9.xxx.com": context deadline exceeded
      failed to renew lease xxxxx/fc7f2af9.xxx.com: timed out waiting for the condition

- Note: Please check the events on ConfigMaps and/or Leases (look for "reason": "LeaderElection"), as these can provide indicators that etcd isn't behaving properly.
- Note: Leader elections are expected with operators due to how control leases work. However, this shouldn't be happening constantly with a large number of operators at the same time.
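To spot operators that are constantly re-electing, the LeaderElection events can be counted per namespace. The sketch below works on fabricated sample data; on a cluster, the input would come from something like oc get events -A -o custom-columns=NS:.metadata.namespace,REASON:.reason --no-headers. The namespaces and counts shown are illustrative assumptions.

```shell
# Fabricated sample event lines: "<namespace> <reason>".
events='
openshift-ingress-operator LeaderElection
openshift-dns-operator LeaderElection
openshift-dns-operator ScalingReplicaSet
openshift-ingress-operator LeaderElection
'

# Count LeaderElection events per namespace; a single election per operator
# is normal, but steadily growing counts across many operators are not.
printf '%s\n' "$events" | awk '$2 == "LeaderElection" { count[$1]++ }
END { for (ns in count) print ns, count[ns] }'
```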
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.