Consolidated Article for etcd guidelines with OpenShift Container Platform 4


Disclaimers:
The following guidelines are based on official Red Hat documentation and KB articles, and are intended to provide information to accomplish the system's needs. They must be reviewed, monitored and revisited by customers and partners according to their own business requirements, application workloads and new demands.

Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or its entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.


Abstract

etcd is the key-value store for Red Hat OpenShift Container Platform and Kubernetes, which persists the state of all resource objects. It is a critical component in the OpenShift Container Platform control plane.

It is also a CNCF project at the Graduated maturity level, which, according to CNCF, means it is considered stable and ready to be used in production.

Despite its current stability for production systems, sizing and monitoring your control plane nodes is key to successful Red Hat OpenShift Container Platform installations.

This article is intended to summarize Red Hat's guidelines for etcd implementation and to index the most important Red Hat KB solutions and articles on this subject.

Guidelines

  • Leverage the embedded metrics, dashboards and alerts delivered with the Red Hat OpenShift Container Platform monitoring stack.
    • Keeping systems under the pre-built recommended thresholds helps keep clusters in a healthy state.
    • A single metric will not give the full picture of the system. Additional metrics and cluster state, gathered through logs and API/CLI calls, must also be taken into consideration when analyzing an RHOCP cluster's health.
    • In addition to the embedded dashboards, Red Hat OpenShift Container Platform allows you to query metrics with the Prometheus Query Language (PromQL) for advanced troubleshooting purposes (see the example queries in the Metrics section below).
  • Each workload has a different effect on etcd. As new workloads are added to RHOCP clusters, the capacity of the RHOCP control plane nodes and the effect on etcd must be reevaluated.
    • How much and how often systems exceed the defined thresholds, during installation or during daily operations, is unpredictable.
  • Do not share control plane / etcd drives with non-control plane workloads. Although it is a supported configuration, control plane drives must not coexist with applications or other infrastructure components.
    • Support for a configuration does not guarantee that it performs well. For example, during support activities, if Red Hat Support detects that etcd is underperforming or is negatively impacted, Red Hat Support will make a clear statement to discuss options. Customers and partners must reproduce the issue in a suggested configuration with etcd in a healthy state. In these situations, Red Hat provides support only on a commercially reasonable basis: customers and partners must resolve the underlying infrastructure bottleneck, while Red Hat provides only general advice by using our documentation, articles, and solutions.
  • Use dedicated SSD/NVMe drives for master/control plane functions. Depending on your workload, you may also need a dedicated SSD/NVMe drive for etcd.
  • etcd's backend storage device requires 1500-2000 sequential IOPS, with appropriate response times, for normal operations, but may require more IOPS for heavily loaded clusters as etcd holds more objects.
    • See the I/O Metrics entry in the Metrics section for response time metrics and KB 6271341 for additional troubleshooting details.
    • etcd is a write-intensive workload. NVMe and write-intensive SSD drives are the recommended drive types for this kind of workload.
  • Size the platform according to application requirements, workload, and following Red Hat recommendations.
  • Benchmark tools like fio and etcd-perf can be used as a baseline for the first stage of a sizing study (see the example below), but they are short tests executed at a specific moment and will not give an overall view of how performance behaves over several days. In a live cluster, only the metrics will confirm whether the hardware in place is suitable for your workload.
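
For example, a common fio test models etcd's WAL fsync pattern with small sequential writes and one fdatasync per write. The command below is a minimal sketch, assuming it is run directly on a control plane node and that /var/lib/etcd sits on the drive you want to test; the target directory and size are illustrative, and the test data should be removed afterwards:

$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-fsync-test

In the resulting fsync latency percentiles, the 99th percentile should stay below the 10ms WAL fsync threshold discussed in the Metrics section.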

Metrics

  • Monitor leadership changes:
    • Leadership changes are expected as a result of the installation/upgrade process or day-1/day-2 operations (for example, as a result of Machine Config daemon operations), but we do not expect to see them during normal operations.
    • The etcdHighNumberOfLeaderChanges alert can help identify that situation.
    • A Prometheus query can also be used (sum(rate(etcd_server_leader_changes_seen_total[2m]))).
    • If leadership changes happen during normal operation, I/O and network metrics can help identify the root cause.
  • I/O metrics (see the example PromQL queries after this list):
    • etcd_disk_backend_commit_duration_seconds_bucket: the p99 duration should be less than 25ms.
    • etcd_disk_wal_fsync_duration_seconds_bucket: the p99 duration should be less than 10ms.
  • Network metrics:
    • etcd_network_peer_round_trip_time_seconds_bucket: the p99 duration should be less than 50ms.
      • Network RTT latency: high network latency and packet drops can lead to an unreliable etcd cluster state, so network health values (RTT and packet drops) should be monitored.
  • etcd can also suffer poor performance if the keyspace grows excessively large and exceeds the space quota.
    • Some key metrics to monitor are:
      • etcd_server_quota_backend_bytes which is the current quota limit.
      • etcd_mvcc_db_total_size_in_use_in_bytes which indicates the actual database usage after a history compaction.
      • etcd_debugging_mvcc_db_total_size_in_bytes which shows the database size including free space waiting for defragmentation.
    • An etcd database can grow up to 8 GB.
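
The thresholds above can be checked from the metrics page of the OpenShift web console or through any Prometheus-compatible client. The queries below are a sketch of how the p99 values and the database usage can be expressed in PromQL; the 5m rate window is an assumption and can be adjusted to your needs:

# p99 backend commit duration (expected < 25ms)
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))

# p99 WAL fsync duration (expected < 10ms)
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))

# p99 peer round trip time (expected < 50ms)
histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) by (le))

# current database size in use and the configured quota
etcd_mvcc_db_total_size_in_use_in_bytes
etcd_server_quota_backend_bytes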

Alerts

etcd Documentation

etcd logs

Cluster Operators State

In this particular example, this cluster is in a healthy state, and etcd has been running properly for 12 days.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.41    True        False         12d     Cluster version is 4.8.41

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.41    True        False         False      4d
baremetal                                  4.8.41    True        False         False      12d
cloud-credential                           4.8.41    True        False         False      12d
cluster-autoscaler                         4.8.41    True        False         False      12d
config-operator                            4.8.41    True        False         False      12d
console                                    4.8.41    True        False         False      5d
csi-snapshot-controller                    4.8.41    True        False         False      12d
dns                                        4.8.41    True        False         False      12d
etcd                                       4.8.41    True        False         False      12d <<<< ###### etcd cluster operator
image-registry                             4.8.41    True        False         False      12d
ingress                                    4.8.41    True        False         False      12d
insights                                   4.8.41    True        False         False      12d
kube-apiserver                             4.8.41    True        False         False      5d
kube-controller-manager                    4.8.41    True        False         False      12d
kube-scheduler                             4.8.41    True        False         False      12d
kube-storage-version-migrator              4.8.41    True        False         False      12d
machine-api                                4.8.41    True        False         False      12d
machine-approver                           4.8.41    True        False         False      12d
machine-config                             4.8.41    True        False         False      12d
marketplace                                4.8.41    True        False         False      12d
monitoring                                 4.8.41    True        False         False      21h
network                                    4.8.41    True        False         False      12d
node-tuning                                4.8.41    True        False         False      12d
openshift-apiserver                        4.8.41    True        False         False      5d
openshift-controller-manager               4.8.41    True        False         False      11d
openshift-samples                          4.8.41    True        False         False      12d
operator-lifecycle-manager-catalog         4.8.41    True        False         False      12d
operator-lifecycle-manager-packageserver   4.8.41    True        False         False      12d
operator-lifecycle-manager                 4.8.41    True        False         False      12d
service-ca                                 4.8.41    True        False         False      12d
storage                                    4.8.41    True        False         False      12d

$ oc get co etcd -o yaml|grep -A 4 lastTransitionTime
  - lastTransitionTime: "2022-04-19T22:33:10Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      etcdMembersDegraded: No unhealthy members found
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:40:50Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 3
      etcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:30:41Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
      etcdMembersAvailable: 3 members are available
    reason: AsExpected
--
  - lastTransitionTime: "2022-04-19T22:28:11Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable

etcd heartbeat

As etcd uses a leader-based consensus protocol for consistent data replication and log execution, it relies on a heartbeat mechanism to keep the cluster members in a healthy state.

If etcd is logging messages like failed to send out heartbeat on time, it means that your etcd cluster is facing instabilities and slow response times, which may cause unexpected leadership changes that directly affect the RHOCP control plane.

Here is a sample from etcd logs:

2022-07-06 14:48:38.562412 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 34.328790ms)

This issue may have multiple causes, but it is most commonly triggered by slow or shared backend storage devices. CPU and networking overload on the etcd members can also cause it, although less frequently. You can also review the upstream etcd FAQ for additional information.

Healthy etcd clusters must not have any of these events in the logs (the non-zero counts in the sample output below come from a cluster affected by the issue):

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|grep -v guard|while read POD line; do echo "### POD $POD"; oc logs $POD -c etcd -n openshift-etcd| grep 'failed to send out heartbeat on time'| wc -l; done 
### POD etcd-ocp-master01.local.ocp
32
### POD etcd-ocp-master02.local.ocp
38
### POD etcd-ocp-master03.local.ocp
26

etcd is likely overloaded

When etcd is overloaded, the heartbeat messages are followed by server is likely overloaded:

2022-07-06 17:43:49.342412 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 34.328790ms)
2022-07-06 17:43:49.342430 W | etcdserver: server is likely overloaded

As with the heartbeat messages, healthy etcd clusters must not have any of these events in the logs:

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|while read POD line; do oc logs $POD -c etcd -n openshift-etcd| grep "server is likely overloaded"| wc -l; done
0
0
0

etcd warning “apply entries took too long”

As required by its consensus protocol implementation, after a majority of etcd members agree to commit a request, each etcd server applies the request to its data store and persists the result to disk. If the average request duration exceeds 100 milliseconds, etcd logs apply request took too long warnings:

2022-06-20T17:51:50.325433212Z {"level":"warn","ts":"2022-06-20T17:51:50.325Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"232.747298ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/namespaces/default\" keys_only:true ","response":"range_response_count:1 size:54"}

This issue is most commonly triggered by slow or shared backend storage devices but, less frequently, can also be caused by CPU overload.

It may be observed during upgrades or machine config operations but, during normal operations, there should ideally be no such messages. In that case, additional metrics and log events must be taken into consideration as well.

Ideally, healthy clusters should not have any of these events in the logs:

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|while read POD line; do oc logs $POD -c etcd -n openshift-etcd| grep 'took too long'| wc -l; done
0
0
0

etcd clock difference

When clocks are out of sync with each other, they cause I/O timeouts and failing liveness probes, which makes the etcd pods restart frequently. The following log entries can be observed in that situation:

2021-09-24T06:39:16.408674158Z 2021-09-24 06:39:16.408617 W | rafthttp: the clock difference against peer ab37b46865cdfbf7 is too high [4m18.466926704s > 1s]
2021-09-24T06:39:16.465279570Z 2021-09-24 06:39:16.465225 W | rafthttp: the clock difference against peer ab37b46865cdfbf7 is too high [4m18.463381838s > 1s]

Make sure that chrony is enabled, running, and in sync, by checking chronyc sources and chronyc tracking (see the example below).
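
For example, chrony status can be checked from each control plane node through a debug pod. This is a minimal sketch; the node name is taken from the examples above and is illustrative:

$ oc debug node/ocp-master01.local.ocp -- chroot /host chronyc tracking
$ oc debug node/ocp-master01.local.ocp -- chroot /host chronyc sources -v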

Healthy etcd clusters must not have any of these events in the logs:

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|while read POD line; do oc logs $POD -c etcd -n openshift-etcd| grep 'clock difference'| wc -l; done
0
0
0

etcd database space exceeded

Without periodically compacting this history (e.g., by setting --auto-compaction), etcd will eventually exhaust its storage space. If etcd runs low on storage space, it raises a space quota alarm to protect the cluster from further writes. While the alarm is raised, etcd responds to write requests with the error mvcc: database space exceeded.

In RHOCP 4.x, history compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.

Starting in RHOCP 4.9.z, defragmentation occurs automatically.

To recover from this issue in earlier versions, you can trigger defragmentation manually (see the sketch below).
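
The following is a minimal sketch of a manual defragmentation run from inside one etcd pod at a time; the pod name is illustrative, and the full supported procedure (including defragmenting the leader last and disarming a possible NOSPACE alarm) is described in the official documentation:

$ oc rsh -n openshift-etcd etcd-ocp-master01.local.ocp
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
sh-4.4# etcdctl endpoint status -w table --cluster
sh-4.4# etcdctl alarm list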

Healthy etcd clusters must not have any of these events in the logs:

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|while read POD line; do oc logs $POD -c etcd -n openshift-etcd| grep 'database space exceeded'| wc -l; done
0
0
0

etcd leadership changes and failures

Leadership changes are expected only during installations, upgrades or machine config operations. During day-to-day operations, they should not occur.

Here is an example of this warning:

2020-09-15 01:33:49.736399 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took too long (1.879359619s) to execute
2020-09-15 01:33:49.736422 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:etcdserver: leader changed" took too long (4.851652166s) to execute

During a leader election the cluster cannot process any writes. Write requests sent during the election are queued for processing until a new leader is elected. Until a new leader is elected, you will observe instabilities, slow response times and unexpected behavior affecting the RHOCP control plane.

This issue is a side effect of the previous events described in this article and may have multiple contributing factors involved.

Healthy etcd clusters must not have any of these events in the logs during normal operations:

$ oc get pods -n openshift-etcd|grep etcd|grep -v quorum|while read POD line; do oc logs $POD -c etcd -n openshift-etcd| grep 'leader changed'| wc -l; done
0
0
0