Backend performance requirements for OpenShift etcd


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3.11
    • 4
  • etcd

Issue

  • etcd performance can be impacted by poor storage and network performance, causing multiple errors:

    $ oc logs --follow=true etcd-ocp4-9wwcf-master-0 -c etcd -n openshift-etcd
    ...
    etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
    etcdserver: server is likely overloaded
    etcdserver: read-only range request "key:\"xxxx" count_only:true " with result "xxxx" took too long (xxx s) to execute
    etcdserver: read-only range request "key:\"xxxx" count_only:true " with result "xxxx" took too long (xxxx ms) to execute
    etcdserver: read-only range request "xxxx" with result "xxxx" took too long (xxx ms) to execute
    wal: sync duration of xxxx s, expected less than 1s
    

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Most commonly, issues with etcd occur as a result of one (or several) of the following:

  • Slow storage
  • CPU overload
  • etcd database size growth.

Applying a request should normally take fewer than 50 milliseconds. If the average apply duration exceeds 200 milliseconds, etcd warns that entries are taking too long to apply (the "took too long" messages in the logs).
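As a minimal sketch of the thresholds described above, the following shell function classifies an average apply duration given in milliseconds. The function name and the three labels are illustrative assumptions. Only the 50 ms and 200 ms limits come from this article.

```shell
# Classify an average apply duration given in milliseconds.
# The 50 ms and 200 ms limits come from the paragraph above;
# the function name and the labels are illustrative only.
classify_apply_ms() {
  awk -v ms="$1" 'BEGIN {
    if (ms + 0 < 50)        print "normal"
    else if (ms + 0 <= 200) print "elevated"
    else                    print "warning"
  }'
}

classify_apply_ms 12
classify_apply_ms 120
classify_apply_ms 350
```

The middle label covers durations that are above the normal range but have not yet reached the point where etcd logs a warning.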

etcd metrics

The recommended way to check how etcd performance behaves over time is to review the metrics that etcd exposes. Some examples:
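The sketch below is not taken from the original article. It prints PromQL expressions for the 99th percentile of three latency histograms that etcd exposes: WAL fsync, backend commit, and peer round-trip time. The helper name, the 5m default window, and the grouping by the instance label are assumptions chosen for illustration. The resulting expressions can be pasted into the Prometheus query page.

```shell
# Print a PromQL expression for the 99th percentile of an etcd latency histogram.
#   $1: histogram name without the _bucket suffix
#   $2: rate window (optional, defaults to 5m; an assumption, adjust as needed)
etcd_p99_query() {
  printf 'histogram_quantile(0.99, sum by (instance, le) (rate(%s_bucket[%s])))\n' "$1" "${2:-5m}"
}

# Disk latency seen by the write-ahead log and by backend commits.
etcd_p99_query etcd_disk_wal_fsync_duration_seconds
etcd_p99_query etcd_disk_backend_commit_duration_seconds
# Network latency between etcd members.
etcd_p99_query etcd_network_peer_round_trip_time_seconds
```

Upstream etcd guidance commonly cites a 99th percentile below roughly 10 ms for WAL fsync and below roughly 25 ms for backend commit on healthy storage. Treat those figures as a reference point from outside this article, not as limits it states.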

etcd database

For database size-related issues please refer to:

Additional information

Note: Performance measurements can significantly affect cluster health when performance issues already exist, so run these tests with care on production workloads. Non-intrusive measurements can be obtained from the exposed etcd metrics.

Refer to the article on etcd guidelines with OpenShift Container Platform 4 for additional information. More details about etcd performance can be found in the upstream documentation: etcd performance FAQ.

Disk performance troubleshooting with fio

Detailed information about using the fio tool to investigate etcd performance can be found in the following articles:

IMPORTANT NOTE: The fio test is a short test executed at a specific moment. It can show whether the disk is fast enough to meet the etcd requirements, but other loads on the disk can affect etcd performance over the long term and cause it to misbehave, so fio results should not be trusted on their own. Instead, check the etcd metrics over several hours or even days to learn how etcd really behaves over time, as explained in how to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift.
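For orientation only, the sketch below builds an fio command line of the kind commonly used upstream to approximate etcd's write-ahead log pattern: small sequential writes, each followed by fdatasync. It is not taken from the articles referenced above. The function only prints the command and does not run it, so it can be reviewed before anything touches a production disk. The function name, the job name, the 22m size, the 2300-byte block size, and the example directory are assumptions.

```shell
# Print (but do not run) an fio command that writes small sequential blocks
# and calls fdatasync after every write, similar to etcd's WAL behavior.
#   $1: target directory on the disk under test (must already exist);
#       defaults to the hypothetical ./test-data
fio_etcd_cmd() {
  printf 'fio --rw=write --ioengine=sync --fdatasync=1 --directory=%s --size=22m --bs=2300 --name=etcd-wal-test\n' "${1:-test-data}"
}

# Hypothetical target directory on the etcd disk.
fio_etcd_cmd /var/lib/etcd/fio-test
```

When the printed command is run, the figures of interest are the fsync/fdatasync latency percentiles in the fio report. As the note above says, a single run reflects only the moment it was taken.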

Root Cause

Clustered etcd is extremely sensitive to storage and network backend performance, and can be easily disrupted by any underlying bottlenecks.

Diagnostic Steps

Check etcd logs for the following messages:

$ oc logs --follow=true etcd-ocp4-9wwcf-master-0 -c etcd -n openshift-etcd
...
etcdserver: failed to send out heartbeat on time
etcdserver: server is likely overloaded
wal: sync duration of xxxx s, expected less than 1s

etcd logs can be viewed either from the OpenShift web console or with the oc logs command.

  • OpenShift Container Platform 3.11: etcd is located in the kube-system project.
  • OpenShift Container Platform 4.x: etcd is located in the openshift-etcd project.
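
To get a quick overview instead of reading the log line by line, the messages listed above can be counted. The following is a minimal sketch: the function name and output labels are illustrative, and the patterns match the message text as shown in this article, which may be worded differently in other etcd versions.

```shell
# Read etcd log text on stdin and print how often each warning appears.
# Patterns follow the message text quoted in this article.
count_etcd_warnings() {
  awk '
    /failed to send out heartbeat on time/ { hb++ }
    /server is likely overloaded/          { ov++ }
    /took too long/                        { slow++ }
    /wal: sync duration of/                { wal++ }
    END {
      printf "heartbeat: %d\noverloaded: %d\ntook_too_long: %d\nwal_sync: %d\n", hb, ov, slow, wal
    }'
}

# Sample input using the messages from the diagnostic steps above.
printf '%s\n' \
  'etcdserver: failed to send out heartbeat on time' \
  'etcdserver: server is likely overloaded' \
  'wal: sync duration of xxxx s, expected less than 1s' |
  count_etcd_warnings
```

On a cluster, the same function can read live data, for example: oc logs etcd-ocp4-9wwcf-master-0 -c etcd -n openshift-etcd | count_etcd_warnings (the pod name is the example used in this article).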

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.