Backend performance requirements for OpenShift etcd
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 3.11
- 4
- etcd
Issue
etcd performance can be impacted by poor storage and network performance, causing multiple errors:

$ oc logs --follow=true etcd-ocp4-9wwcf-master-0 -c etcd -n openshift-etcd
...
etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
etcdserver: server is likely overloaded
etcdserver: read-only range request "key:\"xxxx" count_only:true " with result "xxxx" took too long (xxx s) to execute
etcdserver: read-only range request "key:\"xxxx" count_only:true " with result "xxxx" took too long (xxxx ms) to execute
etcdserver: read-only range request "xxxx" with result "xxxx" took too long (xxx ms) to execute
wal: sync duration of xxxx s, expected less than 1s
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Most commonly, issues with etcd occur as a result of one (or several) of the following:
- Slow storage
- CPU overload
- etcd database size growth
Applying a request should normally take less than 50 milliseconds. If the average apply duration exceeds 200 milliseconds, etcd warns that entries are taking too long to apply ("took too long" messages in the logs).
etcd metrics
The recommended way to check etcd performance behavior over time is to check the exposed etcd metrics. Some examples:
- To rule out a slow disk as the cause of etcd warnings, monitor the metrics etcd_disk_backend_commit_duration_seconds_bucket (99th percentile (p99) duration should be less than 25ms) and etcd_disk_wal_fsync_duration_seconds_bucket (p99 duration should be less than 10ms) to confirm the storage is reasonably fast.
- The overall etcd cluster latency comes from two sources:
  - Network RTT latency: high network latency and packet drops can leave the etcd cluster in an unreliable state, so network health values (RTT and packet drops) should be monitored. Monitor the metric etcd_network_peer_round_trip_time_seconds_bucket (p99 duration should be less than 50ms).
  - Storage I/O latency: investigate the storage performance using the fio tool. See below for additional information.
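The p99 thresholds above refer to Prometheus histogram metrics. As an illustration of what that check means, the sketch below estimates a p99 from cumulative histogram buckets the same way `histogram_quantile()` does; the bucket counts are hypothetical sample values, not real cluster output.

```python
# Sketch: estimate the p99 from Prometheus-style cumulative histogram buckets.
# The bucket data below is hypothetical, for illustration only.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical etcd_disk_wal_fsync_duration_seconds_bucket samples.
wal_fsync = [(0.001, 400), (0.002, 900), (0.004, 980), (0.008, 998), (0.016, 1000)]
p99 = histogram_quantile(0.99, wal_fsync)
print(f"wal fsync p99: {p99 * 1000:.1f} ms (target: < 10 ms)")
```

The same calculation applies to the backend commit (< 25ms) and peer round-trip (< 50ms) buckets, only the threshold changes.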
- Refer to how to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift.
- Additional information about these and other etcd metrics can be found in recommended etcd practices.
- More information about general OpenShift metrics can be found in the documentation, section: Cluster Monitoring.
etcd database
For database size-related issues please refer to:
- Defragmenting etcd data (OCP 4 documentation)
- How to defrag etcd to decrease DB size in OpenShift 3
- How to defrag etcd to decrease DB size in OpenShift 4
Additional information
Note: performance measurement may have a significant impact on cluster health if performance issues already exist, so proceed with these tests with care on production workloads. Non-intrusive measurements can be obtained from the exposed etcd metrics.
Refer to the article for etcd guidelines with OpenShift Container Platform 4 for additional information. More details about etcd performance can be found in the upstream etcd performance FAQ.
Disk performance troubleshooting with fio
Detailed information about using the fio tool for etcd performance investigation can be found in the following articles:
IMPORTANT NOTE: The fio test is a short test executed at a specific moment. It can show whether the disk is fast enough to support the etcd requirements, but other loads on the same disk can still degrade etcd performance in the long term and cause it to misbehave, so it is not recommended to trust fio results alone. Instead, check the etcd metrics for several hours or even days to learn the real etcd behavior over a longer period, as explained in how to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift.
- Using fio to tell whether storage is fast enough for etcd (external article on www.ibm.com)
- How to Use fio to Check etcd Disk Performance in OCP
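What the fio test measures for etcd is essentially fdatasync latency on small sequential writes. The sketch below is a minimal, illustrative stand-in for that check: it mirrors the block size commonly cited for the fio etcd test (2300-byte writes, each followed by fdatasync) and reports the p99 sync latency. It is an assumption-laden toy, not a replacement for fio or for long-term etcd metrics.

```python
# Sketch: minimal fdatasync latency probe, loosely mirroring the fio etcd
# test (sequential 2300-byte writes, fdatasync after each write).
# Illustrative only; use fio and the etcd metrics for real measurements.
import os
import tempfile
import time

def fdatasync_p99(block_size=2300, writes=1000, directory=None):
    """Return the p99 fdatasync latency in seconds.

    Point directory= at the filesystem backing etcd to probe it;
    the default uses the system temporary directory.
    """
    latencies = []
    payload = b"\0" * block_size
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        fd = f.fileno()
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fdatasync(fd)  # etcd WAL durability depends on this call
            latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]

if __name__ == "__main__":
    p99 = fdatasync_p99()
    print(f"fdatasync p99: {p99 * 1000:.2f} ms (etcd target: < 10 ms)")
```

As with fio, a single run only captures one moment in time; combine it with the histogram metrics discussed above before drawing conclusions.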
Root Cause
Clustered etcd is extremely sensitive to storage and network backend performance, and can be easily disrupted by any underlying bottlenecks.
Diagnostic Steps
Check etcd logs for the following messages:
$ oc logs --follow=true etcd-ocp4-9wwcf-master-0 -c etcd -n openshift-etcd
...
etcdserver: failed to send out heartbeat on time
etcdserver: server is likely overloaded
wal: sync duration of xxxx s, expected less than 1s
etcd logs can be viewed either from the OpenShift web console or using the oc logs command-line tool.
- OpenShift Container Platform 3.11: etcd is located in the kube-system project.
- OpenShift Container Platform 4.x: etcd is located in the openshift-etcd project.
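When scanning large amounts of etcd log output for the messages quoted above, a small filter script can help. The patterns below come from the log excerpts in this article; the sample lines are illustrative stand-ins for real `oc logs` output.

```python
# Sketch: flag etcd log lines matching the slow-storage/overload warnings
# quoted in this article. Sample input lines are illustrative.
import re

WARNING_PATTERNS = [
    re.compile(r"failed to send out heartbeat on time"),
    re.compile(r"server is likely overloaded"),
    re.compile(r"took too long .* to execute"),
    re.compile(r"sync duration of .* expected less than"),
]

def etcd_warnings(log_lines):
    """Return the lines matching any known performance-warning pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in WARNING_PATTERNS)]

sample = [
    "etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 25 ms)",
    "etcdserver: start to snapshot (applied: 12345)",
    "wal: sync duration of 2.1s, expected less than 1s",
]
for line in etcd_warnings(sample):
    print("WARNING:", line)
```

In practice the input would be piped from `oc logs <etcd-pod> -c etcd -n openshift-etcd` rather than a hard-coded list.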
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.