What does the etcd warning "failed to send out heartbeat on time" mean?
Environment
- Red Hat OpenShift Container Platform (OCP) 3.10
- Red Hat OpenShift Container Platform (OCP) 3.11
- Red Hat OpenShift Container Platform (OCP) 4.x
Issue
- The etcd pod logs show the following warnings:

  W | etcdserver: server is likely overloaded.
  W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 78.009036ms)
Resolution
- There are several possible reasons for these warnings:
- Typically, these warnings are caused by slow disk I/O. Before the etcd leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention between etcd and other applications, or the disk may simply be too slow (e.g., a shared virtualized disk). To rule slow disk I/O in or out, monitor the following etcd metrics:
  - wal_fsync_duration_seconds: the p99 duration should be less than 10ms (0.01s). wal_fsync is called when etcd persists its log entries to disk before applying them. If this metric shows a higher value, disk I/O is too slow; assigning a dedicated disk to etcd, or using a faster disk type, will typically resolve the problem.
  - backend_commit_duration_seconds: the p99 duration should be less than 25ms (0.025s). A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.
  To rule out a slow disk, the iostat tool can be used to capture the I/O metrics of the underlying disk. For more details on iostat, please refer to the iostat article. In OpenShift Container Platform 4.x, there is also the etcdctl check perf command to check the performance of the underlying I/O.
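As an illustrative sketch of the disk checks above (assuming Prometheus scrapes etcd under the standard etcd_disk_* metric names, and that iostat is available on the node), something like the following could be used. The iostat sample below is fabricated for demonstration; real iostat -x output has more columns, so the awk field index must be adjusted to match.

```shell
# PromQL queries for the two etcd disk metrics (run in the monitoring console):
#   histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))      # should stay < 0.01
#   histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) # should stay < 0.025

# On the node, sample extended disk statistics every 2 seconds:
#   iostat -x 2
# Quick check of the average write latency (w_await, in ms) from a captured
# sample. This sample is illustrative; real output comes from the command above,
# and its w_await column position differs from the trimmed layout shown here.
iostat_sample='Device            r/s     w/s   w_await  %util
sda              0.50  120.00     45.30   98.10'
echo "$iostat_sample" | awk 'NR>1 && $4 > 10 { printf "%s: w_await %.1fms exceeds 10ms\n", $1, $4 }'
```

A sustained w_await well above the single-digit-millisecond range on the etcd disk points at the storage layer rather than etcd itself.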
- The second most common cause is CPU starvation. If monitoring of the machine's CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Increasing the amount of CPU assigned to the OpenShift Container Platform master node running etcd usually resolves the issue. CPU pressure can be observed in the sar or top output of the affected node.
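As a sketch of the CPU check, sar -u reports per-interval utilization; flagging samples where the idle column drops below a threshold (20% here, an arbitrary example value) highlights starvation. The sar output below is fabricated for demonstration:

```shell
# Capture CPU utilization on the affected master node (5 samples, 2s apart):
#   sar -u 2 5
# Flag samples where %idle (the last column of sar -u) drops below 20%,
# i.e. more than 80% of CPU is in use. Sample data is illustrative.
sar_sample='12:00:01 CPU %user %nice %system %iowait %steal %idle
12:00:03 all 55.20 0.00 30.10 8.40 0.00 6.30
12:00:05 all 20.00 0.00 10.00 2.00 0.00 68.00'
echo "$sar_sample" | awk 'NR>1 && $8 < 20 { printf "%s: only %.1f%% idle\n", $1, $8 }'
```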
- A slow network connection can also cause this issue. If network metrics among the etcd machines show increased latencies or a high packet drop rate, there may not be enough network capacity for etcd. Moving etcd members to a less congested network will typically solve the problem.
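A quick way to sanity-check inter-member latency is a plain ping between the master nodes; etcd generally expects single-digit-millisecond round-trip times between members. The summary line below is a fabricated example of ping's standard output format:

```shell
# From one master, measure round-trip time to a peer etcd member:
#   ping -c 20 <peer-master-hostname>
# Parse the average RTT from ping's "rtt min/avg/max/mdev" summary line.
# The summary shown here is illustrative sample data.
ping_summary='rtt min/avg/max/mdev = 0.512/12.847/40.120/9.334 ms'
avg=$(echo "$ping_summary" | awk -F'/' '{ print $5 }')
echo "average RTT: ${avg} ms"
```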
- These warnings can also be observed when an excessive number of objects is stored in etcd. Please refer to the solution describing how to check the number of objects stored in etcd.
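One way to get a feel for what is stored in etcd is to list keys and count them per top-level prefix. A hypothetical sketch, assuming etcdctl v3 access from inside an etcd pod and the default /kubernetes.io key prefix (which may differ in a given cluster); the key list below is fabricated sample data standing in for the real etcdctl output:

```shell
# Inside an etcd pod, list all keys under the Kubernetes prefix:
#   etcdctl get /kubernetes.io --prefix --keys-only
# Then count keys per resource type (the third path segment) to see
# which object kinds dominate. Illustrative sample keys:
keys='/kubernetes.io/pods/ns1/pod-a
/kubernetes.io/pods/ns1/pod-b
/kubernetes.io/events/ns1/evt-1'
echo "$keys" | awk -F'/' '{ print $3 }' | sort | uniq -c | sort -rn
```

An unusually large count for one resource type (events are a common offender) indicates where cleanup or tuning should start.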
Root Cause
- etcd must persist proposals to its log, so disk activity from other processes may cause high fsync latencies. As a result, etcd may miss heartbeats, causing request timeouts and temporary leader loss.
- High disk operation latencies (wal_fsync_duration_seconds or backend_commit_duration_seconds) often indicate disk issues.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.