What does the etcd warning "failed to send out heartbeat on time" mean?


Environment

  • Red Hat OpenShift Container Platform (OCP) 3.10
  • Red Hat OpenShift Container Platform (OCP) 3.11
  • Red Hat OpenShift Container Platform (OCP) 4.x

Issue

  • etcd pod logs show the following warnings:

    W | etcdserver: server is likely overloaded.
    W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 78.009036ms)
    

Resolution

  • There are multiple possible reasons for the above warnings:

    1. Typically, these warnings are caused by slow disk I/O. Before the etcd leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk may be under contention between etcd and other applications, or it may simply be too slow (for example, a shared virtualized disk). To rule out slow disk I/O as the cause of this warning, monitor the following etcd metrics:

      wal_fsync_duration_seconds: the p99 duration should be less than 10ms (0.01s). wal_fsync is called when etcd persists its log entries to disk before applying them. If this metric shows a higher value, the disk I/O is too slow; assigning a dedicated disk to etcd or using faster disk types typically resolves the problem.

      backend_commit_duration_seconds: the p99 duration should be less than 25ms (0.025s). backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.

      To rule out a slow disk, the iostat tool can be used to capture the I/O metrics of the underlying disk. For more details on iostat, please refer to the iostat article. In OpenShift Container Platform 4.x, there is also the etcdctl check perf command to check the performance of the underlying I/O.
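      As a quick way to interpret these metrics, the cumulative histogram buckets that etcd exposes on its /metrics endpoint can be checked against the 10ms target. The sketch below uses hypothetical sample bucket values in place of a live scrape; with a real cluster the same lines would come from querying the etcd metrics endpoint:

```shell
# Hypothetical sample of etcd's cumulative fsync histogram; on a live
# cluster these lines would be scraped from the etcd /metrics endpoint.
metrics='etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 900
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 995
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.1"} 1000
etcd_disk_wal_fsync_duration_seconds_count 1000'

# p99 is under 10ms if at least 99% of all samples fall in the le="0.01"
# bucket (the histogram is cumulative, so the bucket counts all faster syncs).
echo "$metrics" | awk '
  /le="0.01"/ { under = $2 }
  /_count/    { total = $2 }
  END { printf "p99_under_10ms=%s\n", (under / total >= 0.99) ? "yes" : "no" }'
```

      With the sample values above (995 of 1000 syncs under 10ms) the check prints p99_under_10ms=yes; a "no" would point at the slow-disk cause described in this step.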

    2. The second most common cause is CPU starvation. If monitoring of the machine's CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Increasing the amount of CPU assigned to the OpenShift Container Platform master node running etcd usually resolves the issue. The utilization can be observed in the sar or top output of the affected node.
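      As a rough illustration of such a check, the sketch below computes overall CPU utilization from two /proc/stat samples taken one second apart. sar -u or top report the same figure; this simplified variant ignores iowait and irq time and needs only the shell:

```shell
# Read the aggregate "cpu" line from /proc/stat twice, one second apart.
read -r cpu user nice system idle rest < /proc/stat
sleep 1
read -r cpu2 user2 nice2 system2 idle2 rest2 < /proc/stat

# Busy ticks spent in user, nice, and system mode during the interval;
# total = busy + idle ticks (iowait/irq are ignored in this rough sketch).
busy=$(( (user2 + nice2 + system2) - (user + nice + system) ))
total=$(( busy + (idle2 - idle) ))
pct=$(( 100 * busy / total ))
echo "cpu_busy_percent=${pct}"
```

      Sustained values close to 100 on the master node suggest the CPU starvation described in this step.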

    3. A slow network connection can also cause this issue. If network metrics among the etcd machines show increased latency or a high packet drop rate, there may not be enough network capacity for etcd. Moving etcd members to a less congested network will typically solve the problem.
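      A quick way to gauge inter-member latency is to ping the other etcd members and compare the average round-trip time against a conservative budget (here 50ms, an assumed threshold well below the heartbeat timeout in the log above). The summary line below is a hypothetical sample standing in for real ping -c output:

```shell
# Hypothetical sample of the summary line printed by 'ping -c 10 <peer>';
# on a live cluster, run ping against each of the other etcd members.
ping_summary='rtt min/avg/max/mdev = 0.412/0.534/0.719/0.112 ms'

# Split on '/', '=', and spaces: min is field 6, avg field 7, max field 8.
echo "$ping_summary" | awk -F'[/= ]+' '
  { avg = $7 }
  END { printf "avg_rtt_ms=%s latency_ok=%s\n", avg, (avg < 50) ? "yes" : "no" }'
```

      An average round-trip time approaching the heartbeat interval, or any packet loss reported by ping, points at the network cause described in this step.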

    4. These warnings can also be observed when an excessive number of objects is stored in etcd. Please refer to the solution describing how to check the number of objects stored in etcd.
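      As an illustration, the keys stored in etcd can be grouped by resource type to spot which kind of object is growing. The key list below is a small hypothetical sample standing in for the output of ETCDCTL_API=3 etcdctl get / --prefix --keys-only on a live cluster:

```shell
# Hypothetical sample key list; on a live cluster this would be the output of
#   ETCDCTL_API=3 etcdctl get / --prefix --keys-only
keys='/kubernetes.io/pods/ns1/pod-a
/kubernetes.io/pods/ns1/pod-b
/kubernetes.io/events/ns1/evt-1
/kubernetes.io/secrets/ns1/db-creds'

# Keys have the form /<registry>/<resource>/<namespace>/<name>, so the
# third '/'-separated field is the resource type; count keys per type.
echo "$keys" | awk -F/ 'NF { count[$3]++ } END { for (t in count) print t, count[t] }' | sort
```

      A resource type with a disproportionately large count (events are a frequent offender) is a candidate for cleanup or a shorter retention setting.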

Root Cause

  • etcd must persist proposals to its log, so disk activity from other processes may cause high fsync latencies. As a result, etcd may miss heartbeats, causing request timeouts and temporary leader loss.

  • High disk operation latencies (wal_fsync_duration_seconds or backend_commit_duration_seconds) often indicate disk issues.

