Understanding etcd and the tunables/conditions affecting performance
Leader election and log replication of etcd
etcd is a consistent, distributed key-value store that operates as a cluster of replicated nodes. Following the Raft algorithm, etcd elects one node as the leader and the others as followers. The leader maintains the system's current state and ensures that the followers are up-to-date.
The leader node is responsible for log replication. It handles incoming write transactions from the client and writes a Raft log entry that it then broadcasts to the followers.

When an etcd client, such as kube-apiserver, connects to an etcd member and requests an action that requires a quorum, such as writing a value, and that member is a follower, the follower returns a message indicating that the transaction must be sent to the leader.

When the etcd client requests an action that requires a quorum, such as writing a value, from the leader, the leader keeps the client connection open while it writes the local Raft log, broadcasts the log entry to the followers, and waits for a majority of the followers to acknowledge that they committed the entry without failures. Only then does the leader send the acknowledgment to the etcd client and close the session. If the followers report failures and a consensus cannot be reached, the leader returns an error message to the client and closes the session.
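To observe this on a running cluster, you can check which member currently holds leadership. The following is a minimal sketch, assuming the openshift-etcd project is selected and an etcd pod named etcd-m0, as in the examples later in this section; the IS LEADER column of the table output identifies the current leader:
# oc exec -ti etcd-m0 -- etcdctl endpoint status --cluster -w table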
Note: For more information about etcd, see these external references:
- The etcd learner design (etcd.io)
- Failure modes (etcd.io)
OpenShift Container Platform timer tunables for etcd
OpenShift Container Platform maintains etcd timers with validated values that are optimized for each platform provider. The default etcd timers with platform=none or platform=metal are as follows:
- name: ETCD_ELECTION_TIMEOUT
value: "1000"
...
- name: ETCD_HEARTBEAT_INTERVAL
value: "100"
From an etcd perspective, the two key parameters are election timeout and heartbeat interval:
- Heartbeat interval: The frequency with which the leader notifies followers that it is still the leader.
- Election timeout: How long a follower node goes without hearing a heartbeat before it attempts to become the leader itself.
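To verify the values that are in effect on a running cluster, you can inspect the environment of an etcd pod. The following is a minimal sketch, assuming an etcd pod named etcd-m0 and a container named etcd, as in the examples later in this section:
# oc exec -n openshift-etcd etcd-m0 -c etcd -- env | grep -E 'ETCD_ELECTION_TIMEOUT|ETCD_HEARTBEAT_INTERVAL'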
These parameters do not provide the whole story for the control plane or even etcd. An etcd cluster is sensitive to disk latencies. Because etcd must persist proposals to its log, disk activity from other processes might cause long fsync latencies. The consequence is that etcd might miss heartbeats, causing request timeouts and temporary leader loss. During a leader loss and reelection, the Kubernetes API cannot process any requests, which causes a service-affecting event and cluster instability.
Effects of disk latency on etcd
An etcd cluster is sensitive to disk latencies. To understand the disk latency that etcd experiences in your control plane environment, run the fio test suite, as described in "How to use fio", to check etcd disk performance in OpenShift Container Platform.
Make sure that the final report classifies the disk as appropriate for etcd, as shown in the following example:
...
99th percentile of fsync is 5865472 ns
99th percentile of the fsync is within the recommended threshold - 20 ms, the disk can be used to host etcd
When a high latency disk is used, a message states that the disk is not recommended for etcd, as shown in the following example:
...
99th percentile of fsync is 25865472 ns
99th percentile of the fsync is greater than the recommended value which is 20 ms, faster disks are recommended to host etcd for better performance
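If you prefer to run fio directly rather than through the packaged test, the following minimal sketch reproduces the etcd write pattern of small sequential writes, each followed by an fdatasync. It assumes that fio is installed on the node and that /var/lib/etcd resides on the disk under test; the fdatasync duration percentiles in the fio output are the values to compare against the 20 ms threshold:
# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-perf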
When cluster deployments span multiple data centers and the disks that are used for etcd do not meet the recommended latency, the chances of service-affecting failures increase, and the network latency that the control plane can sustain is dramatically reduced.
Effects of network latency and jitter on etcd
Use the tools that are described in the MTU discovery and validation section to obtain the average and maximum network latency.
The value of the heartbeat interval should be around the maximum of the average round-trip time (RTT) between members, normally around 1.5x the round-trip time. With the OpenShift Container Platform default heartbeat interval of 100ms, the recommended RTT between control plane nodes is less than ~33ms, with a maximum of less than 66ms (66ms x 1.5 = 99ms). For more information, see the etcd tuning documentation (etcd.io). Any network latency that is higher might cause service-affecting events and cluster instability.
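For a quick check of the RTT between control plane nodes, you can run ping from a debug shell on one of the nodes. The following is a minimal sketch, assuming node names m0 and m1 as in the examples later in this section; the rtt min/avg/max/mdev summary line reports the average and maximum RTT in milliseconds:
# oc debug node/m1 -- chroot /host ping -c 60 m0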
What network latency should the organization expect? Network latency is influenced by many factors, including the technology of the transport networks, such as copper, fiber, wireless, or satellite, and the number and quality of the network devices in the transport path. A good evaluation reference is to compare the network latency in your organization with the commercial latencies that are published by telecommunications providers, such as Monthly IP Latency Statistics.
Consider network latency together with network jitter for more accurate calculations. Network jitter is the variance in network latency or, more specifically, the variation in the delay of received packets. Under ideal network conditions, the jitter should be as close to zero as possible. Network jitter affects the network latency calculations for etcd because the actual network latency over time is the RTT +/- jitter. In other words, a network with a maximum latency of 80ms and jitter of 30ms experiences latencies of 110ms, which means etcd misses heartbeats, causing request timeouts and temporary leader loss. During a leader loss and reelection, the Kubernetes API cannot process any requests, which causes a service-affecting event and cluster instability.
Measure the network jitter among all control plane nodes. To do so, you can use the iperf3 tool in UDP mode.
Note: Red Hat does not provide a supported container image with iPerf. The following KCS article documents a way to build and run custom iperf container images:
- KCS 5233541 - Testing Network Bandwidth in OpenShift using iPerf Container
- KCS 6129701 - How to run iPerf network performance test in OpenShift 4
By using the container image from KCS 6129701, you can take the following steps to measure jitter between two nodes. Connect to one of the control plane nodes and run the iperf container as an iperf server in host network mode. When it runs in server mode, the tool accepts both TCP and UDP tests.
Note: The quay.io/kinvolk/iperf3 image is not supported by Red Hat, and you are encouraged to build your own iperf image by following KCS 5233541 - Testing Network Bandwidth in OpenShift using iPerf Container.
# podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -s
Connect to another control plane node and run iperf in UDP client mode.
# podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -u -c <node_iperf_server> -t 300
The default test runs for 10 seconds, and at the end, the client output shows the average jitter from the client perspective. Run the test for 5 minutes, or 300 seconds, by specifying -t 300.
# oc debug node/m1
Starting pod/m1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 198.18.111.13
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -u -c m0
Connecting to host m0, port 5201
[ 5] local 198.18.111.13 port 60878 connected to 198.18.111.12 port 5201
[ ID] Interval Transfer Bitrate Total Datagrams
[ 5] 0.00-1.00 sec 129 KBytes 1.05 Mbits/sec 91
[ 5] 1.00-2.00 sec 127 KBytes 1.04 Mbits/sec 90
[ 5] 2.00-3.00 sec 129 KBytes 1.05 Mbits/sec 91
[ 5] 3.00-4.00 sec 129 KBytes 1.05 Mbits/sec 91
[ 5] 4.00-5.00 sec 127 KBytes 1.04 Mbits/sec 90
[ 5] 5.00-6.00 sec 129 KBytes 1.05 Mbits/sec 91
[ 5] 6.00-7.00 sec 127 KBytes 1.04 Mbits/sec 90
[ 5] 7.00-8.00 sec 129 KBytes 1.05 Mbits/sec 91
[ 5] 8.00-9.00 sec 127 KBytes 1.04 Mbits/sec 90
[ 5] 9.00-10.00 sec 129 KBytes 1.05 Mbits/sec 91
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-10.00 sec 1.25 MBytes 1.05 Mbits/sec 0.000 ms 0/906 (0%) sender
[ 5] 0.00-10.04 sec 1.25 MBytes 1.05 Mbits/sec 1.074 ms 0/906 (0%) receiver
iperf Done.
On the iperf server, the output shows the jitter for each one-second interval, and the average is shown at the end. For the purpose of this test, identify the maximum jitter that is experienced during the test, ignoring the output of the first second because it might contain an invalid measurement.
# oc debug node/m0
Starting pod/m0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 198.18.111.12
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 198.18.111.13, port 44136
[ 5] local 198.18.111.12 port 5201 connected to 198.18.111.13 port 60878
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-1.00 sec 124 KBytes 1.02 Mbits/sec 4.763 ms 0/88 (0%)
[ 5] 1.00-2.00 sec 127 KBytes 1.04 Mbits/sec 4.735 ms 0/90 (0%)
[ 5] 2.00-3.00 sec 129 KBytes 1.05 Mbits/sec 0.568 ms 0/91 (0%)
[ 5] 3.00-4.00 sec 127 KBytes 1.04 Mbits/sec 2.443 ms 0/90 (0%)
[ 5] 4.00-5.00 sec 129 KBytes 1.05 Mbits/sec 1.372 ms 0/91 (0%)
[ 5] 5.00-6.00 sec 127 KBytes 1.04 Mbits/sec 2.769 ms 0/90 (0%)
[ 5] 6.00-7.00 sec 129 KBytes 1.05 Mbits/sec 2.393 ms 0/91 (0%)
[ 5] 7.00-8.00 sec 127 KBytes 1.04 Mbits/sec 0.883 ms 0/90 (0%)
[ 5] 8.00-9.00 sec 129 KBytes 1.05 Mbits/sec 0.594 ms 0/91 (0%)
[ 5] 9.00-10.00 sec 127 KBytes 1.04 Mbits/sec 0.953 ms 0/90 (0%)
[ 5] 10.00-10.04 sec 5.66 KBytes 1.30 Mbits/sec 1.074 ms 0/4 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-10.04 sec 1.25 MBytes 1.05 Mbits/sec 1.074 ms 0/906 (0%) receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Add the calculated jitter as a penalty to the network latency. For example, if the network latency is 80ms and the jitter is 30ms, an effective network latency of 110ms should be considered for the purposes of the control plane. In this example, that goes above the 100ms threshold, and the system will miss heartbeats.
When you calculate the network latency for etcd, use the effective network latency, which is the sum of the RTT and the jitter. Using the average jitter value to calculate the penalty is possible, but the cluster can sporadically miss heartbeats if RTT + max(jitter) is higher than the etcd heartbeat timer. Consider using the 99th percentile or the maximum jitter value for a more resilient deployment.
Effective Network Latency = RTT + max(jitter)
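A minimal shell sketch of this calculation, using the example values from above (80ms maximum RTT, 30ms maximum jitter):
# rtt_ms=80; max_jitter_ms=30
# echo "effective latency: $(( rtt_ms + max_jitter_ms )) ms"
effective latency: 110 ms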
Effects of consensus latency on etcd
This procedure can be run only on an active cluster. Run the disk and network tests that were mentioned earlier while you plan a cluster deployment; use this procedure to validate and monitor cluster health after a deployment.
By using the etcdctl CLI, you can monitor the latency for reaching consensus as experienced by etcd. You must identify one of the etcd pods and then retrieve the endpoint health.
# oc get pods -n openshift-etcd -l app=etcd
NAME READY STATUS RESTARTS AGE
etcd-m0 4/4 Running 4 8h
etcd-m1 4/4 Running 4 8h
etcd-m2 4/4 Running 4 8h
# oc exec -ti etcd-m0 -- etcdctl endpoint health -w table
+----------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+----------------------------+--------+-------------+-------+
| https://198.18.111.12:2379 | true | 3.798349ms | |
| https://198.18.111.14:2379 | true | 7.389608ms | |
| https://198.18.111.13:2379 | true | 6.263117ms | |
+----------------------------+--------+-------------+-------+
For a better understanding of the etcd latency for consensus, you can run the previous command on a precise watch cycle for a few minutes to verify that the numbers remain below the ~66ms threshold. The closer the consensus time gets to 100ms, the more likely the cluster is to experience service-affecting events and instability.
# oc exec -ti etcd-m0 -- watch -dp -c etcdctl endpoint health -w table
+----------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+----------------------------+--------+-------------+-------+
| https://198.18.111.12:2379 | true | 9.533405ms | |
| https://198.18.111.13:2379 | true | 4.628054ms | |
| https://198.18.111.14:2379 | true | 5.803378ms | |
+----------------------------+--------+-------------+-------+
Effects of the etcd peer round trip time on performance
The etcd peer round trip time is not the same as the network round trip time. It is an end-to-end metric that shows how long etcd takes to finish replicating a client request among all the etcd members.
The OpenShift Container Platform console provides dashboards to visualize the various etcd metrics. In the console, click Observe > Dashboards and, from the dropdown list, select etcd.
A plot that summarizes the etcd peer round trip time can be found near the end of the etcd dashboard page; such a plot can show, for example, a control plane whose peer round trip time is causing cluster instability.
Note: These etcd metrics are collected by the OpenShift metrics system in Prometheus. You can access them from the CLI by following KCS 5151831 - How to query from the command line Prometheus statistics.
# Get token to connect to Prometheus
SECRET=$(oc get secret -n openshift-user-workload-monitoring | grep prometheus-user-workload-token | head -n 1 | awk '{print $1 }')
export TOKEN=$(oc get secret $SECRET -n openshift-user-workload-monitoring -o json | jq -r '.data.token' | base64 -d)
export THANOS_QUERIER_HOST=$(oc get route thanos-querier -n openshift-monitoring -o json | jq -r '.spec.host')
Queries must be URL-encoded. The following example shows how to retrieve the metrics that are reporting the round trip time (in seconds) for etcd to finish replicating the client requests among the members:
# prometheus query
query="histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))"
# urlencoded query
encoded_query=$(printf "%s" $query |jq -sRr @uri)
# querying the OpenShift metrics service
curl -s -X GET -k -H "Authorization: Bearer $TOKEN" "https://$THANOS_QUERIER_HOST/api/v1/query?query=$encoded_query" | jq '.data.result[] | .metric.pod,.value[1]'
"etcd-m2"
"0.09318400000000004" # example ~93ms
"etcd-m0"
"0.050688" # example ~51ms
"etcd-m1"
"0.050688" # example ~51ms
Other relevant metrics to explore are as follows (an example query follows the list):
- etcd_disk_wal_fsync_duration_seconds_bucket metric reports the etcd WAL fsync duration.
- etcd_disk_backend_commit_duration_seconds_bucket metric reports the etcd backend commit latency duration.
- etcd_server_leader_changes_seen_total metric reports the leader changes.
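For example, to retrieve the 99th percentile of the WAL fsync duration from the CLI, reuse the TOKEN and THANOS_QUERIER_HOST variables that were set earlier; the following is a minimal sketch, not a definitive procedure:
# prometheus query for the 99th percentile of the WAL fsync duration
query="histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))"
# urlencoded query
encoded_query=$(printf "%s" "$query" | jq -sRr @uri)
# querying the OpenShift metrics service
curl -s -X GET -k -H "Authorization: Bearer $TOKEN" "https://$THANOS_QUERIER_HOST/api/v1/query?query=$encoded_query" | jq '.data.result[] | .metric.pod,.value[1]'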
Effects of database size on etcd
The etcd database size has a direct impact on the time to complete the etcd defragmentation process. OpenShift Container Platform automatically runs etcd defragmentation on one etcd member at a time when it detects at least 45% fragmentation. During the defragmentation process, the etcd member cannot process any requests. On small etcd databases, the defragmentation process completes in less than a second. With larger etcd databases, the disk latency directly impacts the defragmentation time, causing additional latency because operations are blocked while defragmentation happens.
The size of the etcd database is a factor to consider when network partitions isolate a control plane node for a period, and the control plane needs to resync after communication is re-established.
Minimal options exist for controlling the size of the etcd database, because it depends on the operators and applications in the system. When you consider the latency range under which the system will operate, account for the effects of synchronization and defragmentation at the expected size of the etcd database.
The magnitude of the effects is specific to the deployment. The time to complete a defragmentation degrades the transaction rate, because the etcd member cannot accept updates during the defragmentation process. Similarly, the time for etcd resynchronization of large databases with a high change rate affects the transaction rate and transaction latency of the system. Consider the following two examples for the type of impacts to plan for.
The first example concerns the effect of database size on defragmentation: writing an etcd database of 1 GB to a slow 7200 RPM disk at 80 Mbit/s takes about 1 minute and 40 seconds. In such a scenario, the defragmentation process takes at least this long, if not longer, to complete.
The second example concerns the effect of database size on etcd synchronization: if 10% of a 1 GB etcd database changes while one of the control plane nodes is disconnected, the resync needs to transfer at least 100 MB. Transferring 100 MB over a 1 Gbps link takes 800 ms. On clusters with regular Kubernetes API transactions, the larger the etcd database size, the more likely network instability is to cause control plane instability.
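A minimal shell sketch of the arithmetic behind these two examples:
# echo "$(( 8000 / 80 ))"          # seconds to write a 1 GB database (8000 Mbit) at 80 Mbit/s
100
# echo "$(( 800 * 1000 / 1000 ))"  # milliseconds to transfer 100 MB (800 Mbit) over a 1 Gbps link
800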
In OpenShift Container Platform, the etcd dashboard has a plot that reports the size of the etcd database, or you can obtain it from the CLI by using the etcdctl tool.
# oc get pods -n openshift-etcd -l app=etcd
NAME READY STATUS RESTARTS AGE
etcd-m0 4/4 Running 4 22h
etcd-m1 4/4 Running 4 22h
etcd-m2 4/4 Running 4 22h
# oc exec -t etcd-m0 -- etcdctl endpoint status -w simple | cut -d, -f 1,3,4
https://198.18.111.12:2379, 3.5.6, 1.1 GB
https://198.18.111.13:2379, 3.5.6, 1.1 GB
https://198.18.111.14:2379, 3.5.6, 1.1 GB
Effects of the Kubernetes API transaction rate on etcd
The Kubernetes API transaction rate that is possible when you use a stretched control plane depends on the characteristics of the particular deployment: the combination of the etcd disk latency, the etcd round trip time, and the size of the objects that are written to the API. As a result, when you use stretched control planes, cluster administrators should test the environment to determine the sustained transaction rate that is possible for their environment. The kube-burner tool can be used for this purpose.
Determining Kubernetes API transaction rate for your environment
There is no way to predetermine the transaction rate that the Kubernetes API can handle without directly measuring it. One of the tools that you can use for load testing the control plane is kube-burner. The binary brings an opinionated OpenShift wrapper for testing OpenShift clusters and can be used to test cluster or node density. For testing the control plane, kube-burner ocp has three workload profiles: cluster-density, cluster-density-v2, and cluster-density-ms. Each workload profile creates a series of resources designed to load the control plane. For more information about each profile, see the kube-burner ocp workload documentation.
A detailed description of running kube-burner ocp is beyond the scope of this document. The following example shows a run that creates and deletes resources within 20 minutes:
# kube-burner ocp cluster-density-ms --churn-duration 20m --churn-delay 0s --iterations 10 --timeout 30m
The OpenShift Container Platform console provides a dashboard with all the relevant API performance information. Click Observe > Dashboards, and from the Dashboards menu, click API Performance.
During the run, observe the API Performance dashboard to understand how the control plane responds under load, the 99th percentile execution time for the various verbs, and the request rates by read and write. Use this information and your knowledge of the organization's workload to determine the load that the organization can put on clusters with this specific stretched control plane deployment.
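The same data that backs this dashboard can also be queried from the CLI by reusing the TOKEN and THANOS_QUERIER_HOST variables that were set earlier. The following is a minimal sketch that reports the API request rate by verb; apiserver_request_total is a standard kube-apiserver metric, but verify the exact label names in your environment:
# prometheus query for the API request rate by verb
query="sum(rate(apiserver_request_total[5m])) by (verb)"
# urlencoded query
encoded_query=$(printf "%s" "$query" | jq -sRr @uri)
# querying the OpenShift metrics service
curl -s -X GET -k -H "Authorization: Bearer $TOKEN" "https://$THANOS_QUERIER_HOST/api/v1/query?query=$encoded_query" | jq '.data.result[] | .metric.verb,.value[1]'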