How to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 3.11
- 4
- etcd
Issue
- How to graph etcd_disk_wal_fsync_duration and etcd_disk_backend_commit_duration using Prometheus to gauge etcd performance in OCP?
- How to graph the cpu iowait for the etcd members in OpenShift.
- How to check the etcd leader changes in Prometheus.
- How to see the etcd db size.
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Review etcd backend performance requirements for OpenShift for the recommended values of the metrics referenced in this KCS and additional information.
Check the etcd metrics in OpenShift web console
In the OpenShift web console, go to "Observe" > "Metrics" in the left menu.
IMPORTANT NOTES
- Make sure that the Stacked checkbox in the graph is unchecked.
- Set the time range of the graphs to 1 week (1w) or any time frame covering several days (if possible, at least 1 week/7 days).
- If there is an issue during a specific time, in addition to the query for several days, zoom in on the graph for a range of several hours around the issue (if possible, from around 2 hours before to around 2 hours after the issue happened).
- Collect graphs for each query individually, not several queries in the same graph.
metric etcd_disk_backend_commit_duration
The etcd_disk_backend_commit_duration metric is the latency distribution of commits called by the backend when writing to disk. A commit requires fdatasync to persist the data to permanent storage.
To graph the etcd_disk_backend_commit_duration 99th percentile, the promql query histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) can be used in Prometheus.
As the recommended value for the etcd_disk_backend_commit_duration 99th percentile should be lower than 25ms, the following queries could help to easily view if there are performance issues (please, collect graphs for each query individually, not several queries in the same graph):
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) < 0.05
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.03
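For reference, histogram_quantile estimates the percentile by linear interpolation across the cumulative le buckets of the histogram. The simplified Python sketch below mirrors that interpolation (the real PromQL function operates on per-second bucket rates rather than raw counts, but the interpolation logic is the same; the bucket boundaries and counts are illustrative, not from a real cluster):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative Prometheus-style buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound
    and ending with float('inf'), the same shape as the *_bucket series that
    PromQL's histogram_quantile() consumes.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # quantile falls in the open-ended bucket
            # linear interpolation inside the bucket, as PromQL does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative commit-duration buckets (seconds): 990 of 1000 commits <= 25ms
sample = [(0.001, 200), (0.005, 700), (0.025, 990), (0.1, 1000), (float('inf'), 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"p99 ~= {p99 * 1000:.1f} ms")  # 25.0 ms on these sample buckets
```

With these sample buckets the p99 sits exactly at the 25ms threshold, which is the kind of value the helper queries above are designed to surface.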
metric etcd_disk_wal_fsync_duration
The latency distribution of fsync calls made by the WAL (Write-Ahead Log). A wal_fsync is called when etcd persists its log entries to disk before applying them; this is where etcd records all the changes before they are applied to the database. The WAL files look like this on disk:
sh-5.1# ls -lrth /var/lib/etcd/member/wal
total 367M
-rw-------. 1 root root 62M Feb 4 06:14 00000000000007d3-0000000000be6dbf.wal
-rw-------. 1 root root 62M Feb 4 06:34 00000000000007d4-0000000000be85c7.wal
-rw-------. 1 root root 62M Feb 4 06:53 00000000000007d5-0000000000be9e40.wal
-rw-------. 1 root root 62M Feb 4 07:10 00000000000007d6-0000000000beb6a0.wal
-rw-------. 1 root root 62M Feb 4 07:10 0.tmp
-rw-------. 1 root root 62M Feb 4 07:15 00000000000007d7-0000000000bece35.wal
To graph the etcd_disk_wal_fsync_duration 99th and 99.9th percentile durations, the promql queries histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) and histogram_quantile(0.999, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) can be used in Prometheus.
As the recommended values for the etcd_disk_wal_fsync_duration 99th and 99.9th percentiles should be lower than 10ms, the following queries could help to easily view if there are performance issues (please, collect graphs for each query individually, not several queries in the same graph):
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) < 0.02
histogram_quantile(0.999, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) < 0.02
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.015
histogram_quantile(0.999, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.015
metric apiserver_storage_objects
The number of objects in the cluster.
To graph the sum of apiserver_storage_objects, the following promql query can be used in Prometheus (please, collect graphs for each query individually, not several queries in the same graph):
sum(resource:apiserver_storage_objects:max)
Also, check the resources with more objects with the following promql (please, collect graphs for each query individually, not several queries in the same graph):
topk(10,resource:apiserver_storage_objects:max)
IMPORTANT: only for the list of the 10 resources with the most objects, please take a screenshot of the graph and also a screenshot of the table. Also collect the list of objects in text format as explained in How to list the number of objects in ETCD?
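When collecting the topk list in text form, the data comes back from the Prometheus API as an instant-query JSON vector. A minimal Python sketch of parsing that shape into a sorted text listing (the payload below is illustrative sample data, not from a real cluster):

```python
import json

# Illustrative instant-query response for
# topk(10, resource:apiserver_storage_objects:max)
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"resource": "secrets"}, "value": [1707040000, "3120"]},
      {"metric": {"resource": "events"}, "value": [1707040000, "2875"]},
      {"metric": {"resource": "configmaps"}, "value": [1707040000, "1930"]}
    ]
  }
}
""")

def top_objects(response):
    """Return (resource, count) pairs sorted by object count, descending."""
    rows = [(r["metric"]["resource"], int(float(r["value"][1])))
            for r in response["data"]["result"]]
    return sorted(rows, key=lambda row: row[1], reverse=True)

for resource, count in top_objects(sample_response):
    print(f"{resource:<20} {count}")
```

The same parsing applies to any instant vector returned by the Prometheus `/api/v1/query` endpoint; only the metric labels change per query.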
metric for cpu iowait
To graph the cpu iowait, the following promql query can be used in Prometheus. Values in that graph should be below 4.0:
(sum(irate(node_cpu_seconds_total {mode="iowait"} [2m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu) * 100 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
Note: If the iowait metric fails for 1 week (1w), please select 4 days or 2 days instead (only for this metric).
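The iowait query above rates the per-CPU counter node_cpu_seconds_total{mode="iowait"}, sums across CPUs, divides by the CPU count, and multiplies by 100, i.e. it plots the average iowait percentage across all cores of each master node. A small Python sketch of that arithmetic, using illustrative counter snapshots taken 120 seconds apart:

```python
def iowait_percent(samples_t0, samples_t1, interval_s):
    """Average iowait percentage across CPUs from two counter snapshots.

    samples_t0 / samples_t1: dict mapping cpu -> cumulative iowait seconds,
    mirroring node_cpu_seconds_total{mode="iowait"} at two scrape times.
    """
    per_cpu_rates = [(samples_t1[cpu] - samples_t0[cpu]) / interval_s
                     for cpu in samples_t0]
    # sum of rates / number of CPUs * 100, as in the PromQL expression above
    return sum(per_cpu_rates) / len(per_cpu_rates) * 100

# Illustrative 4-core node: each CPU accumulated 2.4-3.4 s of iowait in 120 s
t0 = {"cpu0": 100.0, "cpu1": 80.0, "cpu2": 95.0, "cpu3": 110.0}
t1 = {"cpu0": 103.0, "cpu1": 82.4, "cpu2": 97.4, "cpu3": 113.4}
pct = iowait_percent(t0, t1, 120)
print(f"iowait ~= {pct:.2f}%")  # well below the 4.0 guidance above
```

A sustained value above 4.0 in the graph suggests the CPUs are spending a significant share of their time blocked on storage.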
metric etcd_network_peer_round_trip_time
The etcd_network_peer_round_trip_time metric is the duration it takes for a network request to go from one peer to another.
To graph the etcd_network_peer_round_trip_time 99th percentile, the promql query histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) can be used in Prometheus.
As the recommended value for the etcd_network_peer_round_trip_time 99th percentile should be lower than 50ms, the following queries could help to easily view if there are performance issues (please, collect graphs for each query individually, not several queries in the same graph):
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) < 0.1
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.06
Additional metrics
The following metrics will also help to check the leader changes and the etcd database size (please, collect graphs for each query individually, not several queries in the same graph):
changes(etcd_server_leader_changes_seen_total[1h])
etcd_mvcc_db_total_size_in_use_in_bytes
etcd_debugging_mvcc_db_total_size_in_bytes
etcd_server_quota_backend_bytes
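The three size metrics above are most useful in combination: comparing in-use size against total size shows how much space a defragmentation could reclaim, and comparing total size against the backend quota shows how close the database is to the quota alarm. A Python sketch of those two ratios (the byte values are illustrative; check etcd_server_quota_backend_bytes on your own cluster rather than assuming a quota):

```python
# Illustrative readings of the three series for one etcd member, in bytes
db_in_use = 1_200_000_000   # etcd_mvcc_db_total_size_in_use_in_bytes
db_total  = 2_000_000_000   # etcd_debugging_mvcc_db_total_size_in_bytes
quota     = 8_589_934_592   # etcd_server_quota_backend_bytes (8 GiB here)

fragmentation = 1 - db_in_use / db_total  # fraction reclaimable by defrag
quota_used = db_total / quota             # fraction of the backend quota used

print(f"fragmentation: {fragmentation:.0%}")
print(f"quota used:    {quota_used:.1%}")
```

These can also be graphed directly in Prometheus by dividing the series, e.g. etcd_mvcc_db_total_size_in_use_in_bytes / etcd_debugging_mvcc_db_total_size_in_bytes.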
Please reference the recommended etcd practices from the OCP documentation, and also the etcd documentation, for interpreting these graphs and determining whether these values are appropriate (or indicate an underlying issue). From the etcd FAQ entry "What does the etcd warning 'failed to send on time' mean?":
To rule out a slow disk from causing this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem.
[...]
To rule out a slow disk from causing this warning, monitor wal_fsync_duration_seconds (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem.
ETCD node disk specific metrics
OpenShift collects metrics that can be used to graph the average wait times and I/O utilization of the disks used for etcd storage over a given period. This is more useful than fio tests, as it can be used to view performance at a specific historical point in time when the cluster was being impacted by slow etcd performance. To optimise the data being shown, it is necessary to identify the node and the storage device on the node, and specify the correct instance and device fields respectively for your specific cluster configuration.
Firstly, identify a pattern that can be used for selecting your control plane nodes where the etcd cluster is run:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master0.ocp.example.com Ready control-plane,master 5d10h v1.29.14+7cf4c05
master1.ocp.example.com Ready control-plane,master 5d10h v1.29.14+7cf4c05
master2.ocp.example.com Ready control-plane,master 5d10h v1.29.14+7cf4c05
worker0.ocp.example.com Ready worker 5d9h v1.29.14+7cf4c05
worker1.ocp.example.com Ready worker 5d9h v1.29.14+7cf4c05
worker2.ocp.example.com Ready worker 5d9h v1.29.14+7cf4c05
In this case, the nodes have a common master node naming pattern (later referred to as the <node_name_pattern>).
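Note that PromQL label matchers such as instance=~"master.+" anchor the regular expression against the whole label value, so the pattern must account for the full node name. A quick Python check of that behavior, using re.fullmatch to mirror PromQL's anchoring and the node names from the example output above:

```python
import re

nodes = [
    "master0.ocp.example.com", "master1.ocp.example.com",
    "master2.ocp.example.com", "worker0.ocp.example.com",
    "worker1.ocp.example.com", "worker2.ocp.example.com",
]

# "master" is the <node_name_pattern>; PromQL's =~ matches the full label
# value, which re.fullmatch reproduces here.
pattern = re.compile("master" + ".+")
selected = [n for n in nodes if pattern.fullmatch(n)]
print(selected)  # only the three control plane nodes
```

If the selection also picks up worker nodes (for example, when workers share a naming prefix with masters), fall back to the per-node instance="<nodename>" form described below.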
Launch a debug session on one of the etcd nodes and use lsblk to check the storage device names in use. First, run the debug command:
$ oc debug node/master0.ocp.example.com
Then, chroot to the node root and run the lsblk command:
# chroot /host
# lsblk
The output of these commands should look something like this:
$ oc debug node/master0.ocp.example.com
Starting pod/master0ocpexamplecom-debug-xxxx ...
To use host binaries, run `chroot /host`
Pod IP: xxx.xxx.xxx.xxx
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sr0 11:0 1 558K 0 rom
vda 252:0 0 30G 0 disk
|-vda1 252:1 0 1M 0 part
|-vda2 252:2 0 127M 0 part
|-vda3 252:3 0 384M 0 part /boot
`-vda4 252:4 0 29.5G 0 part /var
/sysroot/ostree/deploy/rhcos/var
/sysroot
/usr
/etc
/
The root device name is what is required. In clusters where etcd has been given its own dedicated disk, it will be the one mounted on /var/lib/etcd. In the above example, there is only one disk: the /var directory and everything underneath it, including /var/lib/etcd, is on partition vda4, and the root device is called vda. Therefore, vda is the device name to use (later referred to as the <device_name>).
Using the <node_name_pattern> and <device_name> determined above, replace the matching stubs in the following queries. In deployments where a common node name pattern cannot be determined, replace the whole instance=~"<node_name_pattern>.+" field with instance="<nodename>", where <nodename> identifies a single control plane node and repeat the graphs 3 times, once for each control plane node:
For disk read times:
irate(node_disk_read_time_seconds_total{instance=~"<node_name_pattern>.+",device="<device_name>",job="node-exporter"}[5m])/irate(node_disk_reads_completed_total{instance=~"<node_name_pattern>.+",device="<device_name>",job="node-exporter"}[5m])
For disk write times:
irate(node_disk_write_time_seconds_total{instance=~"<node_name_pattern>.+",device="<device_name>",job="node-exporter"}[5m]) / irate(node_disk_writes_completed_total{instance=~"<node_name_pattern>.+",device="<device_name>",job="node-exporter"}[5m])
For disk I/O time:
irate(node_disk_io_time_seconds_total{instance=~"<node_name_pattern>.+",device="<device_name>",job="node-exporter",} [5m])
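The read and write queries above divide the per-device time counter by the completed-operations counter, which yields the average latency per operation over the window; the I/O time query alone approximates device utilization (seconds busy per second). A minimal Python sketch of the latency arithmetic, using illustrative counter snapshots 300 seconds apart (the [5m] window):

```python
def avg_latency_ms(time_t0, time_t1, ops_t0, ops_t1):
    """Average per-operation latency in ms from two counter snapshots.

    time_*: cumulative seconds spent on I/O (node_disk_*_time_seconds_total);
    ops_*: cumulative operations completed (node_disk_*_completed_total).
    """
    return (time_t1 - time_t0) / (ops_t1 - ops_t0) * 1000

# Illustrative: 4.5 s of write time spent over 1500 writes in the window
latency = avg_latency_ms(time_t0=120.0, time_t1=124.5,
                         ops_t0=50_000, ops_t1=51_500)
print(f"avg write latency ~= {latency:.1f} ms")  # 3.0 ms
```

The same division is what the PromQL ratio performs continuously, point by point, across the graphed range.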
Note: Regarding the node disk metrics (read, write, and I/O time), there is no objective pass/fail value for these. These metrics serve to help troubleshoot potential performance issues, for example by drilling down into an individual master node's disk performance. The thresholds will also vary depending on multiple factors, such as the sizing of the cluster and the number of objects in the cluster.
Root Cause
Clustered etcd is extremely sensitive to storage and network backend performance, and can be easily disrupted by any underlying bottlenecks.
Checking the metrics for several days (even one week) will show the overall etcd performance.
Diagnostic Steps
In the OpenShift web console, go to "Observe" > "Metrics" in the left menu and check the metrics and the queries shown in the "Resolution" section.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.