How to use 'fio' to check etcd disk performance in OpenShift

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP, OCP)
    • 3.11
    • 4

Issue

  • etcd has delicate disk response requirements, and it is often necessary to ensure that the speed that etcd writes to its backing storage is fast enough for production workloads.

  • etcd alerts from the web console or frequent error messages such as the below may suggest that writes are taking too long:

    2020-10-21T09:56:00.246667768Z 2020-10-21 09:56:00.246542 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-kube-scheduler/localhost-recovery-client\" " with result  "range_response_count:1 size:407" took too long (113.372697ms) to execute
    
  • The performance documentation on etcd suggests that in production workloads, the wal_fsync_duration_seconds p99 duration should be less than 10ms to confirm the disk is reasonably fast.

  • Depending on the severity of disk speed issues, impact can range from frequent alerting to overall cluster instability.

  • For more general information regarding infrastructure requirements, please see etcd backend performance requirements.

Resolution

Red Hat now provides an image that is designed to test the performance of etcd in OpenShift 3 and 4.

IMPORTANT NOTE: The fio test is a short test executed at a specific moment in time. It can show whether the disk is not fast enough to meet the etcd requirements, but because other loads on the disk can affect etcd performance over the long term, causing it to misbehave, it is not recommended to trust fio results alone. Instead, it is recommended to check the etcd metrics over several hours or even days to understand the real etcd behavior over a longer period of time, as explained in how to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift.
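For the metrics-based check, the fsync p99 can be queried from the cluster's Prometheus. The following is a hedged sketch that assumes the default prometheus-k8s route in the openshift-monitoring namespace and a logged-in oc session; route and namespace names may differ in your cluster:

```shell
# Sketch: query the etcd fsync p99 over the last 5 minutes from the in-cluster
# Prometheus (assumes the default prometheus-k8s route in openshift-monitoring)
QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
TOKEN=$(oc whoami -t)
HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode "query=$QUERY" \
  "https://$HOST/api/v1/query"
```

Values returned are in seconds, so the recommended upper bound corresponds to 0.01.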

Running fio test

If you still want to check `fio` results after reviewing the note above, the container can be executed with the following procedure:
  • For OpenShift 4 connect to a Master node using oc debug node/<master_node> and run a container with the image using podman:

    $ oc debug node/<master_node>
    [...]
    sh-4.4# chroot /host bash
    [root@<master_node> /]# podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
    
  • For OpenShift 3.11 connect to a Master node using ssh and run a container with the image:

    $ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
    

The client will pull the container image and run a specially tailored version of fio. The results report the 99th percentile of fsync latency and whether or not it is within the recommended threshold to host etcd.
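The job the image runs can be approximated by invoking fio directly. The options below are reconstructed from the "global options" section of the sample JSON output; the exact job file inside the etcd-perf image may pass additional flags:

```shell
# Hedged sketch: approximate the container's fio job from its global options
# (the actual job inside quay.io/cloud-bulldozer/etcd-perf may differ).
# Run on a master node so /var/lib/etcd points at the etcd backing storage.
fio --name=etcd-perf \
    --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd --size=100m --bs=8000 \
    --output-format=json
```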

The following is an example output:

---------------------------------------------------------------- Running fio ---------------------------------------------------------------------------
{
  "fio version" : "fio-3.35",
  "timestamp" : 1718044942,
  "timestamp_ms" : 1718044942749,
  "time" : "Mon Jun 10 18:42:22 2024",
  "global options" : {
    "rw" : "write",
    "ioengine" : "sync",
    "fdatasync" : "1",
    "directory" : "/var/lib/etcd",
    "size" : "100m",
    "bs" : "8000"
  },

[...]
INFO: 99th percentile of fsync is 2441216 ns
INFO: 99th percentile of the fsync is within the recommended threshold: - 10 ms, the disk can be used to host etcd

and the message if the disk is not fast enough:

INFO: 99th percentile of fsync is 65798144 ns
WARN: 99th percentile of the fsync is greater than the recommended value which is 65798144 ns > 10 ms, faster disks are recommended to host etcd for better performance
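The reported percentile is in nanoseconds; dividing by 1,000,000 gives milliseconds for comparison against the 10 ms recommendation. A quick sketch using the values from the sample outputs above:

```shell
# Convert the fsync p99 reported by the container from nanoseconds to
# milliseconds and compare it against the 10 ms recommendation
check_p99() {
  awk -v ns="$1" 'BEGIN {
    ms = ns / 1000000
    printf "p99 = %.2f ms -> %s\n", ms, (ms < 10 ? "OK for etcd" : "too slow for etcd")
  }'
}
check_p99 2441216    # prints: p99 = 2.44 ms -> OK for etcd
check_p99 65798144   # prints: p99 = 65.80 ms -> too slow for etcd
```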

As a reminder, the fio test is a short test executed at a specific moment in time. Because other loads on the disk can affect etcd performance over the long term, do not trust fio results alone; check the etcd metrics over several hours or even days instead, as explained in how to graph etcd metrics using Prometheus to gauge etcd performance in OpenShift.

Root Cause

etcd has delicate disk response requirements (see etcd backend performance requirements for OpenShift), and it is often necessary to ensure that the speed that etcd writes to its backing storage is fast enough for production workloads.

Diagnostic Steps

Check for "took too long" alerts in the etcd pod logs, or via master-logs etcd etcd in version 3.11:

2020-10-22T09:06:07.775029584Z 2020-10-22 09:06:07.774962 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-kube-apiserver/user-serving-cert-005\" " with result "range_response_count:0 size:7" took too long (147.79882ms) to execute
2020-10-22T09:06:07.780085421Z 2020-10-22 09:06:07.780041 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-controller-manager-operator/openshift-controller-manager-operator-lock\" " with result "range_response_count:1 size:708" took too long (152.001071ms) to execute
2020-10-22T09:06:15.366295322Z 2020-10-22 09:06:15.366199 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-authentication/v4-0-config-system-console-config\" " with result "range_response_count:0 size:7" took too long (219.025347ms) to execute
2020-10-22T09:06:15.367341658Z 2020-10-22 09:06:15.367313 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-storage/rook-csi-cephfs-provisioner-sa\" " with result "range_response_count:1 size:1688" took too long (218.732886ms) to execute
2020-10-22T09:06:15.367861972Z 2020-10-22 09:06:15.367822 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/kube-system/system:openshift:leader-election-lock-kube-controller-manager\" " with result "range_response_count:1 size:450" took too long (106.840136ms) to execute

The above warning is logged each time a range request takes longer than 100ms to execute.

In rarer cases, these messages can also indicate CPU starvation, network latency, or user requests that fetch too many keys at once.
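When triaging, it can help to pull the reported durations out of the log lines to see how far above the 100ms logging threshold requests are. A small sketch (the sample line is abbreviated from the logs above):

```shell
# Extract the duration (in ms) from etcd "took too long" log lines
extract_ms() {
  sed -n 's/.*took too long (\([0-9.]*\)ms).*/\1/p'
}
echo 'etcdserver: read-only range request took too long (147.79882ms) to execute' | extract_ms
# prints: 147.79882
```

Sorting the extracted values (for example with `sort -n`) quickly shows the worst offenders.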


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.