OpenShift 4 disk performance degradation on Azure

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Azure

Issue

  • Disk performance issues on OpenShift 4 clusters on Azure.

  • OpenShift cluster performance issues on Azure.

  • The etcd operator becomes degraded, and related messages appear in the etcd logs:

    etcdserver: read-only range request ... took too long to execute
    
    embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
    

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

The default disk size for control plane nodes in both IPI and UPI installations is 1024 GB, as explained in Why is the minimum recommended size of disk for control plane nodes 1024 GB when installing OpenShift 4 on Azure.

IMPORTANT NOTE: the host caching option for the Azure machines must be set to ReadOnly, as explained in optimizing storage performance for Microsoft Azure, especially for control plane nodes.
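One way to verify the caching setting is to inspect the osDisk section of the Azure Machine objects. The sketch below assumes jq is available; on a live cluster the JSON would come from an oc invocation like the one in the comment, but here a minimal, hypothetical sample Machine stands in for the real output.

```shell
# Sketch: verify the osDisk cachingType on Azure control plane Machine objects.
# On a live cluster the input would come from something like:
#   oc get machines -n openshift-machine-api \
#     -l machine.openshift.io/cluster-api-machine-role=master -o json
# The sample below is a hypothetical, trimmed-down stand-in for that output.
sample='{"items":[{"metadata":{"name":"cluster-abc123-master-0"},
  "spec":{"providerSpec":{"value":{"osDisk":{"cachingType":"ReadOnly","diskSizeGB":1024}}}}}]}'

# Print each machine with its caching mode; anything other than ReadOnly
# should be reviewed against the Azure storage optimization guidance.
echo "$sample" | jq -r '.items[] |
  "\(.metadata.name) cachingType=\(.spec.providerSpec.value.osDisk.cachingType)"'
```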

To avoid most of the disk performance issues in Azure (including bad etcd performance), the recommendation is to move etcd to a different disk (when possible, a Premium SSD v2 disk), but that procedure is not supported when using control plane machine sets, as it requires the disk to be available when the machine is created. For IPI installations not using control plane machine sets, it may be possible to use that procedure if the additional disk is added when the machine is created.

Note: starting with OpenShift 4.20, there is a Technology Preview feature that allows configuring additional etcd disks on Azure IPI installations, as shown in dedicated disk for etcd on Microsoft Azure (Technology Preview) and detailed in configuring a dedicated disk for etcd. Unfortunately, that feature is only available for newly installed clusters. For clusters already installed, there is an internal RFE (RFE-5214), still under discussion.

When it is not possible to move etcd to a different disk, the minimum recommendation for the control plane nodes is to use 1 TB Premium SSD disks (P30), which provide 5000 IOPS/200 MBps, in combination with at least Standard_D8s_v3 instances, which support an uncached write throughput of 12800 IOPS/192 MBps (note that the MBps supported by the DSv3 instance type is lower than that supported by the P30 disks).
Since Standard_D8s_v3 is an old-generation instance type no longer available for new machines, and newer instance types with better performance have been released, using a newer instance type is recommended (for example, at the time of writing, the Standard_D8s_v6 instances, which support 12800 IOPS/424 MBps for uncached writes). Note that disks faster than P30 may be needed to take advantage of the 424 MBps supported by newer instance types.

ADDITIONAL NOTES:

  • Disk performance can vary between Azure regions, as shown in Azure Disk performance by region (note the data there is not up to date), and in some cases the minimum recommendation above might not be enough for good disk performance (including etcd performance). In those cases, check the etcd metrics, and contact Azure Support about the disk performance if the metric values do not match the documented performance of the instance and disk. It is also possible to follow the "Diagnostic Steps" section of that linked solution to test the performance in a specific Azure region, but note that several tests should be run, not just one.

  • For other OpenShift components, like for example StorageClasses or OpenShift Data Foundation, it could be beneficial to use more performant disks like Premium SSD v2 disks or Ultra Disks. Refer to Azure managed disk types for the limitations of each kind of disk (for example, Premium SSD v2 disks and Ultra Disks cannot be used as OS disks).
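The etcd metrics mentioned in the first note can be checked in the cluster's Prometheus. The queries below are a sketch using the well-known etcd disk latency histograms; the thresholds in the comments follow common etcd guidance and are not Azure-specific figures.

```promql
# 99th percentile of etcd WAL fsync latency over 5 minutes; sustained values
# above ~10 ms usually indicate the disk is throttled or undersized.
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# 99th percentile of etcd backend commit latency; the usual target is ~25 ms.
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
```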

Root Cause

In Azure, disk performance depends directly on SSD disk size and Service Quotas, which are layered; the effective limit is always the lower of the quotas involved. When limits are exceeded, the service itself (storage, compute, networking, etc.) throttles the offending entity.

Many of the observed issues are related to clusters reaching disk or VM I/O limits and saturation, which leads to Azure throttling. The OS disk then becomes slow or unresponsive from the kernel's perspective (blocked on IOWait), leading to failures in fstat() and to different parts of the cluster stack failing or degrading, sometimes in the most surprising ways.

Because of the Service Quotas layering mentioned before, it is very important to align the IOPS numbers correctly between the VM OS disk quotas and the quotas of the attached disk. The recommendations for control plane VM sizes and disk sizes take into account the hardware requirements for disks in etcd, with a target of 5000 concurrent (likely ~500 sequential) IOPS.
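The layering can be made concrete with a quick calculation: the effective ceiling is the lower value at each layer. The numbers below are the P30 and Standard_D8s_v3 figures from the Resolution section; this is only an illustrative sketch, not a sizing tool.

```shell
# Sketch: the effective I/O ceiling is the lower of the two quota layers,
# the attached disk limits and the VM's uncached throughput limits.
disk_iops=5000;  disk_mbps=200   # P30 (1 TB Premium SSD)
vm_iops=12800;   vm_mbps=192     # Standard_D8s_v3 uncached write limits

eff_iops=$(( disk_iops < vm_iops ? disk_iops : vm_iops ))
eff_mbps=$(( disk_mbps < vm_mbps ? disk_mbps : vm_mbps ))

# Here the disk caps IOPS (5000) while the VM caps throughput (192 MBps):
echo "effective limits: ${eff_iops} IOPS / ${eff_mbps} MBps"
```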

Diagnostic Steps

After some time, the cluster becomes very slow and many operators start to become unhealthy, with errors such as the following in the etcd logs:

W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/scheduler-kubeconfig\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-apiserver-operator/kube-apiserver-operator-lock\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/knative-serving/hpaautoscaler\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/operators.coreos.com/clusterserviceversions/openshift-operators/nfd.4.4.0-202005252114\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/eventing.knative.dev/triggers\" range_end:\"/kubernetes.io/eventing.knative.dev/triggert\" count_only:true " with result "range_response_count:0 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-cloud-credential-operator/cloud-credential-operator-leader\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-config-managed/\" range_end:\"/kubernetes.io/secrets/openshift-config-managed0\" " with result "range_response_count:27 size:*" took too long to execute
I | embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
etcdserver: failed to send out heartbeat on time
etcdserver: server is likely overloaded
wal: sync duration of xxxx s, expected less than 1s
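A quick way to check for these symptoms is to grep the etcd logs for the recurring phrases above. The sketch below uses a couple of hard-coded sample lines so it is self-contained; on a live cluster the input would come from an oc logs invocation like the one in the comment.

```shell
# Sketch: scan etcd logs for the slow-disk symptoms shown above.
# On a live cluster the input would come from something like:
#   oc logs -n openshift-etcd -l app=etcd -c etcd --tail=-1
# Here sample lines stand in for the real log stream.
matches=$(printf '%s\n' \
  'W | etcdserver: read-only range request "key:..." took too long to execute' \
  'I | embed: rejected connection from "10.0.0.5:43210" (error "EOF", ServerName "")' \
  'W | etcdserver: failed to send out heartbeat on time' |
  grep -cE 'took too long|failed to send out heartbeat|server is likely overloaded|sync duration')

# A steadily growing count of such lines points at disk latency problems.
echo "suspicious etcd log lines: ${matches}"
```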

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.