OpenShift 4 best practices for performance
Best practices
Best practices should be followed at all times, and regular checks should be in place to enforce them.
Setup
The most important part of any set of best practices is the initial evaluation and proper setup of the cluster. Without solid foundations and stability in the core components, it doesn't matter how well the software was developed; it will be running on unstable ground.
IMPORTANT: most of the documentation talks about basic setup and minimal requirements. These are usually not fit for enterprise-grade production environments, where even 5 minutes of downtime matters; they are rather aimed at trial or development scenarios.
Never go with the minimum requirements if your cluster and production matter.
If you're planning for bigger loads in the future, you should prepare for it as soon as possible. A change of infrastructure is not trivial, and should not be needed if the initial setup is done right.
The most important component of a cluster is the control plane, where the API server and ETCD run.
CPU and RAM
It's never a good idea to allocate only the minimum resources (such as 4 CPUs). Always test your application/operator and add CPU and RAM according to the size of the cluster and the workload running there. Running many apps/operators also puts a bigger load on monitoring, which may then require more CPU, RAM, and storage.
Networking
The network should have stable latency, with no dropped packets or errors. When using SAN storage, higher network latency also means slower storage performance, so this is doubly important.
Monitoring
You should monitor your cluster and watch for any values that exceed the allowed thresholds (see: How to graph ETCD metrics).
IMPORTANT: any peaks above the thresholds should be of concern (in production) as they can cause instability, and the cluster not being available for several minutes.
Other things to monitor include:
- Restarts of pods
- Networking (dropped or missed packets, latency issues and errors)
- Storage performance
- Number of objects in ETCD
- Alarms from Prometheus
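As a sketch of the ETCD side of this list, assuming the cluster's built-in Prometheus is used, queries along these lines can be graphed or alerted on (the thresholds in the comments are commonly cited etcd guidance, not hard limits):

```promql
# 99th percentile ETCD WAL fsync time; sustained values over ~10ms
# suggest the control-plane disks are too slow.
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# 99th percentile backend commit time; should stay below ~25ms.
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# Leader changes over the last hour; frequent changes indicate
# network or disk trouble on the control plane.
increase(etcd_server_leader_changes_seen_total[1h])
```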
Testing real load
Even though OpenShift and its operators are generally well tested, those tests cannot predict every type of load they will need to handle, so you should thoroughly test any third-party application or operator before deploying it in production.
IMPORTANT: Running demos with basic load doesn't reflect the performance of production under heavy load. Always perform stress tests on the cluster before going into production!
Tweaking and remediation
Some operators can be tweaked, but some don't allow any changes to their parameters, or allow only minimal modifications (like etcd). Always check the documentation.
One example is the ability to disable OLM's copied CSVs, which helps with performance on large clusters with many namespaces.
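As an illustrative sketch (the feature is only available in newer OLM versions; check your OpenShift release documentation before applying), copied CSVs can be disabled via the cluster-scoped OLMConfig resource:

```yaml
# Disables copied CSVs cluster-wide; verify availability in your
# OpenShift/OLM version before applying.
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    disableCopiedCSVs: true
```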
Operators
The Operator is a piece of software running in a Pod on the cluster, interacting with the Kubernetes API server.
The following points should also be considered when creating your own operators:
What is the Operator doing?
The main concern should be what the operator does and what overhead it brings. Operators that make many API calls, scan files, or generate heavy I/O or traffic can have a big performance impact on storage, CPU, and network. Make sure you understand what the Operator is doing, how it affects overall performance, and how you can tweak it to avoid such issues. Some resource-hungry operators might require adding extra CPU and RAM to your master nodes.
Where does the operator run?
If the operator also runs on the control plane nodes (antivirus, file or compliance scanner), it could have a performance impact on the control plane's storage (and therefore an impact on ETCD and overall cluster performance).
Does the Operator/Pipeline/deployment create any resources that are not automatically removed when the Operator is uninstalled?
Not all resources are removed on uninstall, so some manual cleanup is needed. It is very important to clean up unused resources, which may be referencing other unneeded resources such as images or secrets.
Examples of cleanup include:
- Pruning of groups, Deployments and Builds
- Secrets (or other CRDs) cleanup
Any number of objects of a single type (Secrets, ReplicaSets, etc.) above 8000 could cause performance issues on storage that doesn't have enough IOPS. Check the cluster sizing:
$ oc project openshift-etcd
$ oc get pods
$ oc rsh <etcd pod>
> etcdctl get / --prefix --keys-only | sed '/^$/d' | cut -d/ -f3 | sort | uniq -c | sort -rn
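The pipeline above counts objects per etcd key prefix (i.e., per resource type). To see what it does without a live etcd, here is the same pipeline run on a few fabricated key paths in the layout etcdctl returns (/registry/<resource>/<namespace>/<name>):

```shell
# Fabricated sample keys standing in for `etcdctl get / --prefix --keys-only`.
printf '%s\n' \
  /registry/secrets/default/app-token \
  /registry/secrets/default/db-creds \
  /registry/pods/default/web-1 \
  | sed '/^$/d' | cut -d/ -f3 | sort | uniq -c | sort -rn
# Prints "2 secrets" above "1 pods": the count of each resource type,
# highest first.
```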
Any non-OpenShift namespace with 20+ Secrets should be reviewed and cleaned up (unless there is a specific need for that many). The same check can be applied to other resource types, such as Deployments.
$ oc get secrets -A --no-headers | awk '{ns[$1]++}END{for (i in ns) print i,ns[i]}'
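To see what the awk one-liner produces, here it is run on fabricated sample input shaped like `oc get secrets -A --no-headers` output (first column is the namespace; the namespace names are made up):

```shell
# Fabricated `oc get secrets -A --no-headers` output:
# NAMESPACE NAME TYPE DATA AGE.
printf '%s\n' \
  'team-a  db-creds   Opaque             1  2d' \
  'team-a  api-token  Opaque             1  2d' \
  'team-b  tls-cert   kubernetes.io/tls  2  5d' \
  | awk '{ns[$1]++} END {for (i in ns) print i, ns[i]}'
# Prints one line per namespace with its secret count:
# "team-a 2" and "team-b 1" (order not guaranteed).
```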
If you run any pipeline that creates CRDs, make sure that the pipeline also cleans up those CRDs.
Namespaces
By default, there are actually three namespaces that Kubernetes creates out of the box: "default", "kube-system" (used for Kubernetes components), and "kube-public" (used for public resources).
"kube-public" isn’t really used for much at the moment, and it’s usually a good idea to not modify anything in "kube-system".
On OpenShift we have several "openshift-xxxx" namespaces used for cluster components.
This leaves the "default" Namespace as the place where your services and apps are created.
This Namespace is set up out of the box and can't be deleted. While it is great for getting started and for smaller production systems, it is not recommended for large production systems, because it is very easy for a team to accidentally overwrite or disrupt another service without even realizing it. Instead, create multiple namespaces and use them to segment your services into manageable chunks.
Creating many Namespaces doesn't create any performance issues, and in many cases can actually improve performance as the Kubernetes API will have a smaller set of objects to work with.
Do not overload namespaces with multiple workloads that perform unrelated tasks. Keep your namespaces clean and straightforward.
Resource Requests and Limits
Resource requests and limits are the operating parameters that you provide to Kubernetes; they tell it two critical things about your workload: what resources it requires to run properly, and the maximum resources it is allowed to consume. The first is a critical input to the scheduler, enabling it to choose the right node on which to run the pod. The second is important to the kubelet (the daemon on each node responsible for pod health), which enforces the limits.
resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 100Mi
Resource requests specify the amount of resources a container is guaranteed to get; resource limits specify the maximum amount it can use.
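For context, here is a minimal, illustrative Pod manifest showing where the resources block sits (the name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                       # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      requests:                           # guaranteed; used by the scheduler
        cpu: 50m
        memory: 50Mi
      limits:                             # hard cap; enforced by the kubelet
        cpu: 100m
        memory: 100Mi
```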
Limit Resource Usage for Specific Namespaces
You can limit resource usage per namespace by applying resource quotas (the ResourceQuota object). Each namespace can then have its own quota limiting the amount of resources it can consume from the nodes.
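A sketch of such a quota (the name, namespace, and amounts are illustrative; tune them to your workloads):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # placeholder name
  namespace: team-a          # placeholder namespace
spec:
  hard:
    requests.cpu: "4"        # total CPU all pods in the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"          # total CPU limit across the namespace
    limits.memory: 16Gi
    pods: "50"               # cap on the number of pods
```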
Deploy your pods as part of a Deployment, DaemonSet, ReplicaSet, or StatefulSet across nodes.
A single pod should never be run individually; pods should always be part of a Deployment, DaemonSet, ReplicaSet, or StatefulSet to improve fault tolerance. Pods can then be spread across nodes using anti-affinity rules in your Deployments, avoiding all pods running on a single node, which may cause downtime if that node becomes unavailable.
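An illustrative Deployment using pod anti-affinity to keep replicas on separate nodes (names, labels, and the image are placeholders; `preferredDuringScheduling...` can be used instead if strict spreading is not required):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                              # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Refuse to schedule two "app: web" pods on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: registry.example.com/web:1.0   # placeholder image
```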
NetworkPolicies
NetworkPolicies should be employed to restrict traffic between objects in the K8s cluster. By default, all containers can talk to each other in the network, something that presents a security risk if malicious actors gain access to a container, allowing them to traverse objects in the cluster. Network policies can control traffic at the IP and port level, similar to the concept of security groups in cloud platforms to restrict access to resources. Typically, all traffic should be denied by default, then allow rules should be put in place to allow required traffic.
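A sketch of the deny-by-default pattern described above (the namespace-less metadata and the `app` labels and port are placeholders):

```yaml
# Deny all ingress traffic to every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}            # selects all pods in the namespace
  policyTypes:
  - Ingress
---
# Then explicitly allow only the required traffic,
# e.g. web pods talking to db pods on port 5432.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-db      # placeholder name
spec:
  podSelector:
    matchLabels:
      app: db                # placeholder label
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web           # placeholder label
    ports:
    - protocol: TCP
      port: 5432             # placeholder port
```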
Configuration
- When defining configurations, specify the latest stable API version.
- Configuration files should be stored in version control before being pushed to the cluster. This allows you to quickly roll back a configuration change if necessary. It also aids cluster re-creation and restoration.
- Write your configuration files using YAML rather than JSON. Though these formats can be used interchangeably in almost all scenarios, YAML tends to be more user-friendly.
Additional links
OpenShift 4 Resources Configuration: Methodology and Tools
Capacity management and overcommitment best practices in Red Hat OpenShift
Recommended infrastructure practices
14 Best Practices for Developing Applications on OpenShift