How to set QoS on DG 8 pods in OCP 4
Environment
- Red Hat OpenShift Container Platform (OCP)
- 4.x
- Red Hat Data Grid (RHDG)
- 8.x
Issue
- How to set QoS on DG 8 pods in OCP 4?
- Given the operator cannot edit the deployment (container) on the DG 8 pod, how to set Quality of Service for its pods?
Resolution
There are three levels of QoS for pods: Guaranteed, Burstable, and BestEffort.
The QoS level is a consequence of the memory/CPU settings on the custom resource; it is not set manually.
Therefore, with the DG 8 Operator one sets the memory and CPU on the custom resource spec; the kubelet uses those values when spawning the pods and derives their Quality of Service from them.
In other words, the QoS level is calculated from spec.container.cpu and spec.container.memory on the custom resource. See the examples below.
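As an illustration, a minimal Infinispan custom resource with these fields might look like the following sketch (the resource name and replica count are hypothetical, not taken from this solution):

```yaml
# Hypothetical minimal Infinispan CR; the operator derives the pod QoS
# from spec.container.cpu and spec.container.memory.
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan
spec:
  replicas: 2
  container:
    # A single value means request and limit are equal -> Guaranteed
    cpu: '2'
    memory: 3Gi
```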
| QoS | Definition | Example |
|---|---|---|
| Guaranteed | CPU and memory requests equal their limits | spec.container: cpu: '2', memory: 2Gi |
| Burstable | CPU and memory requests differ from their limits | spec.container: cpu: '2:1', memory: 2Gi:1Gi |
| BestEffort | No resource limits or requests set | N/A for DG pods |
Again, to reinforce the definitions:
- Guaranteed: Both limits and requests are set and have the same value.
- Burstable: Both limits and requests are set, but at different values.
- BestEffort: No resource limits or requests are set.
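The classification above can be sketched as a small decision function. This is an illustrative sketch of the rules, not the actual kubelet source; an empty string stands for an unset value:

```shell
# Sketch of how a pod's QoS class follows from its CPU/memory
# requests and limits ("" means the value is unset).
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort              # nothing is set at all
  elif [ -n "$cpu_lim" ] && [ "$cpu_req" = "$cpu_lim" ] &&
       [ -n "$mem_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed              # every request equals its limit
  else
    echo Burstable               # something is set, but not all equal
  fi
}

qos_class 2 2 2Gi 2Gi    # Guaranteed
qos_class 1 2 1Gi 2Gi    # Burstable
qos_class "" "" "" ""    # BestEffort
```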
Example - Setting as Guaranteed
One can set the values on the custom resource and then grep the QoS assigned to the pod at creation time: if the limit and the request are the same, the pod is classified as Guaranteed.
Given this:
spec:
  container:
    cpu: '2'
    memory: 3Gi
And the pod will report this:
$ oc get pod $pod-name -o json | grep qosClass
"qosClass": "Guaranteed"
Example - Setting as Burstable
One can set the values on the custom resource and then grep the QoS assigned to the pod at creation time: if the limit and the request differ, the pod is classified as Burstable.
Given this:
spec:
  container:
    cpu: '2:1'
    memory: 3Gi:2Gi
And the pod will report this:
$ oc get pod $podname -o json | grep qosClass
"qosClass": "Burstable",
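For reference, the '2:1' and '3Gi:2Gi' values use the operator's limit:request syntax. The container resources this would render into the resulting pod spec would look roughly like the following illustrative fragment (not operator output copied verbatim):

```yaml
# Illustrative fragment: how cpu: '2:1' and memory: '3Gi:2Gi'
# map to standard Kubernetes resources (limit first, request second).
resources:
  limits:
    cpu: '2'
    memory: 3Gi
  requests:
    cpu: '1'
    memory: 2Gi
```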
Different pods on the cluster have different QoS levels
As one can see, the cluster pod has Guaranteed, whereas the other pods have BestEffort. What explains this difference (some pods Guaranteed and others BestEffort, but none Burstable)?
That is because the custom resource setting above makes the CPU/memory requests and limits the same, therefore fulfilling the conditions for Guaranteed QoS:
"For every Container in the Pod, the CPU limit must equal the CPU request." - this is explained in the Kubernetes documentation.
The other pods (config listener, controller, and gossip router) have no resources set, so they are classified as BestEffort. For the gossip router and config listener pods this cannot be changed.
So the Operator is already doing this for the user, given the correct spec.container resources setting on the custom resource.
FAQ
Q1. What are the levels of QoS
A1. Guaranteed, Burstable, and BestEffort. Although the name says Guaranteed, under the right conditions the pod can still be killed.
Q2. How to set the QoS on the DG 8 pod?
A2. The same as for any pod: one sets the memory and CPU resources. For the DG 8 operator, this is done via spec.container.memory and spec.container.cpu. If the requests and limits are the same, the pod's QoS will be Guaranteed.
Q3. Setting a pod Guaranteed makes it indestructible?
A3. No. It makes the pod less likely to be deleted/restarted, but not indestructible. See the Root Cause section for the reason.
Q4. Even when CPU and memory are set equal, the config listener, controller, and gossip router pods have BestEffort. Why?
A4. Setting the CPU/memory resources on the custom resource only affects the cluster node pods; the other pods (config listener, gossip router, and controller) keep the defaults. One can set the controller pod's QoS as below, on the Subscription custom resource (which the OLM listens to):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: datagrid
spec:
  channel: 8.3.x
  installPlanApproval: Manual
  name: datagrid
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: datagrid-operator.v8.3.6
  config:
    resources:
      requests:
        memory: "128Mi"
        cpu: "500m"
      limits:
        memory: "128Mi"
        cpu: "500m"
Root Cause
QoS for the pods is calculated from spec.container.memory and spec.container.cpu set via the custom resource.
How this is implemented: when the operator creates the deployment, it can specify whatever parameters it wants for the operand deployment. The deployment spec contains a pod template, which is used when creating new pods for the deployment. On the custom resource one sets spec.container.memory and spec.container.cpu, and if requests and limits are the same, Kubernetes spawns the pod as Guaranteed. Since the operator does not set the memory and CPU requirements for the other pods, they are classified as BestEffort.
The OOM Killer is a kernel mechanism that monitors memory and terminates processes when memory is exhausted or close to exhaustion. To rank processes for killing, the kernel keeps a score, oom_score; the higher the score, the more likely the process is to be killed. The QoS class applies an offset to this score (oom_score_adj) so that BestEffort pods are killed first and Guaranteed pods last. Therefore making a pod Guaranteed does not make it indestructible; it only lowers the score.
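One can observe this scoring from inside a running container by reading the score files under /proc. The adjustment values mentioned in the comment are the typical kubelet defaults, stated here as an assumption rather than guaranteed output:

```shell
# Inspect the kernel's OOM score adjustment for the current process.
# The kubelet typically sets oom_score_adj to about -997 for Guaranteed
# pods, 1000 for BestEffort, and a value in between for Burstable.
cat /proc/self/oom_score_adj
```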
Also note, there is a difference between cgroup oomkills and system oomkills - those are not the same:
- System OOM kills: the host OS kernel kills processes on OCP nodes that are under memory pressure; also known as host-level OOM kills, because the pods are killed at the host OS level.
- cgroup OOM kills: terminate pods that exceed their cgroup memory limit.
Therefore, an OOM kill can happen on an OCP node that has no memory pressure at all, when a pod with high memory utilization exceeds its cgroup (pod) memory limit.
To investigate both kinds of OOM kills and other causes of crashes, see the solution Troubleshoot options for Data Grid pod crash.
Java as non-elastic process
Finally, there is one more reason why resource limits/requests should be set to the same value: this is a Java process, and Java processes are non-elastic. The JVM starts with its Xmx value and keeps it; Xmx cannot be changed at runtime (unless that were implemented in the operator via a listener). Likewise, the number of CPUs the JVM detects at startup cannot change on the fly, so Java is not "burstable" in the sense of starting with 1 CPU (and a certain number of threads) and later doubling that. Only a small set of JVM parameters can be changed on the fly; they are called manageable parameters, and CPU count and Xmx are not among them.
Diagnostic Steps
- To get the pod's current QoS:
oc get pod $pod-config-listener-id -o json | grep qosClass
- To see what is currently set: verify the custom resource and then the deployment.
- To investigate both OOM kill types (system and cgroup), see the dmesg logs:
oc debug node/<node_name>
dmesg | grep oom
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.