Logging Collector Pods being restarted with OOMKill in RHOCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Logging (RHOL) 6
- Vector
- Collector pods
Issue
- Logging collector pods are experiencing a CrashLoopBackOff issue.
- Multiple logging collector pods are in CrashLoopBackOff state.
- The number of pods in CrashLoopBackOff state is increasing.
- The Red Hat Logging stack was just configured and the collector pods are in CrashLoopBackOff.
Resolution
Delays or failures delivering the logs to the destination
Review whether any errors are present in the collector pods:
- Go to the OpenShift Console > Observe > Dashboards > Dashboard: Logging / Collection and review the "Total errors last 60m" panel for errors that could explain the back pressure.
- Review the collector logs themselves for errors that could indicate an issue delivering the logs to the destinations:
$ for pod in $(oc get pods -l app.kubernetes.io/component=collector -o name -n <namespace>); do oc logs $pod -n <namespace> | grep -i error; done
If errors exist that indicate failures delivering the logs, fix those first.
As part of normal activity, memory and CPU usage can differ from one collector to another depending on the number of logs to process and send and the filters applied. If the memory limit is hit, causing the pod to go into CrashLoopBackOff due to OOMKill, the solution is to increase limits.memory, unless the number of logs read and filtered can be reduced.
To increase limits.memory, follow the Red Hat Documentation section "Configure log collector CPU and memory limits".
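As a minimal sketch, assuming the RHOL 6 ClusterLogForwarder API (observability.openshift.io/v1), an instance named collector (hypothetical name) in the openshift-logging namespace, and a target of 4Gi, the limits can be raised under spec.collector.resources:

```yaml
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector            # hypothetical instance name
  namespace: openshift-logging
spec:
  collector:
    resources:
      limits:
        cpu: "6"
        memory: 4Gi          # raised from the 2Gi default; adjust to observed usage
      requests:
        cpu: 500m
        memory: 64Mi
```

Adjust the names and values to match the ClusterLogForwarder actually deployed in the cluster.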
Note: when the collector pods are configured for the first time, they need to read all the logs already available on the system, potentially from days ago. This can generate extra memory and CPU pressure that subsides once this backlog is read and only the currently produced logs remain to be processed. After this peak in memory and CPU usage, consider reducing limits.cpu and limits.memory, first verifying the normal CPU and memory usage as indicated in the Red Hat Knowledge Article "Troubleshooting the Vector collector in RHOCP 4".
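To get a rough baseline for the normal memory usage before lowering the limits, the metrics from `oc adm top pods` can be summarized. The pod names and figures below are a hypothetical sample; with cluster access, pipe the live command output instead:

```shell
# With cluster access, the live data would come from:
#   oc adm top pods -l app.kubernetes.io/component=collector -n openshift-logging
# The snapshot below is hypothetical sample output.
sample='NAME CPU(cores) MEMORY(bytes)
logging-collector-5w845 120m 310Mi
logging-collector-9frsc 95m 285Mi
logging-collector-j6sq5 140m 352Mi'

# Print the highest memory usage seen, a rough baseline for sizing limits.memory
echo "$sample" | awk 'NR>1 {gsub(/Mi/,"",$3); if ($3+0 > max) max=$3+0} END {print max "Mi"}'
```

The peak figure (plus headroom for bursts and back pressure) is a reasonable starting point for limits.memory.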
Root Cause
The collectors' memory usage can be driven by:
- slowness delivering the logs, causing memory back pressure on the collector
- an interruption in delivering the logs, causing back pressure that temporarily increases memory and CPU usage
- normal activity, depending on the number of logs to process and send and the filters applied. If the collectors are started for the first time, they start by reading all the logs available on the system.
If the OpenShift Admin has not set limits and/or requests for the collector, starting in RHOL v6 it has the following default requests and limits:
"resources": {
"limits": {
"cpu": "6",
"memory": "2Gi"
},
"requests": {
"cpu": "500m",
"memory": "64Mi"
Diagnostic Steps
- Set the environment variables:
$ cr="logging-collector"
$ ns="openshift-logging"
- Verify that some collector pods are in CrashLoopBackOff or the number of RESTARTS is not equal to 0:
$ oc get pods -l app.kubernetes.io/instance=$cr -n $ns
NAME                      READY   STATUS             RESTARTS   AGE
logging-collector-5k2v6   0/1     OOMKilled          4          3m
logging-collector-5w845   1/1     Running            0          3m
logging-collector-7ndr2   0/1     CrashLoopBackOff   4          3m
logging-collector-8tnkc   0/1     CrashLoopBackOff   4          3m
logging-collector-9frsc   1/1     Running            0          3m
logging-collector-gjxdw   0/1     CrashLoopBackOff   4          3m
logging-collector-j6sq5   1/1     Running            0          3m
logging-collector-pl5sw   1/1     Running            0          3m
logging-collector-prdcj   1/1     Running            0          3m
logging-collector-sxbtr   0/1     OOMKilled          4          3m
logging-collector-xpv98   1/1     Running            0          3m
logging-collector-xr5k2   0/1     OOMKilled          4          3m
logging-collector-z6n6s   1/1     Running            0          3m
- Verify that the reason for being in CrashLoopBackOff is that they were OOMKilled:
$ oc get pods -l app.kubernetes.io/instance=$cr -n $ns -o yaml | grep OOMKill
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
      reason: OOMKilled
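The pod check above can also be condensed into a one-line count of unhealthy collectors. The captured listing below is hypothetical sample output; with cluster access, pipe the live `oc get pods` output instead:

```shell
# Hypothetical capture of: oc get pods -l app.kubernetes.io/instance=$cr -n $ns
# With cluster access, replace the sample with the live command output.
sample='NAME READY STATUS RESTARTS AGE
logging-collector-5k2v6 0/1 OOMKilled 4 3m
logging-collector-5w845 1/1 Running 0 3m
logging-collector-7ndr2 0/1 CrashLoopBackOff 4 3m
logging-collector-8tnkc 0/1 CrashLoopBackOff 4 3m'

# Count collector pods that are not Running (OOMKilled or CrashLoopBackOff)
echo "$sample" | awk 'NR>1 && $3 != "Running" {n++} END {print n+0}'
```

A non-zero and growing count over repeated runs confirms the symptom described in the Issue section.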
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.