Logging Collector Pods being restarted with OOMKill in RHOCP 4


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Logging (RHOL)
    • 6
  • Vector
  • Collector pods

Issue

  • Logging collector pods are experiencing CrashLoopBackOff issue.
  • Multiple logging collector pods are in CrashLoopBackOff state.
  • The number of pods in CrashLoopBackOff state is increasing.
  • The Red Hat Logging stack was just configured and the collector pods are in CrashLoopBackOff.

Resolution

Delays or failures delivering the logs to the destination

Check whether the collector pods report any errors:

  1. Go to the OpenShift Console > Observe > Dashboards > Dashboard: Logging / Collection and review the Total errors last 60m panel for errors that could explain the back pressure

  2. Review the collector logs themselves for errors that could indicate a problem delivering the logs to the destinations

    $ for pod in $(oc get pods -l app.kubernetes.io/component=collector -o name -n <namespace>); do oc logs $pod -n <namespace> | grep -i error; done
    

If errors indicate failures delivering the logs, fix those first.
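For pods already in CrashLoopBackOff, the current container may not hold the relevant messages; the logs of the previously terminated container can be checked as well. A sketch, reusing the collector label from the command above (adjust the namespace to your environment):

```shell
# Inspect the logs of the previous (crashed) container instance of each
# collector pod and keep only the error lines.
for pod in $(oc get pods -l app.kubernetes.io/component=collector -o name -n openshift-logging); do
  echo "== $pod =="
  oc logs --previous $pod -n openshift-logging 2>/dev/null | grep -i error
done
```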

Normal activity, depending on the number of logs to process and send and the filters applied

Depending on the number of logs produced and the filters applied, the memory and CPU usage can differ from one collector to another. If the memory limit is hit, causing the pod to go into CrashLoopBackOff due to OOMKill, then unless the number of logs read and filtered can be reduced, the solution is to increase limits.memory.

To increase limits.memory, follow the Red Hat Documentation Section "Configure log collector CPU and memory limits".

Note: when the collector pods are configured for the first time, they need to read all the logs already available on the system, possibly from days ago. This generates extra pressure on memory and CPU that subsides once the backlog is read and only newly produced logs need to be processed. After this initial peak, consider reducing limits.cpu and limits.memory once the normal CPU and memory usage has been verified as indicated in the Red Hat Knowledge Article "Troubleshooting the Vector collector in RHOCP 4".
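As one possible way to apply that change, the memory limit can be raised on the ClusterLogForwarder resource. This is a sketch that assumes a ClusterLogForwarder named logging-collector in the openshift-logging namespace, with spec.collector.resources as the tuning point; verify the exact resource name and field path against the documentation section referenced above:

```shell
# Raise the collector memory limit to 4Gi (example value); the collector
# pods are redeployed with the new limit.
oc -n openshift-logging patch clusterlogforwarder logging-collector \
  --type merge \
  -p '{"spec":{"collector":{"resources":{"limits":{"memory":"4Gi"}}}}}'
```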

Root Cause

The collectors' memory usage can increase due to:

  • slowness delivering the logs, causing memory back pressure on the collector
  • an interruption in delivering the logs, causing back pressure on the collector that temporarily increases the memory and CPU usage
  • normal activity, depending on the number of logs to process and send and the filters applied. When the collectors start for the first time, they read all the logs already available on the system

If the OpenShift administrator has not set limits and/or requests for the collector, starting in RHOL v6 it runs with the following default requests and limits:

      "resources": {
        "limits": {
          "cpu": "6",
          "memory": "2Gi"
        },
        "requests": {
          "cpu": "500m",
          "memory": "64Mi"

Diagnostic Steps

  1. Set the environment variables

    $ cr="logging-collector"
    $ ns="openshift-logging"
    
  2. Verify that some collector pods are in CrashLoopBackOff or that the number of RESTARTS is not 0

    $ oc get pods -l app.kubernetes.io/instance=$cr -n $ns
    NAME                      READY   STATUS             RESTARTS   AGE
    logging-collector-5k2v6   0/1     OOMKilled          4          3m
    logging-collector-5w845   1/1     Running            0          3m
    logging-collector-7ndr2   0/1     CrashLoopBackOff   4          3m
    logging-collector-8tnkc   0/1     CrashLoopBackOff   4          3m
    logging-collector-9frsc   1/1     Running            0          3m
    logging-collector-gjxdw   0/1     CrashLoopBackOff   4          3m
    logging-collector-j6sq5   1/1     Running            0          3m
    logging-collector-pl5sw   1/1     Running            0          3m
    logging-collector-prdcj   1/1     Running            0          3m
    logging-collector-sxbtr   0/1     OOMKilled          4          3m
    logging-collector-xpv98   1/1     Running            0          3m
    logging-collector-xr5k2   0/1     OOMKilled          4          3m
    logging-collector-z6n6s   1/1     Running            0          3m
    
  3. Verify that the pods are in CrashLoopBackOff because they were OOMKilled

    $ oc get pods -l app.kubernetes.io/instance=$cr -n $ns -o yaml | grep OOMKill
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
          reason: OOMKilled
    
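The same check can be done per pod with a jsonpath expression, listing each pod together with the last termination reason of its container (a sketch; the variables follow step 1):

```shell
# List each collector pod with its container's last termination reason
# (if any); OOMKilled confirms the memory limit was hit.
oc get pods -l app.kubernetes.io/instance=$cr -n $ns \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```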

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.