Fluentd - Bulk index queue is full, retrying in OpenShift

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3
    • 4

Issue

  • Getting ElasticsearchErrorHandler BulkIndexQueueFull messages

  • Fluentd won't push to Elasticsearch

    Aug 31 04:07:40 node.example.com dockerd-current[6060]: 2018-08-31 04:07:40 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-08-31 04:07:19 +0200 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:c7c4dc"
    
  • Logs missing in Kibana

  • "buffer flush took longer time" warning messages in the Fluentd logs

    2019-09-25 01:45:26 -0700 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:3ea119ad7d58" elapsed_time=22.682930512 slow_flush_log_threshold=20.0
    

Resolution

For OpenShift 4

  • In RHOCP 4, the logging stack is managed by the Cluster Logging Operator: adjust the Elasticsearch CPU and memory through the resources field of the ClusterLogging custom resource instead of editing the deployments directly

For OpenShift 3

  • Switch to the logging project. Depending on the setup this is either logging or openshift-logging, as older RHOCP 3.x versions allowed customizing the project name. The examples below use openshift-logging:

    $ oc project openshift-logging
    
  • Scale down each Elasticsearch DeploymentConfig (DC) so the changes can be applied. NOTE: This stops Elasticsearch. It can be avoided if the nodes have sufficient capacity to schedule both the old and the new pods at the same time:

    $ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=0; done
    
  • Set the resource limits/requests. NOTE: Memory limits and requests must match so that Elasticsearch has guaranteed RAM from RHOCP at any given time:

    $ for DC in $(oc get dc -l component=es -oname); do oc set resources "${DC}" -c=elasticsearch --limits=cpu=2,memory=16Gi --requests=cpu=2,memory=16Gi ; done
    
  • Set the INSTANCE_RAM environment variable to the same value as the memory limits/requests:

    $ for DC in $(oc get dc -l component=es -oname); do oc set env "${DC}" -c=elasticsearch INSTANCE_RAM=16Gi --overwrite ; done
    
  • Verify that the updated values were captured by describing the DeploymentConfigs:

    $ for DC in $(oc get dc -l component=es -oname); do oc describe "${DC}"; done
    
  • Trigger a rollout so a new ReplicationController is created with the updated values:

    $ for DC in $(oc get dc -l component=es -oname); do oc rollout latest "${DC}"; done
    
  • Scale up the ES pods:

    $ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=1; done
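
The NOTE about matching memory values can be sketched as a local sanity check. The values below mirror the examples above; on a live cluster they would be read with oc get dc -l component=es -o jsonpath=... (the variable names here are illustrative):

```shell
# Sanity check: memory limits, requests, and INSTANCE_RAM must all
# match so Elasticsearch gets guaranteed RAM from RHOCP.
requests_memory="16Gi"
limits_memory="16Gi"
instance_ram="16Gi"

if [ "$requests_memory" = "$limits_memory" ] && [ "$limits_memory" = "$instance_ram" ]; then
    echo "OK: limits, requests and INSTANCE_RAM all match"
else
    echo "MISMATCH: align limits/requests/INSTANCE_RAM before rolling out" >&2
fi
```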
    

Root Cause

  • This is usually caused by Elasticsearch processing logs more slowly than Fluentd sends them, so the bulk index queue fills up faster than it drains

  • Elasticsearch is a resource-intensive application and benefits from:

    • Additional CPU (the recommendation is a minimum of 2, but more may be needed, since the number of allocated CPUs determines the size of the Elasticsearch thread pools for read, write, bulk, search, and other operations)
    • PVs with high I/O characteristics (NFS is not supported; dedicated block storage is recommended)
    • An amount of memory proportional to the desired retention and the number of services on the OCP cluster
  • We recommend a bare minimum of 16GiB of memory. NOTE: Larger clusters might need up to 64GiB

  • The most effective way to improve Elasticsearch performance is to increase memory. This is done by tweaking resources.limits/requests (on RHOCP 3.x, the INSTANCE_RAM environment variable in the elasticsearch container must also be updated to match)

  • The resources.requests.cpu for the elasticsearch container should be set to the real CPU usage of the Elasticsearch pods; see the article Elasticsearch alert AggregatedLoggingSystemCPUHigh in RHOCP 4, which also applies to RHOCP 3.x
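
The memory guidance above can be turned into a rough sizing rule: the logging Elasticsearch image gives the JVM roughly half of INSTANCE_RAM as heap, and the heap should stay at or below about 32GiB so the JVM can use compressed object pointers (which is why 64GiB is the practical ceiling mentioned above). The helper below is a hypothetical illustration of that rule, not a command from the product:

```shell
# Hypothetical sizing helper: heap = half of container memory, so
# 16Gi of INSTANCE_RAM yields an 8g heap, 64Gi yields 32g.
heap_from_instance_ram() {
    local ram_gib=$1            # container memory in GiB
    echo "$(( ram_gib / 2 ))g"  # heap = half of container memory
}

heap_from_instance_ram 16   # -> 8g
heap_from_instance_ram 64   # -> 32g
```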

Diagnostic Steps

  • Check the Elasticsearch logs for other explicit errors:

    $ oc project openshift-logging
    $ oc logs <elasticsearch pod> 
    
  • Additional logs, including runtime logs that are not available through the oc logs command, can be gathered from inside the Elasticsearch pods.
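
Thread-pool rejections are the most direct signal of a full bulk queue. The es_util helper in the Red Hat Elasticsearch image can query the _cat API (on newer Elasticsearch versions the pool is named write rather than bulk). The snippet below shows the query as a comment and then processes a sample output locally for illustration, since the real command needs a running cluster and the pod name is a placeholder:

```shell
# On a live cluster (pod name is a placeholder):
#   oc exec -c elasticsearch <elasticsearch pod> -- \
#       es_util --query='_cat/thread_pool?v&h=host,bulk.rejected'
#
# Sample output, processed locally; a non-zero rejected count means
# the bulk index queue filled up on that node:
sample='host        bulk.rejected
10.128.2.15 0
10.129.0.22 1375'

echo "$sample" | awk 'NR > 1 && $2 > 0 {print $1 " rejected " $2 " bulk requests"}'
# -> 10.129.0.22 rejected 1375 bulk requests
```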


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.