Fluentd - Bulk index queue is full, retrying in OpenShift
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- Getting ElasticsearchErrorHandler BulkIndexQueueFull messages
- Fluentd won't push to Elasticsearch:
Aug 31 04:07:40 node.example.com dockerd-current[6060]: 2018-08-31 04:07:40 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-08-31 04:07:19 +0200 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:c7c4dc"
- Logs missing in Kibana
- "Buffer flush took longer time" messages:
2019-09-25 01:45:26 -0700 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:3ea119ad7d58" elapsed_time=22.682930512 slow_flush_log_threshold=20.0
Resolution
For OpenShift 4
- When using the OpenShift Logging log store, increase the memory requests for the log store by editing the ClusterLogging custom resource (CR) in the openshift-logging project:
$ oc edit ClusterLogging instance
[...]
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
[...]
spec:
  logStore:
    type: "elasticsearch"
    elasticsearch:
      resources:
        limits:
          cpu: "2"
          memory: "16Gi"   # HERE
        requests:
          cpu: "2"         # HERE
          memory: "16Gi"   # HERE
[...]
- If using an external log store via ClusterLogForwarder, review the external log store resources, the network bandwidth and connectivity, and any issue that could prevent Fluentd from sending logs to that log store. Note that Fluentd cannot send logs to any of the configured outputs when the buffer of one of them is full, as explained in the article Collector stops log forwarding when one of the outputs reaches the queue size limit in RHOCP 4.
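When an output stalls, the Fluentd file buffer on each collector grows on disk. As a quick spot-check (a sketch: the openshift-logging namespace, the component=fluentd label, and the /var/lib/fluentd buffer path match a default deployment but may differ in yours):

```shell
# Show how much data each Fluentd collector has queued on disk.
# Namespace, label, and buffer path are assumptions for a default setup.
for POD in $(oc get pods -n openshift-logging -l component=fluentd -o name); do
  echo "== ${POD}"
  oc exec -n openshift-logging "${POD}" -- du -sh /var/lib/fluentd
done
```

A buffer directory that keeps growing confirms that the output cannot keep up with the input.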
For OpenShift 3
- Change to the logging project. This could be either logging or openshift-logging. NOTE: This can vary in your setup, as older RHOCP 3.x versions allowed tweaking this parameter. The example here uses openshift-logging:
$ oc project openshift-logging
- Scale down each Elasticsearch DeploymentConfig (DC) to start the changes. NOTE: This will stop Elasticsearch from working. It can be avoided if your nodes have sufficient capacity to schedule both the old and the new pods at the same time:
$ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=0; done
- Set the resource limits/requests. NOTE: Memory limits and requests should match so that Elasticsearch gets guaranteed RAM from RHOCP at any given time:
$ for DC in $(oc get dc -l component=es -oname); do oc set resources "${DC}" -c=elasticsearch --limits=cpu=2,memory=16Gi --requests=cpu=2,memory=16Gi ; done
- Set the INSTANCE_RAM variable to the same value as the memory limits/requests:
$ for DC in $(oc get dc -l component=es -oname); do oc set env "${DC}" -c=elasticsearch INSTANCE_RAM=16Gi --overwrite ; done
- Check that the updated values were captured by describing the DeploymentConfigs:
$ for DC in $(oc get dc -l component=es -oname); do oc describe "${DC}"; done
- Roll out a new ReplicationController with the updated values:
$ for DC in $(oc get dc -l component=es -oname); do oc rollout latest "${DC}"; done
Scale up the ES pods:
$ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=1; done
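For reference, the steps above can be combined into a single loop (a sketch: it assumes the openshift-logging project and the example 2 CPU / 16Gi sizing; adjust to your environment, and note that it resizes and rolls out one DC at a time rather than scaling everything down first):

```shell
# Sketch: resize and roll out each Elasticsearch DC in turn.
# Project name and sizing values are the examples from the steps above.
oc project openshift-logging
for DC in $(oc get dc -l component=es -oname); do
  oc scale "${DC}" --replicas=0
  oc set resources "${DC}" -c=elasticsearch \
    --limits=cpu=2,memory=16Gi --requests=cpu=2,memory=16Gi
  oc set env "${DC}" -c=elasticsearch INSTANCE_RAM=16Gi --overwrite
  oc rollout latest "${DC}"
  oc scale "${DC}" --replicas=1
done
```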
Root Cause
- This is usually caused by Elasticsearch processing logs too slowly.
- Elasticsearch is a resource-intensive application and benefits from:
- Additional CPU (the recommended minimum is 2, but more may be needed, since the number of allocated CPUs determines the size of the Elasticsearch thread pools for read, write, bulk, search, and other operations)
- PVs with high I/O characteristics (NFS is not supported; dedicated block storage is recommended)
- An amount of memory proportional to the desired retention and the number of services on the OCP cluster
- We recommend a bare minimum of 16GiB. NOTE: Larger clusters might need up to 64GiB.
- The best way to improve Elasticsearch performance is to increase memory. This is done by tweaking resources.limits/requests (for RHOCP 3.x, the INSTANCE_RAM environment variable in the elasticsearch container also needs tweaking).
- The resources.requests.cpu for the elasticsearch container should be set to the real CPU usage of the Elasticsearch pods, following the article Elasticsearch alert AggregatedLoggingSystemCPUHigh in RHOCP 4. That article also applies to RHOCP 3.x.
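To size resources.requests.cpu from real usage, compare the configured requests with live consumption (a sketch: it requires cluster metrics to be available, and the namespace and component=elasticsearch label are assumptions for a default RHOCP 4 deployment):

```shell
# Live CPU/memory usage of the Elasticsearch pods:
oc adm top pods -n openshift-logging -l component=elasticsearch

# Currently configured CPU requests, for comparison:
oc get pods -n openshift-logging -l component=elasticsearch \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests.cpu}{"\n"}{end}'
```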
Diagnostic Steps
- Check the Elasticsearch logs for other explicit errors:
$ oc project openshift-logging
$ oc logs <elasticsearch pod>
- More logs can be gathered, including runtime logs, which are not available with the oc logs command.
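Beyond the pod logs, Elasticsearch's own health endpoint often shows why indexing is slow (for example, a red status or unassigned shards). A hedged example for RHOCP 4, where the es_util helper ships in the Elasticsearch image (on RHOCP 3.x, use curl with the admin certificates instead):

```shell
# Query cluster health from inside the first Elasticsearch pod.
# Namespace and label are assumptions for a default RHOCP 4 deployment.
POD=$(oc get pods -n openshift-logging -l component=elasticsearch -o name | head -n1)
oc exec -n openshift-logging "${POD}" -c elasticsearch -- es_util --query=_cluster/health?pretty
```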
- For RHOCP 3.x
- Review the KCS Logging dump script for RHOCP 3.x.
- In case support is needed, open a case and attach the result of the script from the above KCS.
- For RHOCP 4.x
- Review the KCS Logging must-gather for RHOCP 4.x.
- In case support is needed, open a case and attach the OpenShift Logging must-gather by following the KCS Creating must-gather with more details for specific components in OCP 4, section Data Collection for Red Hat OpenShift Logging.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.