Fluentd - Bulk index queue is full, retrying in OpenShift
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- Getting ElasticsearchErrorHandler BulkIndexQueueFull messages
- Fluentd won't push to Elasticsearch:
Aug 31 04:07:40 node.example.com dockerd-current[6060]: 2018-08-31 04:07:40 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-08-31 04:07:19 +0200 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:c7c4dc"
- Logs missing in Kibana
- "Buffer flush took longer time" messages:
2019-09-25 01:45:26 -0700 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:3ea119ad7d58" elapsed_time=22.682930512 slow_flush_log_threshold=20.0
Resolution
For OpenShift 4
- When using the OpenShift Logging log store, increase the memory requests for the log store by editing the ClusterLogging custom resource (CR) in the openshift-logging project:
$ oc edit ClusterLogging instance
[...]
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
[...]
spec:
  logStore:
    type: "elasticsearch"
    elasticsearch:
      resources:
        limits:
          cpu: "2"
          memory: "16Gi"   # HERE
        requests:
          cpu: "2"         # HERE
          memory: "16Gi"   # HERE
[...]
- If using an external log store via ClusterLogForwarder, review the external log store resources, the network bandwidth and connectivity, and any issue that could prevent Fluentd from sending logs to that log store. Note that Fluentd cannot send logs to any of the configured outputs when the buffer of one of them is full, as explained in the article Collector stops log forwarding when one of the outputs reaches the queue size limit in RHOCP 4.
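When an output stalls, the Fluentd file buffer on each collector grows on disk. As a quick spot-check (a sketch: the openshift-logging namespace, the component=fluentd label, and the /var/lib/fluentd buffer path match a default deployment but may differ in yours):

```shell
# Show how much data each Fluentd collector has queued on disk.
# Namespace, label, and buffer path are assumptions for a default setup.
for POD in $(oc get pods -n openshift-logging -l component=fluentd -o name); do
  echo "== ${POD}"
  oc exec -n openshift-logging "${POD}" -- du -sh /var/lib/fluentd
done
```

A buffer directory that keeps growing confirms that the output cannot keep up with the input.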
For OpenShift 3
- Change to the logging project. This could be either logging or openshift-logging. NOTE: This can vary in your setup, as older RHOCP 3.x versions allowed tweaking this parameter. The example here uses openshift-logging:
$ oc project openshift-logging
- Scale down each Elasticsearch DeploymentConfig (DC) to start the changes. NOTE: This will stop Elasticsearch from working. It can be avoided if your nodes have sufficient capacity to schedule both the old and the new pods at the same time:
$ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=0; done
- Set the resource limits/requests. NOTE: Memory limits and requests should match so that Elasticsearch gets guaranteed RAM from RHOCP at any given time:
$ for DC in $(oc get dc -l component=es -oname); do oc set resources "${DC}" -c=elasticsearch --limits=cpu=2,memory=16Gi --requests=cpu=2,memory=16Gi ; done
- Set the INSTANCE_RAM variable to the same value as the memory limits/requests:
$ for DC in $(oc get dc -l component=es -oname); do oc set env "${DC}" -c=elasticsearch INSTANCE_RAM=16Gi --overwrite ; done
- Check that the updated values were captured by describing the DeploymentConfigs:
$ for DC in $(oc get dc -l component=es -oname); do oc describe "${DC}"; done
- Roll out a new ReplicationController with the updated values:
$ for DC in $(oc get dc -l component=es -oname); do oc rollout latest "${DC}"; done
Scale up the ES pods:
$ for DC in $(oc get dc -l component=es -oname); do oc scale "${DC}" --replicas=1; done
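For reference, the steps above can be combined into a single loop (a sketch: it assumes the openshift-logging project and the example 2 CPU / 16Gi sizing; adjust to your environment, and note that it resizes and rolls out one DC at a time rather than scaling everything down first):

```shell
# Sketch: resize and roll out each Elasticsearch DC in turn.
# Project name and sizing values are the examples from the steps above.
oc project openshift-logging
for DC in $(oc get dc -l component=es -oname); do
  oc scale "${DC}" --replicas=0
  oc set resources "${DC}" -c=elasticsearch \
    --limits=cpu=2,memory=16Gi --requests=cpu=2,memory=16Gi
  oc set env "${DC}" -c=elasticsearch INSTANCE_RAM=16Gi --overwrite
  oc rollout latest "${DC}"
  oc scale "${DC}" --replicas=1
done
```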
Root Cause
- This is usually caused by Elasticsearch processing logs too slowly.
- Elasticsearch is a resource-intensive application and benefits from:
- Additional CPU (the recommended minimum is 2, but more may be needed, since the number of allocated CPUs determines the size of the Elasticsearch thread pools for read, write, bulk, search, and other operations)
- PVs with high I/O characteristics (NFS is not supported; dedicated block storage is recommended)
- An amount of memory proportional to the desired retention and the number of services on the OCP cluster
- We recommend a bare minimum of 16GiB. NOTE: Larger clusters might need up to 64GiB.
- The best way to improve Elasticsearch performance is to increase memory. This is done by tweaking resources.limits/requests (for RHOCP 3.x, the INSTANCE_RAM environment variable in the elasticsearch container also needs tweaking).
- The resources.requests.cpu for the elasticsearch container should be set to the real CPU usage of the Elasticsearch pods, following the article Elasticsearch alert AggregatedLoggingSystemCPUHigh in RHOCP 4. That article also applies to RHOCP 3.x.
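To size resources.requests.cpu from real usage, compare the configured requests with live consumption (a sketch: it requires cluster metrics to be available, and the namespace and component=elasticsearch label are assumptions for a default RHOCP 4 deployment):

```shell
# Live CPU/memory usage of the Elasticsearch pods:
oc adm top pods -n openshift-logging -l component=elasticsearch

# Currently configured CPU requests, for comparison:
oc get pods -n openshift-logging -l component=elasticsearch \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests.cpu}{"\n"}{end}'
```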
Diagnostic Steps
- Check the Elasticsearch logs for other explicit errors:
$ oc project openshift-logging
$ oc logs <elasticsearch pod>
- More logs can be gathered, including runtime logs, which are not available with the oc logs command.
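Beyond the pod logs, Elasticsearch's own health endpoint often shows why indexing is slow (for example, a red status or unassigned shards). A hedged example for RHOCP 4, where the es_util helper ships in the Elasticsearch image (on RHOCP 3.x, use curl with the admin certificates instead):

```shell
# Query cluster health from inside the first Elasticsearch pod.
# Namespace and label are assumptions for a default RHOCP 4 deployment.
POD=$(oc get pods -n openshift-logging -l component=elasticsearch -o name | head -n1)
oc exec -n openshift-logging "${POD}" -c elasticsearch -- es_util --query=_cluster/health?pretty
```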
- For RHOCP 3.x
- Review the KCS Logging dump script for RHOCP 3.x.
- In case support is needed, open a case and attach the result of the script from the above KCS.
- For RHOCP 4.x
- Review the KCS Logging must-gather for RHOCP 4.x.
- In case support is needed, open a case and attach the OpenShift Logging must-gather by following the KCS Creating must-gather with more details for specific components in OCP 4, section Data Collection for Red Hat OpenShift Logging.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.