Elasticsearch shows duplicated logs and poor performance in OpenShift Container Platform
Environment
- OpenShift Container Platform
- logging-fluentd images
- 3.4.1
- 3.5.0
Issue
- In Kibana, an unusually high number of log entries is sometimes visible for a specific period of time, especially in the .operations indices.
- Duplicated log entries are visible over time.
- After the cluster has gone through a period of stress, Elasticsearch takes a lot of time to recover and indices size grow considerably
- In the fluentd pod logs, timeout messages such as the following appear frequently:
2017-09-25 16:07:41 +0200 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:13c4370" elapsed_time=36.999487413 slow_flush_log_threshold=20.0
2017-09-25 16:07:38 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2017-09-25 16:23:59 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-25 16:20:37 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Could not push logs to Elasticsearch after 2 retries. read timeout reached" plugin_id="object:13c4370"
Resolution
This problem has been identified and resolved in BZ#1497836.
It is recommended to update to the latest logging image version available for your OpenShift Container Platform version.
WORKAROUND
A manual workaround can be applied by creating a file called patch-fluentd.sh, which edits the logging-fluentd configmap:
#!/bin/bash
mkdir fluentd-remediation
cd fluentd-remediation
oc export cm/logging-fluentd > logging-fluentd-cm.yaml
anypod=$(oc get po --selector=component=fluentd --no-headers -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | head -n1)
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-es-ops-config.conf > output-es-ops-config.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-es-config.conf > output-es-config.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-operations.conf > output-operations.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-applications.conf > output-applications.conf
#back up the original files (the original `find .` loop also matched directories)
for file in *.conf *.yaml; do cp "$file" "$file.orig"; done
#create patch
cat <<EOF > fluentd-es.patch
flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
cat <<EOF > fluentd-ops.patch
flush_interval "#{ENV['OPS_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['OPS_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
#fix output-es-*.conf
sed -e '/.*flush_interval.*/d' -i output-es-config.conf
sed -e '/.*flush_interval.*/d' -i output-es-ops-config.conf
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-es.patch' -e 'd' -e '}' -i output-es-config.conf
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-ops.patch' -e 'd' -e '}' -i output-es-ops-config.conf
#add content to output-applications/operations.conf
sed -e '/.*output-es-ops-config.conf.*/ {' -e 'r output-es-ops-config.conf' -e 'd' -e '}' -i output-operations.conf
sed -e '/.*output-es-config.conf.*/ {' -e 'r output-es-config.conf' -e 'd' -e '}' -i output-applications.conf
#add indentation
sed 's/^/ /' -i output-applications.conf
sed 's/^/ /' -i output-operations.conf
#add content to configmap
sed -e '/.*output-operations.conf.*/ {' -e 'r output-operations.conf' -e 'd' -e '}' -i logging-fluentd-cm.yaml
sed -e '/.*output-applications.conf.*/ {' -e 'r output-applications.conf' -e 'd' -e '}' -i logging-fluentd-cm.yaml
#replace old configmap
oc replace cm/logging-fluentd -f logging-fluentd-cm.yaml
And execute the script:
$ oc project logging
$ chmod +x patch-fluentd.sh
$ ./patch-fluentd.sh
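To see what the sed edits in the script do without touching a cluster, the following self-contained sketch applies the same delete-and-replace commands to a hypothetical sample fragment of output-es-config.conf (the sample content is illustrative, not the real plugin configuration):

```shell
# Hypothetical sample of an output-es-config.conf fragment (illustration only).
cat > sample-es-config.conf <<'EOF'
host "#{ENV['ES_HOST']}"
flush_interval 5s
max_retry_wait 300
EOF
# Same patch content the script generates.
cat > fluentd-es.patch <<'EOF'
flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
# Delete the hard-coded flush_interval line...
sed -e '/.*flush_interval.*/d' -i sample-es-config.conf
# ...then replace the max_retry_wait line with the whole patch file.
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-es.patch' -e 'd' -e '}' -i sample-es-config.conf
# The fragment now contains the configurable settings from the patch.
cat sample-es-config.conf
```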
The script sets request_timeout to 600, sets buffer_queue_full_action to block, and makes flush_interval configurable through environment variables.
After that, it is enough to set the desired flush_interval for the ES and ES-OPS configurations with the following command (here 1 second is the desired value):
$ oc env daemonset/logging-fluentd ES_FLUSH_INTERVAL=1s OPS_FLUSH_INTERVAL=1s
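The `"#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"` expressions written into the configmap by the patch are embedded Ruby that fluentd evaluates at startup: if the variable is set on the daemonset it wins, otherwise the default applies. As an illustration only (this is not fluentd code), the shell analogue of that fallback is:

```shell
# Shell analogue of fluentd's "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}" fallback.
unset ES_FLUSH_INTERVAL
echo "flush_interval ${ES_FLUSH_INTERVAL:-5s}"   # variable unset: the default applies

export ES_FLUSH_INTERVAL=1s
echo "flush_interval ${ES_FLUSH_INTERVAL:-5s}"   # variable set: the env value wins
```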
Finally, deleting the fluentd pods will get them recreated with the new configuration from both the configmap and the daemonset.
$ oc delete pod -l component=fluentd
Root Cause
A flush_interval of 5 seconds caused large buffers to be sent to Elasticsearch to be persisted and indexed. Elasticsearch might need more than 5 seconds (the request_timeout) for that task, causing Fluentd to treat the request as a timeout and retry with the same buffer.
The result of this cycle is duplicated logs and an ever-growing amount of data to be processed.
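The cycle above can be sketched with a toy shell illustration (plain string handling, not actual fluentd behavior): a flush whose response times out is retried with the same buffer, so the records in it get indexed twice while new records keep arriving:

```shell
# Toy model of the retry cycle (illustration only, not fluentd code).
buffer="rec1 rec2 rec3"        # records accumulated during one flush_interval
indexed=""

# First flush: Elasticsearch indexes the batch, but indexing takes longer
# than request_timeout, so fluentd sees a read timeout and keeps the buffer.
indexed="$indexed $buffer"

# Retry: the SAME buffer is sent again, now grown with newly arrived records.
buffer="$buffer rec4 rec5"
indexed="$indexed $buffer"

# rec1..rec3 now appear twice in the index, and the buffer keeps growing.
echo "indexed:$indexed"
```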
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.