Elasticsearch shows duplicated logs and poor performance in OpenShift Container Platform
Environment
- OpenShift Container Platform
- logging-fluentd images
- 3.4.1
- 3.5.0
Issue
- In Kibana, an unusually high number of log entries is sometimes visible for a specific period of time, especially in the .operations indices.
- Duplicated log entries are visible over time.
- After the cluster has gone through a period of stress, Elasticsearch takes a lot of time to recover and indices size grow considerably
- In the fluentd pod logs, timeout messages such as the following appear frequently:
2017-09-25 16:07:41 +0200 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:13c4370" elapsed_time=36.999487413 slow_flush_log_threshold=20.0
2017-09-25 16:07:38 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2017-09-25 16:23:59 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-25 16:20:37 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Could not push logs to Elasticsearch after 2 retries. read timeout reached" plugin_id="object:13c4370"
Resolution
This problem has been identified and resolved in BZ#1497836.
It is recommended to update to the latest logging image version available for your OpenShift Container Platform version.
WORKAROUND
A manual workaround can be applied by creating a file called patch-fluentd.sh, which edits the logging-fluentd configmap:
#!/bin/bash
mkdir fluentd-remediation
cd fluentd-remediation
oc export cm/logging-fluentd > logging-fluentd-cm.yaml
anypod=$(oc get po --selector=component=fluentd --no-headers -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | head -n1)
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-es-ops-config.conf > output-es-ops-config.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-es-config.conf > output-es-config.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-operations.conf > output-operations.conf
oc exec $anypod -- cat /etc/fluent/configs.d/openshift/output-applications.conf > output-applications.conf
#back up the original files (the original `find .` loop also matched directories)
for file in *.conf *.yaml; do cp "$file" "$file.orig"; done
#create patch
cat <<EOF > fluentd-es.patch
flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
cat <<EOF > fluentd-ops.patch
flush_interval "#{ENV['OPS_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['OPS_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
#fix output-es-*.conf
sed -e '/.*flush_interval.*/d' -i output-es-config.conf
sed -e '/.*flush_interval.*/d' -i output-es-ops-config.conf
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-es.patch' -e 'd' -e '}' -i output-es-config.conf
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-ops.patch' -e 'd' -e '}' -i output-es-ops-config.conf
#add content to output-applications/operations.conf
sed -e '/.*output-es-ops-config.conf.*/ {' -e 'r output-es-ops-config.conf' -e 'd' -e '}' -i output-operations.conf
sed -e '/.*output-es-config.conf.*/ {' -e 'r output-es-config.conf' -e 'd' -e '}' -i output-applications.conf
#add indentation
sed 's/^/ /' -i output-applications.conf
sed 's/^/ /' -i output-operations.conf
#add content to configmap
sed -e '/.*output-operations.conf.*/ {' -e 'r output-operations.conf' -e 'd' -e '}' -i logging-fluentd-cm.yaml
sed -e '/.*output-applications.conf.*/ {' -e 'r output-applications.conf' -e 'd' -e '}' -i logging-fluentd-cm.yaml
#replace old configmap
oc replace cm/logging-fluentd -f logging-fluentd-cm.yaml
And execute the script:
$ oc project logging
$ chmod +x patch-fluentd.sh
$ ./patch-fluentd.sh
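To see what the sed edits in the script do without touching a cluster, the following self-contained sketch applies the same delete-and-replace commands to a hypothetical sample fragment of output-es-config.conf (the sample content is illustrative, not the real plugin configuration):

```shell
# Hypothetical sample of an output-es-config.conf fragment (illustration only).
cat > sample-es-config.conf <<'EOF'
host "#{ENV['ES_HOST']}"
flush_interval 5s
max_retry_wait 300
EOF
# Same patch content the script generates.
cat > fluentd-es.patch <<'EOF'
flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"
max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
buffer_type memory
buffer_queue_full_action block
request_timeout 600
EOF
# Delete the hard-coded flush_interval line...
sed -e '/.*flush_interval.*/d' -i sample-es-config.conf
# ...then replace the max_retry_wait line with the whole patch file.
sed -e '/.*max_retry_wait./ {' -e 'r fluentd-es.patch' -e 'd' -e '}' -i sample-es-config.conf
# The fragment now contains the configurable settings from the patch.
cat sample-es-config.conf
```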
The script sets request_timeout to 600, sets buffer_queue_full_action to block, and makes flush_interval configurable through environment variables.
After that, it is enough to set the desired flush_interval for the ES and ES-OPS configurations with the following command (here 1 second is the desired value):
$ oc env daemonset/logging-fluentd ES_FLUSH_INTERVAL=1s OPS_FLUSH_INTERVAL=1s
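The `"#{ENV['ES_FLUSH_INTERVAL'] || '5s'}"` expressions written into the configmap by the patch are embedded Ruby that fluentd evaluates at startup: if the variable is set on the daemonset it wins, otherwise the default applies. As an illustration only (this is not fluentd code), the shell analogue of that fallback is:

```shell
# Shell analogue of fluentd's "#{ENV['ES_FLUSH_INTERVAL'] || '5s'}" fallback.
unset ES_FLUSH_INTERVAL
echo "flush_interval ${ES_FLUSH_INTERVAL:-5s}"   # variable unset: the default applies

export ES_FLUSH_INTERVAL=1s
echo "flush_interval ${ES_FLUSH_INTERVAL:-5s}"   # variable set: the env value wins
```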
Finally, deleting the fluentd pods will get them recreated with the new configuration from both the configmap and the daemonset.
$ oc delete pod -l component=fluentd
Root Cause
A flush_interval of 5 seconds caused large buffers to be sent to Elasticsearch to be persisted and indexed. Elasticsearch might need more than 5 seconds (the request_timeout) for that task, causing Fluentd to treat the request as a timeout and retry with the same buffer.
The result of this cycle is duplicated logs and an ever-growing amount of data to be processed.
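The cycle above can be sketched with a toy shell illustration (plain string handling, not actual fluentd behavior): a flush whose response times out is retried with the same buffer, so the records in it get indexed twice while new records keep arriving:

```shell
# Toy model of the retry cycle (illustration only, not fluentd code).
buffer="rec1 rec2 rec3"        # records accumulated during one flush_interval
indexed=""

# First flush: Elasticsearch indexes the batch, but indexing takes longer
# than request_timeout, so fluentd sees a read timeout and keeps the buffer.
indexed="$indexed $buffer"

# Retry: the SAME buffer is sent again, now grown with newly arrived records.
buffer="$buffer rec4 rec5"
indexed="$indexed $buffer"

# rec1..rec3 now appear twice in the index, and the buffer keeps growing.
echo "indexed:$indexed"
```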
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.