Collector stops forwarding logs when one of the outputs reaches the queue size limit in RHOCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Logging (RHOL) 5
- Fluentd
- Vector
Issue
- Fluentd doesn't send any more logs to any of the outputs when one of the outputs reaches the buffer size defined in totalLimitSize
- Fluentd stops reading logs and forwarding to all the defined outputs when one of the outputs reaches the buffer size defined in totalLimitSize
- Vector stops reading logs
Resolution
Red Hat is aware of this issue. It was tracked in Bug LOG-4535, which was closed as Won't Fix.
To avoid the collector becoming blocked from reading new logs, watch for the collector alerts in the OpenShift Console that indicate a problem between the collector and log delivery: FluentDHighErrorRate, FluentdQueueLengthIncreasing, or FluentDVeryHighErrorRate. Review and resolve these alerts before the buffer reaches the totalLimitSize.
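Buffer growth can also be watched directly under Observe > Metrics before any alert fires. A possible query is sketched below; the metric name is assumed from the fluentd Prometheus plugin that the collector exposes, so verify it exists in your cluster:

sum by (plugin_id)(fluentd_output_status_buffer_queue_length{plugin_id!~'object:.+'})

A queue length that grows steadily for one plugin_id while the others stay flat points at the output that is filling its buffer.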
Workaround
Starting in RHOL 5.8, it is possible to isolate the inputs from the outputs so that an issue in one output (such as slowness or inability to deliver logs) does not impact the others. Use the multi log forwarder feature and configure a single output per clusterLogging/clusterLogForwarder.
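A minimal sketch of a dedicated forwarder with a single output, assuming the RHOL 5.8 multi log forwarder feature; the name, namespace, service account, and syslog endpoint below are placeholders, and the service account must be bound to the appropriate collect-*-logs cluster roles:

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: syslog-forwarder        # hypothetical name
  namespace: app-logging        # hypothetical namespace
spec:
  serviceAccountName: log-collector   # hypothetical SA with log collection permissions
  outputs:
  - name: syslogtest
    type: syslog
    url: udp://syslog.example.com:514  # placeholder endpoint
  pipelines:
  - name: to-syslog
    inputRefs:
    - application
    outputRefs:
    - syslogtest

With one output per forwarder, a blocked buffer in this forwarder's output no longer stops log reading for outputs managed by other forwarders.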
For more information, please open a new support case with Red Hat Support.
Root Cause
The collector defines only one source/input for each type of log. This can be observed by reviewing the collector configuration. Below is the source/input that reads the container/application logs:
$ oc get cm collector-config -n openshift-logging -o json|jq -r '.data."fluent.conf"'
...
# Logs from containers (including openshift containers)
<source>
@type tail
@id container-input
path "/var/log/pods/*/*/*.log"
exclude_path ["/var/log/pods/openshift-logging_collector-*/*/*.log", "/var/log/pods/openshift-logging_logfilesmetricexporter-*/*/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log", "/var/log/pods/openshift-logging_*/loki*/*.log", "/var/log/pods/openshift-logging_*/gateway/*.log", "/var/log/pods/openshift-logging_*/opa/*.log", "/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.tmp"]
pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
...
Regardless of how the parse/filter/outputs are defined, whether by default or through a custom pipeline in the clusterLogForwarder instance, all the pipelines share the same source/input and are not independent. As a result, when an output reaches the totalLimitSize defined for its buffer and the overflowAction is block (the default), the collector stops reading logs, impacting all the outputs.
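This behavior maps to the fluentd buffer section that the operator generates for each output. An illustrative excerpt (values are examples, not the exact generated config) looks like this:

# Inside the <match> section for one output in fluent.conf
<buffer>
  @type file
  path '/var/lib/fluentd/syslogtest'
  total_limit_size 100m        # from spec.forwarder.fluentd.buffer.totalLimitSize
  overflow_action block        # default: stop consuming input when the buffer is full
</buffer>

Because overflow_action block back-pressures the shared tail source, one full buffer halts reading for every pipeline fed by that source.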
Diagnostic Steps
Verify the collector type is fluentd (the same happens when using the collector type Vector, but the verification process is different):
$ oc get clusterlogging instance -n openshift-logging -o jsonpath='{.spec.collection}'
{"logs":{"type":"fluentd"}}
Verify the totalLimitSize defined. If no value is set, the default is 8G. In this example, it is 100m:
$ oc get clusterlogging instance -n openshift-logging -o jsonpath='{.spec.forwarder}'
{"fluentd":{"buffer":{"totalLimitSize":"100m"}}}
Verify whether one of the outputs defined in the clusterLogForwarder has reached the totalLimitSize in buffer size. In this example, the syslogtest output has reached the totalLimitSize of 100M:
$ for pod in $(oc get pods -l component=collector -o name -n openshift-logging); do echo -e "\n\n### $pod ###"; oc -n openshift-logging exec $pod -- /bin/bash -c "du -khs /var/lib/fluentd/*"; done
...
### pod/collector-zhcgw ###
Defaulted container "collector" out of: collector, logfilesmetricexporter
0 /var/lib/fluentd/default
24K /var/lib/fluentd/pos
0 /var/lib/fluentd/retry_default
102M /var/lib/fluentd/syslogtest <--- this
Confirm from the fluentd position files that fluentd has stopped reading logs by checking that the timestamps of the files are no longer updated:
$ for pod in $(oc get pods -l component=collector -o name -n openshift-logging); do echo -e "\n\n### $pod ###"; oc -n openshift-logging exec $pod -- /bin/bash -c "ls -l /var/lib/fluentd/pos"; done
...
### pod/collector-zhcgw ###
Defaulted container "collector" out of: collector, logfilesmetricexporter
total 24
-rw-------. 1 root root 65 Sep 21 07:21 acl-audit-log.pos
-rw-------. 1 root root 59 Sep 21 07:21 audit.log.pos
-rw-------. 1 root root 823 Sep 21 07:22 es-containers.log.pos
-rw-------. 1 root root 139 Sep 21 07:22 journal_pos.json
-rw-------. 1 root root 68 Sep 21 07:22 kube-apiserver.audit.log.pos
-rw-------. 1 root root 208 Sep 21 07:22 oauth-apiserver.audit.log
Verify the same from the OCP Console: log in, go to Observe > Metrics, and run the query:
sum by (plugin_id)(irate(fluentd_output_status_emit_count{plugin_id!~'object:.+'}[5m]))
To review a specific collector, run the query below, replacing the collector name collector-4ztzr used in the example with the one to review:
sum by (plugin_id)(irate(fluentd_output_status_emit_count{hostname=~"collector-4ztzr",plugin_id!~'object:.+'}[5m]))
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.