Troubleshooting OpenShift Container Platform 4: Cluster Logging with Elasticsearch and Fluentd
This article is for troubleshooting OpenShift Container Platform 4.x Cluster Logging only. For 3.x troubleshooting, see this related article.
This article is part of the OpenShift Container Platform 4.x troubleshooting series.
EFK Components
- kibana:
  - Kibana application (kibana container): a NodeJS-based UI tool used to visualize logs stored in Elasticsearch. Kibana queries Elasticsearch using the REST API and provides user-scoped configuration.
  - Kibana proxy (kibana-proxy container): a NodeJS application that intercepts all user requests and provides integration with OCP Single Sign-On.
- elasticsearch: The Elasticsearch cluster, where logs sent from the Fluentd pods are persisted. Elasticsearch is configured with the SearchGuard plugin, which provides user-scoped authorization.
- fluentd: Fluentd is a data collector that is deployed on each node (matching the nodeSelectors) and uses some additional privileges to gather logs from the running containers and send them to Elasticsearch.
- index-management: The elasticsearch-im-* pods run as cron jobs. They roll over indices and delete old indices to free disk space on Elasticsearch.
- logging-curator: Curator can be configured to periodically clean up old logs and free disk space on Elasticsearch. It communicates with Elasticsearch using the REST API, like the rest of the components. This is only used for indices created before OpenShift 4.5.
Index
- Basic maintenance operations
- Elasticsearch health
- Elasticsearch indices
- Troubleshoot unassigned shards
- Elasticsearch is out of disk
- Gathering Elasticsearch logs
- Fluentd troubleshooting
- Aggregated logging dump tool
Basic maintenance operations
- Querying Elasticsearch can usually be done from any node of the cluster. Because the API is protected by certificates, admin access requires the Elasticsearch admin certificate, which is kept in
/etc/elasticsearch/secret/. An example of an Elasticsearch API call is the following:
Export the Elasticsearch pod name:
$ es_pod=$(oc get pod --selector=component=elasticsearch --no-headers -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | head -n1)
Run the API query:
$ oc exec -c elasticsearch $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/<API_CALL>
- These API queries can be simplified using the built-in utility es_util:
$ oc exec -c elasticsearch $es_pod -- es_util --query=<API_CALL>
- Scaling up/down the ES deployments
=> Scale down
$ for pod in `oc get deployment.apps -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do oc scale deployment.apps/$pod --replicas=0; done
=> Scale up
$ for pod in `oc get deployment.apps -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do oc scale deployment.apps/$pod --replicas=1; done
Elasticsearch health
- Get cluster health. Cluster health reflects the global state of all the nodes and indices.
- RED means that at least one index is in RED state (its primary shard has not been recovered) or that the minimum number of Elasticsearch nodes (i.e. quorum) has not been reached. Writes are not allowed and the cluster is not accessible.
- YELLOW means that all indices are at least in this same state, i.e. at least one replica shard is not assigned/recovered.
- GREEN: all Elasticsearch nodes are running and all indices are fully recovered.
The health query also shows things like the pending tasks in case the cluster is recovering, the number of shards, the percentage of active shards, and the number of unassigned shards.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1502093481 08:11:21 logging-es green 3 3 11 5 0 0 0 0 - 100.0%
- Node status. Each Elasticsearch instance (pod) is internally called a node. This command shows basic information about all the nodes forming the cluster: which one is the master (marked with *), their names, the percentage of RAM used out of the total available, and the heap percentage used out of the total RAM used. Don't be confused by high RAM percentages: this is the maximum available to be used by the heap. An indicator of being low on memory is a high heap percentage close to ram.percent, which will require allocating more RAM.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/nodes?v
host ip heap.percent ram.percent load node.role master name
10.0.0.1 10.0.0.1 34 84 1.58 d m elasticsearch-1-example
10.0.0.2 10.0.0.2 41 99 2.11 d m elasticsearch-2-example
10.0.0.3 10.0.0.3 36 99 1.35 d m elasticsearch-3-example
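The heap-versus-RAM check above can be scripted. A minimal sketch, using the sample _cat/nodes rows plus one hypothetical memory-pressured node (elasticsearch-4-example is invented for illustration); in practice, pipe the es_util query output through the awk filter instead of the embedded sample:

```shell
# Sample _cat/nodes rows; elasticsearch-4-example is a hypothetical node
# added to show what a memory-pressured node would look like.
cat <<'EOF' > /tmp/nodes.txt
10.0.0.1 10.0.0.1 34 84 1.58 d m elasticsearch-1-example
10.0.0.2 10.0.0.2 41 99 2.11 d m elasticsearch-2-example
10.0.0.3 10.0.0.3 36 99 1.35 d m elasticsearch-3-example
10.0.0.4 10.0.0.4 92 99 3.70 d m elasticsearch-4-example
EOF
# Columns: host ip heap.percent ram.percent load node.role master name.
# Flag nodes whose heap percentage is within 15 points of their RAM percentage.
awk '$4 - $3 < 15 { print $8 " may be low on memory (heap " $3 "% vs ram " $4 "%)" }' /tmp/nodes.txt
```

Only the invented fourth node is flagged; the three real sample nodes have plenty of headroom between heap and RAM.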
Elasticsearch indices
- List indices
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?v
- List indices along with creation date
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?h=health,status,index,id,pri,rep,docs.count,docs.deleted,store.size,creation.date.string
- Inspect documents inside an index:
$ oc exec -c elasticsearch $es_pod -- es_util --query='<index_name>/_search?pretty'
- Delete an index. In case an index is corrupted or simply no longer necessary, it can be deleted.
$ oc exec -c elasticsearch $es_pod -- es_util --query=<index_name> -XDELETE
- Change the number of replicas. After changing the global number of desired replicas by modifying the ClusterLogging instance, the change only affects newly created indices. To change the number of replicas of one index, the following command can be used:
$ oc exec -c elasticsearch $es_pod -- es_util --query=<index_name>/_settings -d '{ "index" : { "number_of_replicas" : 2 } }' -XPUT
### Change the number of replicas for ALL indices
$ oc exec -c elasticsearch $es_pod -- es_util --query=*/_settings -d '{ "index" : { "number_of_replicas" : 2 } }' -XPUT
- Special indices: By default, the indices API query shows the health, status, document count, and size of each index. It is useful to know what to expect here. Some relevant index names are:
  - .kibana - Contains user information like predefined searches, dashboards, and other settings
  - .security - A metadata index used by Elasticsearch
  - app-XXXXXX - Contains the application pod logs
  - infra-XXXXXX - Contains OpenShift infrastructure logs
  - audit-XXXXXX - Contains the cluster audit logs
- In OpenShift 4.5 and earlier, the below indices may also be in the index list:
  - .searchguard.<pod_name> - Contains security information regarding SearchGuard initialization. It MUST contain 5 documents
  - .operations.YYYY.MM.DD - Logs coming from "operations" projects like "default". This will only appear on the logging-ops cluster if it exists
  - project.<project_name>.<uuid>.YYYY.MM.DD - Project-related index, created per day, containing all logs related to that project for that day
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open app-000005 1 0 926 0 405.3kb 405.3kb
green open .searchguard 1 2 5 0 83.1kb 27.7kb
green open .kibana 1 0 1 0 3kb 3kb
green open app-000006 1 0 155 0 158.9kb 158.9kb
- Shards. Each index is made up of 1+ primary shards and 0+ replica shards. The shards API shows where they are stored and whether each is a primary or replica shard (p/r).
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?v
index shard prirep state docs store ip node
app-000006 0 p STARTED 472 370.7kb 10.128.1.184 elasticsearch-1-example
app-000006 0 r STARTED 472 390.2kb 10.129.0.254 elasticsearch-2-example
- Unassigned shards. When there are unassigned shards, the health of the cluster is affected. The same shards API is used, but the columns displayed can be customized.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
app-000001 0 r UNASSIGNED NODE_LEFT
infra-000001 1 r UNASSIGNED NODE_LEFT
infra-000001 2 r UNASSIGNED NODE_LEFT
audit-000001 0 r UNASSIGNED NODE_LEFT
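When many shards are unassigned, a per-index summary is often quicker to read than the raw list. A minimal sketch that aggregates the sample output above with awk; in practice, pipe the es_util shards query through the same filter instead of the embedded sample:

```shell
# Sample UNASSIGNED rows from _cat/shards (index shard prirep state reason).
cat <<'EOF' > /tmp/unassigned.txt
app-000001 0 r UNASSIGNED NODE_LEFT
infra-000001 1 r UNASSIGNED NODE_LEFT
infra-000001 2 r UNASSIGNED NODE_LEFT
audit-000001 0 r UNASSIGNED NODE_LEFT
EOF
# Count unassigned shards per index (column 1).
awk '{count[$1]++} END {for (i in count) print i, count[i]}' /tmp/unassigned.txt | sort > /tmp/unassigned_summary.txt
cat /tmp/unassigned_summary.txt
```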
- Pending tasks. Elasticsearch defines tasks for any cluster-level change (e.g. create index, update mapping, allocate or fail shard) which is still to be executed.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/pending_tasks
{
"tasks": [
{
"insert_order": 101,
"priority": "URGENT",
"source": "create-index [foo_9], cause [api]",
"time_in_queue_millis": 86,
"time_in_queue": "86ms"
},
{
"insert_order": 46,
"priority": "HIGH",
"source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
"time_in_queue_millis": 842,
"time_in_queue": "842ms"
},
]
}
- Recovery. This is a view of index shard recoveries, both ongoing and previously completed. A recovery event occurs any time an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local store recovery and is the normal way for shards to be loaded from disk when a node starts up.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/recovery?v
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
.operations.2050.09.20 0 13ms store done n/a n/a node0 node-0 n/a n/a 0 0 100% 13 0 0 100% 9928 0 0 100.0%
- Thread pool. If Elasticsearch is silently dropping records due to bulk index rejections, the bulk.rejected count will be non-zero.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
192.168.1.100 192.168.1.100 0 0 3 0 0 0 0 0 0
192.168.1.101 192.168.1.101 0 0 0 0 0 0 0 0 0
192.168.1.102 192.168.1.102 0 0 0 0 0 0 0 0 0
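Checking for rejections can be automated by filtering the bulk.rejected column. A minimal sketch against the sample output above; in practice, pipe the es_util thread_pool query through the filter instead of the embedded sample:

```shell
# Sample _cat/thread_pool rows (host ip bulk.active bulk.queue bulk.rejected ...).
cat <<'EOF' > /tmp/thread_pool.txt
192.168.1.100 192.168.1.100 0 0 3 0 0 0 0 0 0
192.168.1.101 192.168.1.101 0 0 0 0 0 0 0 0 0
192.168.1.102 192.168.1.102 0 0 0 0 0 0 0 0 0
EOF
# Print any host with a non-zero bulk.rejected count (column 5).
awk '$5 > 0 { print $1 " has rejected " $5 " bulk requests" }' /tmp/thread_pool.txt
```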
Troubleshoot unassigned shards
Troubleshooting and restoring unassigned shards requires some deep knowledge of how Lucene and Elasticsearch work, but here are error/solution approaches for the most common cases.
For a quick explanation of the status and cause of unassigned shards, use the explain API:
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/allocation/explain?pretty
In many cases, the output will explain the issue and/or next steps, or further research can be done on the cause.
For a more hands-on investigation, or to investigate individual shards, the first step is to gather the real status of the indices and shards. Refer to the "unassigned shards" query above to identify them:
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
.operations.2050.05.28 0 p UNASSIGNED CLUSTER_RECOVERED
The reason explains why they are not assigned, but it is not always easy to know what to do to recover them, or whether recovery is even possible.
Reason: NODE_LEFT
This usually means that at some point the cluster had an extra node and this shard was assigned to that node. Possible causes:
- Elasticsearch pod scaled to more than 1 by mistake
- The Elasticsearch cluster for some reason now has one node less
- The persistent storage has been purged/removed/lost
If one Elasticsearch pod is in CrashLoopBackOff, is running but with not all containers ready, or is otherwise not in a healthy state, this can usually be solved by recovering the Elasticsearch pod. It may take some time for Elasticsearch to recover.
If the persistent volume is lost and cannot be recovered, this is solved by rerouting the shard to the "new" node (pay attention to the index name, the shard id and the target node):
$ oc rsh -c elasticsearch $es_pod
$ curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPOST 'https://localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
"allocate" : {
"index" : "app-000006",
"shard" : 0,
"node" : "10.10.10.101",
"allow_primary" : true
}
}
]
}'
This can also be solved by changing the number of replicas to 0 and then back to the expected number.
For further steps on rerouting shards, see this related solution.
Reason: CLUSTER_RECOVERED
This indicates that while Elasticsearch has recovered and is running, some shards exist that have not been assigned to a node. This frequently occurs for indices that cover periods of time when Elasticsearch was not running, or indices which have been pruned of data (due to age or some other reason) but were not removed from the ES metadata. This can also be caused if data is missing from the persistent storage.
For indices that are not expected to exist anymore (or at all), the indices can be deleted. If the index is expected to be working (i.e. if it was working previously), Elasticsearch should recover the data on its own after some time.
Other UNASSIGNED reasons
A shard can be in an unassigned status because:
- The data is corrupted
- The data is not there
- The shard metadata is corrupted or missing
For the first two cases, if there is no other shard with the data, there is little that can be done and the shard must be deleted. To check whether the data is there and intact, we need to access the persistent storage (data can be under 0 or 1, for many reasons not yet covered by this guide):
$ oc exec -c elasticsearch $es_pod -- ls -lR /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index
app-000006:
total 0
drwxr-xr-x. 5 1000 root 49 Oct 18 11:17 0
drwxr-xr-x. 2 1000 root 25 Oct 18 11:17 _state
app-000006/0:
total 0
drwxr-xr-x. 2 1000 root 25 Oct 18 11:17 _state
drwxr-xr-x. 2 1000 root 206 Oct 18 11:22 index
drwxr-xr-x. 2 1000 root 49 Oct 18 11:17 translog
app-000006/0/_state:
total 4
-rw-r--r--. 1 1000 root 126 Oct 18 11:17 state-25.st
app-000006/0/index:
total 124
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _0.cfe
-rw-r--r--. 1 1000 root 19564 Oct 3 07:28 _0.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _0.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _1.cfe
-rw-r--r--. 1 1000 root 19904 Oct 3 07:28 _1.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _1.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _2.cfe
-rw-r--r--. 1 1000 root 13780 Oct 3 07:28 _2.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _2.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _3.cfe
-rw-r--r--. 1 1000 root 29226 Oct 3 07:28 _3.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _3.si
-rw-r--r--. 1 1000 root 410 Oct 18 11:22 segments_o
-rw-r--r--. 1 1000 root 0 Oct 3 07:28 write.lock
app-000006/0/translog:
total 8
-rw-r--r--. 1 1000 root 43 Oct 18 11:17 translog-1.tlog
-rw-r--r--. 1 1000 root 20 Oct 18 11:17 translog.ckp
app-000006/_state:
total 4
-rw-r--r--. 1 1000 root 2398 Oct 18 11:17 state-21.st
In this case, files are listed under this path for both the Lucene index ($index/0/index) and the translog ($index/0/translog). If there were no such data, the index would most likely have to be deleted.
Now, confirm the health of the index:
$ index=app-000006
$ SHARD_PATH=/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index/0/index/
$ java -cp lib:/usr/share/java/elasticsearch/lib/lucene-core-5.5.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $SHARD_PATH
Segments file=segments_o numSegments=4 version=5.5.2 id=eotlkti4ft2r8tn0f1s0olgay format= userData={sync_id=AV8vOK7AWmARNjQAjneo, translog_generation=1, translog_uuid=-3GVLLRxQk2qELxESctRBQ}
1 of 4: name=_0 maxDoc=14
version=5.5.2
id=9c92b9qnr9xviphjsif15llfm
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.019
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014508694}
no deletions
test: open reader.........OK [took 0.102 sec]
test: check integrity.....OK [took 0.002 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.001 sec]
test: field norms.........OK [6 fields] [took 0.001 sec]
test: terms, freq, prox...OK [276 terms; 1670 terms/docs pairs; 956 tokens] [took 0.036 sec]
test: stored fields.......OK [28 total field count; avg 2.0 fields per doc] [took 0.015 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.018 sec]
2 of 4: name=_1 maxDoc=15
version=5.5.2
id=9c92b9qnr9xviphjsif15llfo
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.02
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014513710}
no deletions
test: open reader.........OK [took 0.007 sec]
test: check integrity.....OK [took 0.000 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [270 terms; 1807 terms/docs pairs; 1058 tokens] [took 0.032 sec]
test: stored fields.......OK [30 total field count; avg 2.0 fields per doc] [took 0.001 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]
3 of 4: name=_2 maxDoc=3
version=5.5.2
id=9c92b9qnr9xviphjsif15llfr
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.014
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014518731}
no deletions
test: open reader.........OK [took 0.018 sec]
test: check integrity.....OK [took 0.001 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [142 terms; 362 terms/docs pairs; 210 tokens] [took 0.010 sec]
test: stored fields.......OK [6 total field count; avg 2.0 fields per doc] [took 0.000 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]
4 of 4: name=_3 maxDoc=35
version=5.5.2
id=9c92b9qnr9xviphjsif15llft
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.029
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014523755}
no deletions
test: open reader.........OK [took 0.013 sec]
test: check integrity.....OK [took 0.002 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [443 terms; 4305 terms/docs pairs; 2530 tokens] [took 0.031 sec]
test: stored fields.......OK [70 total field count; avg 2.0 fields per doc] [took 0.004 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.011 sec]
No problems were detected with this index.
Took 0.441 sec total.
As no problems were detected, the problem should only be related to Elasticsearch metadata, so we can just reroute the shard to the same node:
es_util --query=_cluster/reroute -XPOST -d '{
"commands" : [ {
"allocate" : {
"index" : "app-000006",
"shard" : 0,
"node" : "10.128.2.127",
"allow_primary" : true
}
}
]
}'
For further information on rerouting shards, see this related solution
Elasticsearch is out of disk
Elasticsearch has a so-called disk watermark configured, which is a disk capacity threshold; once it is reached, the node will not store new replicas.
A log message similar to this can be seen:
[2017-05-22 08:28:37,222][INFO ][cluster.routing.allocation.decider] [Agron] low disk watermark [85%] exceeded on [HApfFHgJR36Fd2pRFd1iWQ][Agron][/elasticsearch/persistent/logging-es/data/logging-es/nodes/0] free: 36.3gb[14.5%], replicas will not be assigned to this node
The consequences are:
- Fluentd cannot send logs (for new indices)
- Kibana cannot create user data for a new user
One way to rule out this problem is to check the current value of the disk watermark:
- In the configuration file. The values shown are the defaults, meaning that if they are not set, these values are used.
elasticsearch.yml: |
cluster:
name: ${CLUSTER_NAME}
routing.allocation.disk.threshold_enabled: true
routing.allocation.disk.watermark.low: 85%
routing.allocation.disk.watermark.high: 90%
- Using the API. If nothing is shown, the defaults or the values set in the configuration file apply.
# oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/settings?pretty
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"disk" : {
"watermark" : {
"low" : "65%"
}
}
}
}
}
}
}
- Check the available space:
$ for pod in `oc get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
logging-es-qzw8xmt0-16-4drmg
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 10G 6.7G 3.4G 67% /elasticsearch/persistent
logging-es-smeuexjr-16-vc9cv
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 10G 7.7G 2.4G 77% /elasticsearch/persistent
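The Use% column can be compared against the low watermark directly. A minimal sketch using the sample numbers above and assuming the default 85% low watermark; in practice, feed it the pod names and Use% values from the df loop:

```shell
# Pod name and disk Use% taken from the sample df output above.
watermark=85   # default low watermark
cat <<'EOF' > /tmp/df_usage.txt
logging-es-qzw8xmt0-16-4drmg 67
logging-es-smeuexjr-16-vc9cv 77
EOF
# Report each node's standing relative to the low watermark.
while read pod use; do
  if [ "$use" -ge "$watermark" ]; then
    echo "$pod exceeds the low watermark ($use% >= $watermark%)"
  else
    echo "$pod is below the low watermark ($use% used)"
  fi
done < /tmp/df_usage.txt | tee /tmp/watermark_report.txt
```

With the sample numbers, neither node exceeds the 85% threshold, so replicas would still be assigned to both.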
If the disk watermark is being exceeded, it will be necessary either to expand the persistent volume or to delete indices. Logs are regularly pruned by the Elasticsearch IndexManagement pods. To avoid space issues, logs can be pruned after a smaller number of days. For more information, see the retention documentation.
Gathering Elasticsearch logs
4.6+
In current versions, the Elasticsearch logs can be viewed with the oc logs command:
$ oc logs -c elasticsearch $es_pod
4.0-4.5
Elasticsearch is configured to write to several log files in addition to standard output. These files are found under /elasticsearch/persistent/logging-es/logs/ inside the pods. There are a few log files:
- elasticsearch_deprecation.log --> Logging of deprecated actions that should be migrated in the future
- elasticsearch_index_indexing_slowlog.log --> Logs information about slow indexing
- elasticsearch_index_search_slowlog.log --> Logs slow searches
- elasticsearch.log --> Application log
The most commonly useful log is /elasticsearch/persistent/logging-es/logs/elasticsearch.log
Fluentd - Troubleshooting
- Review the buffer sizes inside the collector pods:
$ oc project openshift-logging
$ for i in $(oc get pods -l component=collector --no-headers | grep -i running | awk '{print $1}'); do echo $i; oc exec $i -- /bin/bash -c "du -khs /var/lib/fluentd/*"; done
Aggregated logging dump tool
- OpenShift 4.x CLO Must-gather
The cluster-logging-must-gather is a tool built on top of OpenShift must-gather that expands its capabilities to gather OpenShift Cluster Logging information.
Follow the steps in the documentation "Collecting OpenShift Logging data" to gather the logs needed for analysis.
Gathering this will provide the configurations, logs, and other information required to troubleshoot the cluster logging stack.