Troubleshooting OpenShift Container Platform 4: Cluster Logging with Elasticsearch and Fluentd
This article is for troubleshooting OpenShift Container Platform 4.x Cluster Logging only. For 3.x troubleshooting, see this related article.
This article is part of the OpenShift Container Platform 4.x troubleshooting series.
EFK Components
- kibana:
  - Kibana application (kibana container): a NodeJS-based UI tool used to visualize logs stored in Elasticsearch. Kibana queries Elasticsearch using the REST API and provides user-scoped configuration.
  - Kibana proxy (kibana-proxy container): a NodeJS application that intercepts all user requests and provides integration with OCP Single Sign-On.
- elasticsearch: The Elasticsearch cluster, where logs sent from the Fluentd pods are persisted. Elasticsearch is configured with the SearchGuard plugin, which provides user-scoped authorization.
- fluentd: Fluentd is a data collector that is deployed on each node (matching the nodeSelectors) and uses some additional privileges to gather logs from the running containers and send them to Elasticsearch.
- index-management: The elasticsearch-im-* pods run as cron jobs. They roll over indices and delete old indices to free disk space on Elasticsearch.
- logging-curator: Curator can be configured to periodically clean up old logs and free disk space on Elasticsearch. It communicates with Elasticsearch using the REST API, like the rest of the components. This is only used for indices created before OpenShift 4.5.
Index
- Basic maintenance operations
- Elasticsearch health
- Elasticsearch indices
- Troubleshoot unassigned shards
- Elasticsearch is out of disk
- Gathering Elasticsearch logs
- Fluentd troubleshooting
- Aggregated logging dump tool
Basic maintenance operations
- Querying Elasticsearch can usually be done from any node of the cluster. Because the API is protected by certificates, admin access requires the Elasticsearch admin certificate, which is kept in
/etc/elasticsearch/secret/. An example of an Elasticsearch API call is the following:
Export the Elasticsearch pod name:
$ es_pod=$(oc get pod --selector=component=elasticsearch --no-headers -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | head -n1)
Run the API query:
$ oc exec -c elasticsearch $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/<API_CALL>
- These API queries can be simplified using the built-in utility es_util:
$ oc exec -c elasticsearch $es_pod -- es_util --query=<API_CALL>
- Scaling up/down the ES deployments
=> Scale down
$ for pod in `oc get deployment.apps -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do oc scale deployment.apps/$pod --replicas=0; done
=> Scale up
$ for pod in `oc get deployment.apps -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do oc scale deployment.apps/$pod --replicas=1; done
Elasticsearch health
- Get cluster health. Cluster health reflects the global state of all the nodes and indices.
- RED means that at least one index is in RED state (its primary shard has not been recovered) or that the minimum number of Elasticsearch nodes (i.e. quorum) has not been reached. Writes are not allowed and the cluster is not accessible.
- YELLOW means that all indices are at least in this same state, i.e. at least one replica shard is not assigned/recovered.
- GREEN: all Elasticsearch nodes are running and all indices are fully recovered.
The health query also shows things like the pending tasks in case the cluster is recovering, the number of shards, the percentage of active shards, and the number of unassigned shards.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1502093481 08:11:21 logging-es green 3 3 11 5 0 0 0 0 - 100.0%
- Node status. Each Elasticsearch instance (pod) is internally called a node. This command shows basic information about all the nodes forming the cluster: which one is the master (marked with *), their names, the percentage of RAM used out of the total available, and the heap percentage used out of the total RAM used. Don't be confused by high RAM percentages: this is the maximum available to be used by the heap. An indicator of being low on memory is a high heap percentage close to ram.percent, which will require allocating more RAM.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/nodes?v
host ip heap.percent ram.percent load node.role master name
10.0.0.1 10.0.0.1 34 84 1.58 d m elasticsearch-1-example
10.0.0.2 10.0.0.2 41 99 2.11 d m elasticsearch-2-example
10.0.0.3 10.0.0.3 36 99 1.35 d m elasticsearch-3-example
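The heap-versus-RAM check above can be scripted. A minimal sketch, using the sample _cat/nodes rows plus one hypothetical memory-pressured node (elasticsearch-4-example is invented for illustration); in practice, pipe the es_util query output through the awk filter instead of the embedded sample:

```shell
# Sample _cat/nodes rows; elasticsearch-4-example is a hypothetical node
# added to show what a memory-pressured node would look like.
cat <<'EOF' > /tmp/nodes.txt
10.0.0.1 10.0.0.1 34 84 1.58 d m elasticsearch-1-example
10.0.0.2 10.0.0.2 41 99 2.11 d m elasticsearch-2-example
10.0.0.3 10.0.0.3 36 99 1.35 d m elasticsearch-3-example
10.0.0.4 10.0.0.4 92 99 3.70 d m elasticsearch-4-example
EOF
# Columns: host ip heap.percent ram.percent load node.role master name.
# Flag nodes whose heap percentage is within 15 points of their RAM percentage.
awk '$4 - $3 < 15 { print $8 " may be low on memory (heap " $3 "% vs ram " $4 "%)" }' /tmp/nodes.txt
```

Only the invented fourth node is flagged; the three real sample nodes have plenty of headroom between heap and RAM.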
Elasticsearch indices
- List indices
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?v
- List indices along with creation date
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?h=health,status,index,id,pri,rep,docs.count,docs.deleted,store.size,creation.date.string
- Inspect documents inside an index:
$ oc exec -c elasticsearch $es_pod -- es_util --query='<index_name>/_search?pretty'
- Delete an index. In case an index is corrupted or simply no longer necessary, it can be deleted.
$ oc exec -c elasticsearch $es_pod -- es_util --query=<index_name> -XDELETE
- Change the number of replicas. After changing the global number of desired replicas by modifying the ClusterLogging instance, the change only affects newly created indices. To change the number of replicas of one index, the following command can be used:
$ oc exec -c elasticsearch $es_pod -- es_util --query=<index_name>/_settings -d '{ "index" : { "number_of_replicas" : 2 } }' -XPUT
### Change the number of replicas for ALL indices
$ oc exec -c elasticsearch $es_pod -- es_util --query=*/_settings -d '{ "index" : { "number_of_replicas" : 2 } }' -XPUT
- Special indices: By default, the indices API query shows the health, status, document count, and size of each index. It is useful to know what to expect here. Some relevant index names are:
  - .kibana - Contains user information like predefined searches, dashboards, and other settings
  - .security - A metadata index used by Elasticsearch
  - app-XXXXXX - Contains the application pod logs
  - infra-XXXXXX - Contains OpenShift infrastructure logs
  - audit-XXXXXX - Contains the cluster audit logs
- In OpenShift 4.5 and earlier, the below indices may also be in the index list:
  - .searchguard.<pod_name> - Contains security information regarding SearchGuard initialization. It MUST contain 5 documents
  - .operations.YYYY.MM.DD - Logs coming from "operations" projects like "default". This will only appear on the logging-ops cluster if it exists
  - project.<project_name>.<uuid>.YYYY.MM.DD - Project-related index, created per day, containing all logs related to that project for that day
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open app-000005 1 0 926 0 405.3kb 405.3kb
green open .searchguard 1 2 5 0 83.1kb 27.7kb
green open .kibana 1 0 1 0 3kb 3kb
green open app-000006 1 0 155 0 158.9kb 158.9kb
- Shards. Each index is made up of 1+ primary shards and 0+ replica shards. The shards API shows where they are stored and whether each is a primary or replica shard (p/r).
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?v
index shard prirep state docs store ip node
app-000006 0 p STARTED 472 370.7kb 10.128.1.184 elasticsearch-1-example
app-000006 0 r STARTED 472 390.2kb 10.129.0.254 elasticsearch-2-example
- Unassigned shards. When there are unassigned shards, the health of the cluster is affected. The same shards API is used, but the columns displayed can be customized.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
app-000001 0 r UNASSIGNED NODE_LEFT
infra-000001 1 r UNASSIGNED NODE_LEFT
infra-000001 2 r UNASSIGNED NODE_LEFT
audit-000001 0 r UNASSIGNED NODE_LEFT
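When many shards are unassigned, a per-index summary is often quicker to read than the raw list. A minimal sketch that aggregates the sample output above with awk; in practice, pipe the es_util shards query through the same filter instead of the embedded sample:

```shell
# Sample UNASSIGNED rows from _cat/shards (index shard prirep state reason).
cat <<'EOF' > /tmp/unassigned.txt
app-000001 0 r UNASSIGNED NODE_LEFT
infra-000001 1 r UNASSIGNED NODE_LEFT
infra-000001 2 r UNASSIGNED NODE_LEFT
audit-000001 0 r UNASSIGNED NODE_LEFT
EOF
# Count unassigned shards per index (column 1).
awk '{count[$1]++} END {for (i in count) print i, count[i]}' /tmp/unassigned.txt | sort > /tmp/unassigned_summary.txt
cat /tmp/unassigned_summary.txt
```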
- Pending tasks. Elasticsearch defines tasks for any cluster-level change (e.g. create index, update mapping, allocate or fail shard) which is still to be executed.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/pending_tasks
{
"tasks": [
{
"insert_order": 101,
"priority": "URGENT",
"source": "create-index [foo_9], cause [api]",
"time_in_queue_millis": 86,
"time_in_queue": "86ms"
},
{
"insert_order": 46,
"priority": "HIGH",
"source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
"time_in_queue_millis": 842,
"time_in_queue": "842ms"
},
]
}
- Recovery. This is a view of index shard recoveries, both ongoing and previously completed. A recovery event occurs any time an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local store recovery and is the normal way for shards to be loaded from disk when a node starts up.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/recovery?v
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
.operations.2050.09.20 0 13ms store done n/a n/a node0 node-0 n/a n/a 0 0 100% 13 0 0 100% 9928 0 0 100.0%
- Thread pool. If Elasticsearch is silently dropping records due to bulk index rejections, the bulk.rejected count will be non-zero.
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
192.168.1.100 192.168.1.100 0 0 3 0 0 0 0 0 0
192.168.1.101 192.168.1.101 0 0 0 0 0 0 0 0 0
192.168.1.102 192.168.1.102 0 0 0 0 0 0 0 0 0
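Checking for rejections can be automated by filtering the bulk.rejected column. A minimal sketch against the sample output above; in practice, pipe the es_util thread_pool query through the filter instead of the embedded sample:

```shell
# Sample _cat/thread_pool rows (host ip bulk.active bulk.queue bulk.rejected ...).
cat <<'EOF' > /tmp/thread_pool.txt
192.168.1.100 192.168.1.100 0 0 3 0 0 0 0 0 0
192.168.1.101 192.168.1.101 0 0 0 0 0 0 0 0 0
192.168.1.102 192.168.1.102 0 0 0 0 0 0 0 0 0
EOF
# Print any host with a non-zero bulk.rejected count (column 5).
awk '$5 > 0 { print $1 " has rejected " $5 " bulk requests" }' /tmp/thread_pool.txt
```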
Troubleshoot unassigned shards
Troubleshooting and restoring unassigned shards requires some deep knowledge of how Lucene and Elasticsearch work, but here are error/solution approaches for the most common cases.
For a quick explanation of the status and cause of unassigned shards, use the explain API:
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/allocation/explain?pretty
In many cases, the output will explain the issue and/or next steps, or further research can be done on the cause.
For a more hands-on investigation, or to investigate individual shards, the first step is to gather the real status of the indices and shards. Refer to the "unassigned shards" query above to identify them:
$ oc exec -c elasticsearch $es_pod -- es_util --query=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
.operations.2050.05.28 0 p UNASSIGNED CLUSTER_RECOVERED
The reason explains why they are not assigned, but it is not always easy to know what to do to recover them, or whether recovery is even possible.
Reason: NODE_LEFT
This usually means that at some point the cluster had an extra node and this shard was assigned to that node. Possible causes:
- Elasticsearch pod scaled to more than 1 by mistake
- The Elasticsearch cluster for some reason now has one node less
- The persistent storage has been purged/removed/lost
If one Elasticsearch pod is in CrashLoopBackOff, is running but with not all containers ready, or is otherwise not in a healthy state, this can usually be solved by recovering the Elasticsearch pod. It may take some time for Elasticsearch to recover.
If the persistent volume is lost and cannot be recovered, this is solved by rerouting the shard to the "new" node (pay attention to the index name, the shard id and the target node):
$ oc rsh -c elasticsearch $es_pod
$ curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPOST 'https://localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
"allocate" : {
"index" : "app-000006",
"shard" : 0,
"node" : "10.10.10.101",
"allow_primary" : true
}
}
]
}'
This can also be solved by changing the number of replicas to 0 and then back to the expected number.
For further steps on rerouting shards, see this related solution.
Reason: CLUSTER_RECOVERED
This indicates that while Elasticsearch has recovered and is running, some shards exist that have not been assigned to a node. This frequently occurs for indices that cover periods of time when Elasticsearch was not running, or indices which have been pruned of data (due to age or some other reason) but were not removed from the ES metadata. This can also be caused if data is missing from the persistent storage.
For indices that are not expected to exist anymore (or at all), the indices can be deleted. If the index is expected to be working (i.e. if it was working previously), Elasticsearch should recover the data on its own after some time.
Other UNASSIGNED reasons
A shard can be in an unassigned status because:
- The data is corrupted
- The data is not there
- The shard metadata is corrupted or missing
For the first two cases, if there is no other shard with the data, there is little that can be done and the shard must be deleted. To check whether the data is there and intact, we need to access the persistent storage (data can be under 0 or 1, for many reasons not yet covered by this guide):
$ oc exec -c elasticsearch $es_pod -- ls -lR /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index
app-000006:
total 0
drwxr-xr-x. 5 1000 root 49 Oct 18 11:17 0
drwxr-xr-x. 2 1000 root 25 Oct 18 11:17 _state
app-000006/0:
total 0
drwxr-xr-x. 2 1000 root 25 Oct 18 11:17 _state
drwxr-xr-x. 2 1000 root 206 Oct 18 11:22 index
drwxr-xr-x. 2 1000 root 49 Oct 18 11:17 translog
app-000006/0/_state:
total 4
-rw-r--r--. 1 1000 root 126 Oct 18 11:17 state-25.st
app-000006/0/index:
total 124
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _0.cfe
-rw-r--r--. 1 1000 root 19564 Oct 3 07:28 _0.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _0.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _1.cfe
-rw-r--r--. 1 1000 root 19904 Oct 3 07:28 _1.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _1.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _2.cfe
-rw-r--r--. 1 1000 root 13780 Oct 3 07:28 _2.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _2.si
-rw-r--r--. 1 1000 root 363 Oct 3 07:28 _3.cfe
-rw-r--r--. 1 1000 root 29226 Oct 3 07:28 _3.cfs
-rw-r--r--. 1 1000 root 374 Oct 3 07:28 _3.si
-rw-r--r--. 1 1000 root 410 Oct 18 11:22 segments_o
-rw-r--r--. 1 1000 root 0 Oct 3 07:28 write.lock
app-000006/0/translog:
total 8
-rw-r--r--. 1 1000 root 43 Oct 18 11:17 translog-1.tlog
-rw-r--r--. 1 1000 root 20 Oct 18 11:17 translog.ckp
app-000006/_state:
total 4
-rw-r--r--. 1 1000 root 2398 Oct 18 11:17 state-21.st
In this case, files are listed under this path for both the Lucene index ($index/0/index) and the translog ($index/0/translog). If there were no such data, the index would most likely have to be deleted.
Now, confirm the health of the index:
$ index=app-000006
$ SHARD_PATH=/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index/0/index/
$ java -cp lib:/usr/share/java/elasticsearch/lib/lucene-core-5.5.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $SHARD_PATH
Segments file=segments_o numSegments=4 version=5.5.2 id=eotlkti4ft2r8tn0f1s0olgay format= userData={sync_id=AV8vOK7AWmARNjQAjneo, translog_generation=1, translog_uuid=-3GVLLRxQk2qELxESctRBQ}
1 of 4: name=_0 maxDoc=14
version=5.5.2
id=9c92b9qnr9xviphjsif15llfm
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.019
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014508694}
no deletions
test: open reader.........OK [took 0.102 sec]
test: check integrity.....OK [took 0.002 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.001 sec]
test: field norms.........OK [6 fields] [took 0.001 sec]
test: terms, freq, prox...OK [276 terms; 1670 terms/docs pairs; 956 tokens] [took 0.036 sec]
test: stored fields.......OK [28 total field count; avg 2.0 fields per doc] [took 0.015 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.018 sec]
2 of 4: name=_1 maxDoc=15
version=5.5.2
id=9c92b9qnr9xviphjsif15llfo
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.02
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014513710}
no deletions
test: open reader.........OK [took 0.007 sec]
test: check integrity.....OK [took 0.000 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [270 terms; 1807 terms/docs pairs; 1058 tokens] [took 0.032 sec]
test: stored fields.......OK [30 total field count; avg 2.0 fields per doc] [took 0.001 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]
3 of 4: name=_2 maxDoc=3
version=5.5.2
id=9c92b9qnr9xviphjsif15llfr
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.014
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014518731}
no deletions
test: open reader.........OK [took 0.018 sec]
test: check integrity.....OK [took 0.001 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [142 terms; 362 terms/docs pairs; 210 tokens] [took 0.010 sec]
test: stored fields.......OK [6 total field count; avg 2.0 fields per doc] [took 0.000 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]
4 of 4: name=_3 maxDoc=35
version=5.5.2
id=9c92b9qnr9xviphjsif15llft
codec=Lucene54
compound=true
numFiles=3
size (MB)=0.029
diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014523755}
no deletions
test: open reader.........OK [took 0.013 sec]
test: check integrity.....OK [took 0.002 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [32 fields] [took 0.000 sec]
test: field norms.........OK [6 fields] [took 0.000 sec]
test: terms, freq, prox...OK [443 terms; 4305 terms/docs pairs; 2530 tokens] [took 0.031 sec]
test: stored fields.......OK [70 total field count; avg 2.0 fields per doc] [took 0.004 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.011 sec]
No problems were detected with this index.
Took 0.441 sec total.
As no problems were detected, the problem should only be related to Elasticsearch metadata, so we can just reroute the shard to the same node:
es_util --query=_cluster/reroute -XPOST -d '{
"commands" : [ {
"allocate" : {
"index" : "app-000006",
"shard" : 0,
"node" : "10.128.2.127",
"allow_primary" : true
}
}
]
}'
For further information on rerouting shards, see this related solution
Elasticsearch is out of disk
Elasticsearch has a so-called disk watermark configured, which is a disk capacity threshold; once it is reached, the node will not store new replicas.
A log message similar to this can be seen:
[2017-05-22 08:28:37,222][INFO ][cluster.routing.allocation.decider] [Agron] low disk watermark [85%] exceeded on [HApfFHgJR36Fd2pRFd1iWQ][Agron][/elasticsearch/persistent/logging-es/data/logging-es/nodes/0] free: 36.3gb[14.5%], replicas will not be assigned to this node
The consequences are:
- Fluentd cannot send logs (for new indices)
- Kibana cannot create user data for a new user
One way to rule out this problem is to check the current value of the disk watermark:
- In the configuration file. The values shown are the defaults, meaning that if they are not set, these values are used.
elasticsearch.yml: |
cluster:
name: ${CLUSTER_NAME}
routing.allocation.disk.threshold_enabled: true
routing.allocation.disk.watermark.low: 85%
routing.allocation.disk.watermark.high: 90%
- Using the API. If nothing is shown, the defaults or the values set in the configuration file apply.
# oc exec -c elasticsearch $es_pod -- es_util --query=_cluster/settings?pretty
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"disk" : {
"watermark" : {
"low" : "65%"
}
}
}
}
}
}
}
- Check the available space:
$ for pod in `oc get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
logging-es-qzw8xmt0-16-4drmg
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 10G 6.7G 3.4G 67% /elasticsearch/persistent
logging-es-smeuexjr-16-vc9cv
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 10G 7.7G 2.4G 77% /elasticsearch/persistent
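The Use% column can be compared against the low watermark directly. A minimal sketch using the sample numbers above and assuming the default 85% low watermark; in practice, feed it the pod names and Use% values from the df loop:

```shell
# Pod name and disk Use% taken from the sample df output above.
watermark=85   # default low watermark
cat <<'EOF' > /tmp/df_usage.txt
logging-es-qzw8xmt0-16-4drmg 67
logging-es-smeuexjr-16-vc9cv 77
EOF
# Report each node's standing relative to the low watermark.
while read pod use; do
  if [ "$use" -ge "$watermark" ]; then
    echo "$pod exceeds the low watermark ($use% >= $watermark%)"
  else
    echo "$pod is below the low watermark ($use% used)"
  fi
done < /tmp/df_usage.txt | tee /tmp/watermark_report.txt
```

With the sample numbers, neither node exceeds the 85% threshold, so replicas would still be assigned to both.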
If the disk watermark is being exceeded, it will be necessary either to expand the persistent volume or to delete indices. Logs are regularly pruned by the Elasticsearch IndexManagement pods. To avoid space issues, logs can be pruned after a smaller number of days. For more information, see the retention documentation.
Gathering Elasticsearch logs
4.6+
In current versions, the Elasticsearch logs can be viewed with the oc logs command:
$ oc logs -c elasticsearch $es_pod
4.0-4.5
Elasticsearch is configured to write to several log files in addition to standard output. These files are found under /elasticsearch/persistent/logging-es/logs/ inside the pods. There are a few log files:
- elasticsearch_deprecation.log --> Logging of deprecated actions that should be migrated in the future
- elasticsearch_index_indexing_slowlog.log --> Logs information about slow indexing
- elasticsearch_index_search_slowlog.log --> Logs slow searches
- elasticsearch.log --> Application log
The most commonly useful log is /elasticsearch/persistent/logging-es/logs/elasticsearch.log
Fluentd - Troubleshooting
- Review the buffer sizes inside the collector pods:
$ oc project openshift-logging
$ for i in $(oc get pods -l component=collector --no-headers | grep -i running | awk '{print $1}'); do echo $i; oc exec $i -- /bin/bash -c "du -khs /var/lib/fluentd/*"; done
Aggregated logging dump tool
- OpenShift 4.x CLO Must-gather
The cluster-logging-must-gather is a tool built on top of OpenShift must-gather that expands its capabilities to gather OpenShift Cluster Logging information.
Follow the steps in the documentation "Collecting OpenShift Logging data" to gather the logs needed for analysis.
Gathering this will provide the configurations, logs, and other information required to troubleshoot the cluster logging stack.