Troubleshooting OpenShift Container Platform 3.x: Aggregating Container Logging

This article is specific to OpenShift 3.x Logging. For help with OpenShift 4.x Logging, refer to the corresponding 4.x article.

EFK Components

  • logging-kibana:

    • Kibana application (kibana container): a NodeJS-based UI tool used to visualize logs stored in Elasticsearch. Kibana queries Elasticsearch using the REST API and provides user-scoped configuration.

    • Kibana proxy (kibana-proxy container): a NodeJS application that intercepts all user requests and provides integration with OCP Single Sign-On.

  • logging-elasticsearch: The Elasticsearch cluster, where logs sent from the Fluentd pods are persisted. Elasticsearch is configured with the SearchGuard plugin, which provides user-scoped authorization.

    • The SearchGuard index is part of the logging cluster's infrastructure; these indices are thrown away and rebuilt when the pod respawns, and do not contain any application data. Although Elasticsearch views this as blocking the cluster from starting up, in reality it is likely just an index that should have been deleted when the pod terminated but was not.
  • logging-fluentd: Fluentd is a data collector deployed on each node (matching the nodeSelectors) that uses some additional privileges to gather logs from the running containers and send them to Elasticsearch.

  • logging-curator: Curator can be configured to periodically clean up old logs and free disk space in Elasticsearch. Like the rest of the components, it communicates with Elasticsearch through the REST API.

NOTE: For the OPS cluster, you usually need to replace references to es with es-ops.

Basic maintenance operations

  • Querying Elasticsearch can usually be done from any node of the cluster. As the API is protected by certificates, admin access requires the Elasticsearch admin certificates, which are kept in /etc/elasticsearch/secret/. An example Elasticsearch API call:
$ es_pod=$(oc get po --selector=component=es --no-headers -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | head -n1)
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/<API_CALL>
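When running several API queries from inside a pod (oc rsh $es_pod), a small wrapper saves retyping the certificate flags. This is only a convenience sketch; es_curl is a hypothetical helper, not part of the image:

```shell
# Hypothetical wrapper for authenticated Elasticsearch API calls; meant to be
# run inside an Elasticsearch pod, where the admin certificates are mounted.
es_curl() {   # usage: es_curl <API_CALL>, e.g. es_curl _cat/health?v
  curl -s \
    --key    /etc/elasticsearch/secret/admin-key \
    --cert   /etc/elasticsearch/secret/admin-cert \
    --cacert /etc/elasticsearch/secret/admin-ca \
    "https://localhost:9200/$1"
}
```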
  • Restart the Elasticsearch cluster: To gracefully restart the Elasticsearch nodes, scale all the DeploymentConfigs down to 0 and then back up.
# Scale down to 0 all the DCs
$ for dc in `oc get dc -o=name --selector=component=es`; do oc scale $dc --replicas=0; done
deploymentconfig "logging-es-48cw9b87" scaled
deploymentconfig "logging-es-e4qolijk" scaled
deploymentconfig "logging-es-op469nvi" scaled


# Confirm all logging-es pods are terminated
$ oc get po --selector=component=es
No resources found.

# Scale up to 1 all the DCs
$ for dc in `oc get dc -o=name --selector=component=es`; do oc scale $dc --replicas=1; done
deploymentconfig "logging-es-48cw9b87" scaled
deploymentconfig "logging-es-e4qolijk" scaled
deploymentconfig "logging-es-op469nvi" scaled
  • Retrieve the exact version of the Docker images for each of the running pods. Each Dockerfile kept under /root/buildinfo is named after its component, including the version and revision.
$ oc get po -o 'go-template={{range $pod := .items}}{{if eq $pod.status.phase "Running"}}{{range $container := $pod.spec.containers}}oc exec -c {{$container.name}} {{$pod.metadata.name}} -- find /root/buildinfo -name Dockerfile-openshift* | grep -o logging.* {{"\n"}}{{end}}{{end}}{{end}}' | bash -
logging-curator-3.5.0-16
logging-elasticsearch-3.5.0-29
logging-fluentd-3.5.0-18
logging-fluentd-3.5.0-18
logging-kibana-3.5.0-17
logging-auth-proxy-3.5.0-18
  • List indices
$ oc exec $es_pod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v 
  • Delete an index. In case an index is corrupted or simply no longer necessary, it can be deleted.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XDELETE https://localhost:9200/<index_name>
  • Change the number of replicas. After changing the global setting for the number of desired replicas and running the playbook, the change will only affect newly created indices. To change the number of replicas of an existing index, the following command can be used (remember to set the right number as the value for number_of_replicas):
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPUT https://localhost:9200/<index_name>/_settings -d '{ "index" : { "number_of_replicas" : 2 } }'
# Change the number of replicas for ALL indices
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPUT https://localhost:9200/*/_settings -d '{ "index" : { "number_of_replicas" : 2 } }'

Elasticsearch health

  • Get cluster health. Cluster health summarizes the global state of all the nodes and indices.
    • RED state means that at least one index is in RED state (its primary shard has not been recovered) or that the minimum number of Elasticsearch nodes (i.e. QUORUM) has not been reached. Writes are not allowed and the cluster is not accessible.
    • YELLOW state means that all indices are at least in this same state, i.e. at least one replica shard is not assigned/recovered.
    • GREEN: All ES nodes are running and all indices are fully recovered.
      The health query will also show things like the pending tasks (in case the cluster is recovering), the number of shards, the percentage of active shards and the unassigned shards.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/health?v
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 
1502093481 08:11:21  logging-es green           3         3     11   5    0    0        0             0                  -                100.0%
  • Nodes status. Shows basic information about all the nodes forming the cluster: which one is the master (marked with *), their names, and the percentage of RAM used out of the total available, along with the heap percentage used out of the total RAM used. Don't be confused by high values of ram.percent, because this is the maximum available to be used by the heap. An indicator of being low on memory is a high heap percentage close to ram.percent, which will require allocating more RAM.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/nodes?v
host       ip         heap.percent ram.percent load node.role master name           
10.0.0.1  10.0.0.1            34          84 1.58 d         m      Buzzard    
10.0.0.2  10.0.0.2            41          99 2.11 d         m      Tony Stark 
10.0.0.3  10.0.0.3            36          99 1.35 d         m      Orphan  
  • Indices. By default the indices API query will show the health, status, documents and size data about each index. It is useful to know what to expect here. Some relevant index names are:
    • .kibana - Contains user information like predefined searches, dashboards and other settings
    • .searchguard.<pod_name> - Contains security information regarding SearchGuard initialization. It MUST contain 5 documents
    • .operations.YYYY.MM.DD - Logs coming from "operations" projects like "default". This will only appear on logging-ops cluster if it exists
    • project.<project_name>.<uuid>.YYYY.MM.DD - Project related index created by day containing all logs related to that project for that day
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/indices?v
health status index                                                           pri rep docs.count docs.deleted store.size pri.store.size 
green  open   project.logging.a439b3dc-7acc-11e7-81e0-fa163e44925b.2017.08.07   1   0        926            0    405.3kb        405.3kb 
green  open   .searchguard.logging-es-e4qolijk-1-dfsmt                          1   2          5            0     83.1kb         27.7kb 
green  open   .kibana                                                           1   0          1            0        3kb            3kb 
green  open   .searchguard.logging-es-op469nvi-1-19f9q                          1   2          5            0     83.1kb         27.7kb 
green  open   .operations.2017.08.07                                            1   0      46053            0       20mb           20mb 
green  open   project.jenkins.ae12be1e-71c8-11e7-9c02-525400f41652.2017.08.07   1   0        155            0    158.9kb        158.9kb 
green  open   .searchguard.logging-es-48cw9b87-1-r7tfk                          1   2          5            0     83.1kb         27.7kb 
  • Shards. The shards API shows where the shards are stored and whether they are primary or replica shards (p/r).
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/shards?v
index                                                           shard prirep state   docs   store ip           node           
.operations.2017.09.04                                          0     p      STARTED  472 370.7kb 10.128.1.184 American Eagle 
.operations.2017.09.04                                          0     r      STARTED  472 390.2kb 10.129.0.254 Scrier         
  • Unassigned shards. When there are some unassigned shards, the health of the cluster will be affected. The same shards API is used but the columns displayed can be customized.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
.operations.2017.05.28                                                                    0 p UNASSIGNED CLUSTER_RECOVERED
.operations.2017.05.28                                                                    0 r UNASSIGNED CLUSTER_RECOVERED
.operations.2017.05.28                                                                    0 r UNASSIGNED CLUSTER_RECOVERED
.operations.2017.05.28                                                                    0 r UNASSIGNED CLUSTER_RECOVERED

# In some versions, one of the following commands should be used instead
$ oc -c elasticsearch exec $es_pod -- es_util --query=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED

or

$ oc -c elasticsearch exec $es_pod -- QUERY=_cat/shards?h=index,shard,prirep,state,unassigned.reason,node  es_util | grep UNASSIGNED
  • Pending tasks. Elasticsearch defines tasks for any cluster-level change (e.g. create index, update mapping, allocate or fail shard) which is still to be executed.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cluster/pending_tasks
{
   "tasks": [
      {
         "insert_order": 101,
         "priority": "URGENT",
         "source": "create-index [foo_9], cause [api]",
         "time_in_queue_millis": 86,
         "time_in_queue": "86ms"
      },
      {
         "insert_order": 46,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
         "time_in_queue_millis": 842,
         "time_in_queue": "842ms"
      },
      {
         "insert_order": 45,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
         "time_in_queue_millis": 858,
         "time_in_queue": "858ms"
      }
  ]
}
  • Recovery. A view of index shard recoveries, both ongoing and previously completed.
    A recovery event occurs anytime an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local store recovery and is the normal way for shards to be loaded from disk when a node starts up.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/recovery?v
index   shard time type  stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
twitter 0     13ms store done  n/a         n/a         node0       node-0      n/a        n/a      0     0               100%          13          0     0               100%          9928        0            0                      100.0%
  • Thread pool. If Elasticsearch is silently dropping records due to bulk index rejections, bulk.rejected will not be zero.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
host        ip          bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
192.168.1.100 192.168.1.100           0          0             3            0           0              0             0            0               0 
192.168.1.101 192.168.1.101           0          0             0            0           0              0             0            0               0 
192.168.1.102 192.168.1.102           0          0             0            0           0              0             0            0               0 
  • In 3.11 with Elasticsearch 5, the following command may be required instead:
$ oc exec logging-es-data-master-j94ke80s-1-vsdrj -- es_util --query=_cat/thread_pool

As per the Elasticsearch docs, the rightmost three columns show the active, queue, and rejected statistics for each thread pool.

Troubleshoot unassigned shards

Troubleshooting and restoring unassigned shards requires some deep knowledge about how Lucene and Elasticsearch work, but here are some error/solution approaches for the most common cases.

  • For a quick explanation of the status and cause of unassigned shards, use the explain API:
$ oc exec  $es_pod  -- es_util --query=_cluster/allocation/explain?pretty
  • For a more hands-on investigation, the first thing to do is to gather the real status of the indices and shards. Refer to the "unassigned shards" query above to identify them:
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node | grep UNASSIGNED
.operations.2017.05.28                                                                    0 p UNASSIGNED CLUSTER_RECOVERED

The reason column tells why the shards are not assigned, but it is not always easy to know what to do to recover them, or whether it is even possible.

Reason: NODE_LEFT

This can mean that at some point the cluster had an extra node and this shard was assigned to it. Possible causes:

  • Elasticsearch pod scaled to more than 1 by mistake
  • The Elasticsearch cluster for some reason now has one node less
  • The persistent storage has been purged/removed/lost

This is solved by rerouting the shard to the "new" node (pay attention to the index name, the shard id and the target node):

$ oc rsh $es_pod
$ curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPOST 'https://localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
  "allocate" : {
  "index" : "project.test.8b5046f8-2d88-11e7-ae4a-005056aa59f3.2017.05.15", 
  "shard" : 0, 
  "node" : "10.10.10.101", 
  "allow_primary" : true
  }
}
]
}'

This can also be solved by changing the number of replicas to 0 and then back to the expected number.
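A sketch of the replicas-to-0-and-back approach; reset_replicas is a hypothetical helper built on the _settings call shown in the basic maintenance section:

```shell
# Hypothetical helper: drop an index to 0 replicas, then restore the
# expected count, forcing Elasticsearch to re-create the replica shards.
reset_replicas() {   # usage: reset_replicas <index_name> <expected_count>
  for n in 0 "$2"; do
    oc exec $es_pod -- curl -s \
      --key /etc/elasticsearch/secret/admin-key \
      --cert /etc/elasticsearch/secret/admin-cert \
      --cacert /etc/elasticsearch/secret/admin-ca \
      -XPUT "https://localhost:9200/$1/_settings" \
      -d "{ \"index\" : { \"number_of_replicas\" : $n } }"
  done
}
# Example: reset_replicas project.test.8b5046f8-2d88-11e7-ae4a-005056aa59f3.2017.05.15 1
```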

For further steps on rerouting shards, see this related solution

Reason: CLUSTER_RECOVERED

This indicates that while Elasticsearch has recovered and is running, some shards exist that have not been assigned to a node. This frequently occurs for indices that cover periods of time when Elasticsearch was not running, or indices which have been pruned of data (due to age or some other reason) but were not removed from the ES metadata. This can also be caused if data is missing from the persistent storage.

Indices that are not expected to exist anymore (or at all) can simply be deleted.
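Pruning can be scripted around the DELETE call from the basic maintenance section. In this sketch, older_than and delete_old_indices are hypothetical helpers; the date filter relies on index names ending in a zero-padded YYYY.MM.DD, so plain string comparison is enough:

```shell
# Print index names (read on stdin) whose trailing YYYY.MM.DD date is older
# than the given cutoff.
older_than() {   # usage: ... | older_than YYYY.MM.DD
  awk -v c="$1" 'match($1, /[0-9]{4}\.[0-9]{2}\.[0-9]{2}$/) {
    if (substr($1, RSTART) < c) print $1 }'
}

# Delete every index older than the cutoff (same curl flags as above).
delete_old_indices() {   # usage: delete_old_indices YYYY.MM.DD
  oc exec $es_pod -- curl -s \
    --key /etc/elasticsearch/secret/admin-key \
    --cert /etc/elasticsearch/secret/admin-cert \
    --cacert /etc/elasticsearch/secret/admin-ca \
    "https://localhost:9200/_cat/indices?h=index" |
  older_than "$1" |
  while read -r idx; do
    oc exec $es_pod -- curl -s \
      --key /etc/elasticsearch/secret/admin-key \
      --cert /etc/elasticsearch/secret/admin-cert \
      --cacert /etc/elasticsearch/secret/admin-ca \
      -XDELETE "https://localhost:9200/$idx"
  done
}
# Example: delete_old_indices 2017.08.01
```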

Other UNASSIGNED reasons

A shard can be in an unassigned status because:

  • The data is corrupted
  • The data is not there
  • The shard metadata is corrupted or missing
    For the first two cases, if there is no other shard with the data, there is little that can be done and the shard must be deleted. To check whether the data is there and intact, access the persistent storage (data can be under 0 or 1, for reasons not yet covered by this guide):
$ oc exec $es_pod -- ls -lR /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index
project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03:
total 0
drwxr-xr-x. 5 1000 root 49 Oct 18 11:17 0
drwxr-xr-x. 2 1000 root 25 Oct 18 11:17 _state

project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03/0:
total 0
drwxr-xr-x. 2 1000 root  25 Oct 18 11:17 _state
drwxr-xr-x. 2 1000 root 206 Oct 18 11:22 index
drwxr-xr-x. 2 1000 root  49 Oct 18 11:17 translog

project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03/0/_state:
total 4
-rw-r--r--. 1 1000 root 126 Oct 18 11:17 state-25.st

project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03/0/index:
total 124
-rw-r--r--. 1 1000 root   363 Oct  3 07:28 _0.cfe
-rw-r--r--. 1 1000 root 19564 Oct  3 07:28 _0.cfs
-rw-r--r--. 1 1000 root   374 Oct  3 07:28 _0.si
-rw-r--r--. 1 1000 root   363 Oct  3 07:28 _1.cfe
-rw-r--r--. 1 1000 root 19904 Oct  3 07:28 _1.cfs
-rw-r--r--. 1 1000 root   374 Oct  3 07:28 _1.si
-rw-r--r--. 1 1000 root   363 Oct  3 07:28 _2.cfe
-rw-r--r--. 1 1000 root 13780 Oct  3 07:28 _2.cfs
-rw-r--r--. 1 1000 root   374 Oct  3 07:28 _2.si
-rw-r--r--. 1 1000 root   363 Oct  3 07:28 _3.cfe
-rw-r--r--. 1 1000 root 29226 Oct  3 07:28 _3.cfs
-rw-r--r--. 1 1000 root   374 Oct  3 07:28 _3.si
-rw-r--r--. 1 1000 root   410 Oct 18 11:22 segments_o
-rw-r--r--. 1 1000 root     0 Oct  3 07:28 write.lock

project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03/0/translog:
total 8
-rw-r--r--. 1 1000 root 43 Oct 18 11:17 translog-1.tlog
-rw-r--r--. 1 1000 root 20 Oct 18 11:17 translog.ckp

project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03/_state:
total 4
-rw-r--r--. 1 1000 root 2398 Oct 18 11:17 state-21.st

In this case, files are listed under this path: the Lucene index ($index/0/index) and the translog ($index/0/translog). If there were no such data, the index would most probably have to be deleted.

Now let's confirm the health of the index:

 $ index=project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03
 $ SHARD_PATH=/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/$index/0/index/
 $ java -cp lib:/usr/share/java/elasticsearch/lib/lucene-core-5.5.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $SHARD_PATH
Segments file=segments_o numSegments=4 version=5.5.2 id=eotlkti4ft2r8tn0f1s0olgay format= userData={sync_id=AV8vOK7AWmARNjQAjneo, translog_generation=1, translog_uuid=-3GVLLRxQk2qELxESctRBQ}
  1 of 4: name=_0 maxDoc=14
    version=5.5.2
    id=9c92b9qnr9xviphjsif15llfm
    codec=Lucene54
    compound=true
    numFiles=3
    size (MB)=0.019
    diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014508694}
    no deletions
    test: open reader.........OK [took 0.102 sec]
    test: check integrity.....OK [took 0.002 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [32 fields] [took 0.001 sec]
    test: field norms.........OK [6 fields] [took 0.001 sec]
    test: terms, freq, prox...OK [276 terms; 1670 terms/docs pairs; 956 tokens] [took 0.036 sec]
    test: stored fields.......OK [28 total field count; avg 2.0 fields per doc] [took 0.015 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.018 sec]

  2 of 4: name=_1 maxDoc=15
    version=5.5.2
    id=9c92b9qnr9xviphjsif15llfo
    codec=Lucene54
    compound=true
    numFiles=3
    size (MB)=0.02
    diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014513710}
    no deletions
    test: open reader.........OK [took 0.007 sec]
    test: check integrity.....OK [took 0.000 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [32 fields] [took 0.000 sec]
    test: field norms.........OK [6 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [270 terms; 1807 terms/docs pairs; 1058 tokens] [took 0.032 sec]
    test: stored fields.......OK [30 total field count; avg 2.0 fields per doc] [took 0.001 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]

  3 of 4: name=_2 maxDoc=3
    version=5.5.2
    id=9c92b9qnr9xviphjsif15llfr
    codec=Lucene54
    compound=true
    numFiles=3
    size (MB)=0.014
    diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014518731}
    no deletions
    test: open reader.........OK [took 0.018 sec]
    test: check integrity.....OK [took 0.001 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [32 fields] [took 0.000 sec]
    test: field norms.........OK [6 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [142 terms; 362 terms/docs pairs; 210 tokens] [took 0.010 sec]
    test: stored fields.......OK [6 total field count; avg 2.0 fields per doc] [took 0.000 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.004 sec]

  4 of 4: name=_3 maxDoc=35
    version=5.5.2
    id=9c92b9qnr9xviphjsif15llft
    codec=Lucene54
    compound=true
    numFiles=3
    size (MB)=0.029
    diagnostics = {java.runtime.version=1.8.0_141-b16, java.vendor=Oracle Corporation, java.version=1.8.0_141, java.vm.version=25.141-b16, lucene.version=5.5.2, os=Linux, os.arch=amd64, os.version=3.10.0-693.el7.x86_64, source=flush, timestamp=1507014523755}
    no deletions
    test: open reader.........OK [took 0.013 sec]
    test: check integrity.....OK [took 0.002 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [32 fields] [took 0.000 sec]
    test: field norms.........OK [6 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [443 terms; 4305 terms/docs pairs; 2530 tokens] [took 0.031 sec]
    test: stored fields.......OK [70 total field count; avg 2.0 fields per doc] [took 0.004 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [23 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 3 SORTED_NUMERIC; 19 SORTED_SET] [took 0.011 sec]

No problems were detected with this index.

Took 0.441 sec total.

As no problems were detected, the issue should only be related to Elasticsearch metadata, so we can just reroute the shard to the same node:

$ curl -XPOST 'https://localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
  "allocate" : {
  "index" : "project.jenkinscfg.92e04a1e-827b-11e7-8a7b-525400f41652.2017.10.03", 
  "shard" : 0, 
  "node" : "10.128.2.127", 
  "allow_primary" : true
  }
}
]
}'

For further information on rerouting shards, see this related solution

Elasticsearch is out of disk

Elasticsearch has a so-called disk watermark configured: a disk capacity threshold which, once reached, prevents the node from storing new replicas.
A log line similar to this can be seen:

[2017-05-22 08:28:37,222][INFO ][cluster.routing.allocation.decider] [Agron] low disk watermark [85%] exceeded on [HApfFHgJR36Fd2pRFd1iWQ][Agron][/elasticsearch/persistent/logging-es/data/logging-es/nodes/0] free: 36.3gb[14.5%], replicas will not be assigned to this node

The consequences are:

  • Fluentd cannot send logs (for new indices)
  • Kibana cannot create user data for a new user
  • SearchGuard cannot be initialized

One way to rule out this problem is to check the current value of the disk watermark:

  • In the configuration file. The values shown are the defaults, meaning that if they are not explicitly set, these values will be used.
  elasticsearch.yml: |
    cluster:
      name: ${CLUSTER_NAME}
      routing.allocation.disk.threshold_enabled: true
      routing.allocation.disk.watermark.low: 85%
      routing.allocation.disk.watermark.high: 90%
  • Using the API. If nothing is shown here, the defaults or the values set in the configuration file apply.
$ oc exec $es_pod -- curl -s --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca https://localhost:9200/_cluster/settings?pretty
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "disk" : {
            "watermark" : {
              "low" : "65%"
            }
          }
        }
      }
    }
  }
}
  • Check the available space:
$ for pod in `oc get po -l component=es -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc exec $pod -- df -h /elasticsearch/persistent; done
logging-es-qzw8xmt0-16-4drmg
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        10G  6.7G  3.4G  67% /elasticsearch/persistent
logging-es-smeuexjr-16-vc9cv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        10G  7.7G  2.4G  77% /elasticsearch/persistent

If persistent storage usage is over the threshold, some space must be freed. Before doing so, pods need to be scaled down; this can be done pod by pod.
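The df check above can be scripted. In this sketch, over_watermark and check_es_disk are hypothetical helpers; 85 is the default low watermark shown in the configuration above:

```shell
# Hypothetical check: report ES pods whose data volume is at or above the
# given watermark percentage.
over_watermark() {   # usage: over_watermark <threshold> <use%>, e.g. over_watermark 85 "77%"
  [ "${2%\%}" -ge "$1" ]
}

check_es_disk() {   # usage: check_es_disk 85
  for pod in $(oc get po -l component=es -o jsonpath='{.items[*].metadata.name}'); do
    use=$(oc exec "$pod" -- df --output=pcent /elasticsearch/persistent | tail -n1 | tr -d ' ')
    over_watermark "$1" "$use" && echo "$pod is over the ${1}% watermark ($use used)"
  done
}
```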

Debugging Fluentd

With Fluentd running at log_level=trace, it is possible to see things like:

  • Full configuration file being loaded.
2017-11-22 06:20:03 -0500 [info]: using configuration file: <ROOT>
  <system>
    log_level trace
  </system>
...
</ROOT>
  • Connection status with Elasticsearch:
2017-11-22 06:20:05 -0500 [info]: Connection opened to Elasticsearch cluster => {:host=>"logging-es", :port=>9200, :scheme=>"https", :user=>"fluentd", :password=>"obfuscated"}
  • Plugins being loaded and their versions:
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-elasticsearch' version '1.9.2'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-flatten-hash' version '0.4.0'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '0.26.2'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.5.5'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-secure-forward' version '0.4.3'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-systemd' version '0.0.6'
2017-11-22 06:19:53 -0500 [info]: gem 'fluent-plugin-viaq_data_model' version '0.0.3'
2017-11-22 06:19:53 -0500 [info]: gem 'fluentd' version '0.12.39'
...
2017-11-22 06:20:02 -0500 [trace]: registered output plugin 'elasticsearch'
2017-11-22 06:20:02 -0500 [trace]: registered output plugin 'elasticsearch_dynamic'

By default, Fluentd is configured to log only warnings and errors, as can be seen in the system.conf file:

$ oc exec <fluentd-pod> -- cat /etc/fluent/configs.d/openshift/system.conf 
<system>
  log_level warn
</system>

To change this configuration without having to build a new Docker image, it is enough to edit the logging-fluentd ConfigMap:

$ oc edit configmap/logging-fluentd

Comment out the include and add the desired log_level:

...
# @include configs.d/openshift/system.conf
<system>
  log_level trace
</system>
...

Then delete the Fluentd pods so they are recreated with the new configuration:

$ oc delete pods -l component=fluentd

fluentd in CrashLoopBackoff status

If Fluentd is in CrashLoopBackoff status, check this article.

Debugging Kibana

As the Kibana pod has two containers (kibana and kibana-proxy), it is possible to configure their verbosity separately.

  • Kibana Proxy: Setting kibana-proxy container to debug will show information like:
    • Memory configuration
    • TLS configuration and certificates being used

The kibana-proxy container uses an environment variable called OAP_DEBUG. This variable takes its value from the Ansible variable (starting in 3.5) openshift_logging_kibana_proxy_debug.
To make the value persist across upgrades or playbook executions, set the Ansible variable; otherwise, setting the value manually is enough:

$ oc set env deploymentconfig/logging-kibana OAP_DEBUG=true

After that, a new deployment of the logging-kibana pod will take place and debug logs will start being shown:

$ oc logs -c kibana-proxy <logging-kibana-pod>
+ BYTES_PER_MEG=1048576
+ BYTES_PER_GIG=1073741824
+ DEFAULT_MIN=268435456
+ echo 100663296
+ grep -qE '^([[:digit:]]+)([GgMm])?i?$'
50331648 is less than the default 256m.  Setting to default.
Using NODE_OPTIONS: '--max_old_space_size=256' Memory setting is in MB
...
  • Kibana: Changing the log verbosity of the kibana container is not currently possible without extending the logging-kibana image. Verbosity is configured by changing the value of the following variable in the /etc/kibana/kibana.yml file:
# Set this to true to log all events, including system usage information and all requests.
# logging.verbose: false

Debugging Elasticsearch

Elasticsearch logging is configured through a file defined in the logging-elasticsearch ConfigMap:

$ oc get configmap/logging-elasticsearch -o yaml | grep logging.yml -A6
  logging.yml: |
    # you can override this using by setting a system property, for example -Des.logger.level=DEBUG
    es.logger.level: INFO
    rootLogger: ${es.logger.level}, file
    logger:
      # log action execution errors for easier debugging
      action: WARN

Bear in mind that Elasticsearch is configured to write to several log files in addition to standard output. These files are found under /elasticsearch/logging-es/logs/ or /elasticsearch/persistent/logging-es/logs/ inside the pods. In the file names below, {cluster-name} will be either logging-es or logging-es-ops, depending on the pod:

- {cluster-name}_deprecation.log --> Logging of deprecated actions that should be migrated in the future
- {cluster-name}_index_indexing_slowlog.log --> Logs information about slow indexing
- {cluster-name}_index_search_slowlog.log --> Logs slow searches
- {cluster-name}.log --> Application log
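As an illustration of how these files can be inspected, the snippet below extracts query durations from a search slowlog. The sample line is fabricated to match the Elasticsearch 2.x slowlog format and is not taken from a real cluster:

```shell
# Create an illustrative search-slowlog sample (not real cluster output).
cat > sample_slowlog.log <<'EOF'
[2017-11-22 12:30:01,123][WARN ][index.search.slowlog.query] [Att-Lass] [.operations.2017.11.13][0] took[12.3s], took_millis[12300], search_type[QUERY_THEN_FETCH], total_shards[1], source[{"query":{"match_all":{}}}]
EOF

# Pull out the recorded query durations in milliseconds.
grep -o 'took_millis\[[0-9]*\]' sample_slowlog.log
```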

With that said, it is possible to change the application log level of Elasticsearch by updating the following property in the configmap:

 es.logger.level: DEBUG
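The change can be made interactively with oc edit configmap/logging-elasticsearch, or scripted. The sketch below performs the substitution on a local copy of logging.yml; against a live cluster, the edited file would go back into the configmap:

```shell
# Recreate the relevant fragment of logging.yml locally (sketch only;
# on a live cluster, edit the logging-elasticsearch configmap instead).
cat > logging.yml <<'EOF'
es.logger.level: INFO
rootLogger: ${es.logger.level}, file
logger:
  action: WARN
EOF

# Raise the root log level from INFO to DEBUG.
sed -i 's/^es.logger.level: INFO$/es.logger.level: DEBUG/' logging.yml
grep '^es.logger.level' logging.yml
```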

In order for the pods to start using the new value, the deploymentconfigs need to be scaled down and back up (see Basic maintenance operations: Restart the Elasticsearch cluster). Examples of information logged at this level include:

  • Network setup (interfaces)
  • Cluster communication
  • Shards state
  • Indices detailed information
[2017-11-22 12:28:45,603][DEBUG][common.netty             ] using gathering [true]
[2017-11-22 12:28:45,642][DEBUG][discovery.zen.elect      ] [Att-Lass] using minimum_master_nodes [2]
[2017-11-22 12:28:45,644][DEBUG][discovery.zen.ping.unicast] [Att-Lass] using initial hosts [127.0.0.1, [::1]], with concurrent_connects [10]
[2017-11-22 12:28:45,668][DEBUG][discovery.zen.fd         ] [Att-Lass] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2017-11-22 12:28:45,670][DEBUG][discovery.zen.fd         ] [Att-Lass] [node  ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
...
[2017-11-22 12:28:57,055][DEBUG][gateway                  ] [Att-Lass] [.searchguard.logging-es-qzw8xmt0-12-3k47w][0] shard state info found: [version [110], primary [true]]
[2017-11-22 12:29:01,946][DEBUG][indices.cluster          ] [Att-Lass] [.operations.2017.11.13] creating index
[2017-11-22 12:29:01,946][DEBUG][indices                  ] [Att-Lass] creating Index [.operations.2017.11.13], shards [1]/[0]
[2017-11-22 12:29:01,965][DEBUG][index.store              ] [Att-Lass] [.operations.2017.11.13] using index.store.throttle.type [none], with index.store.throttle.max_bytes_per_sec [0b]
[2017-11-22 12:29:01,968][DEBUG][index.mapper             ] [Att-Lass] [.operations.2017.11.13] using dynamic[true]
...

SearchGuard

  • SearchGuard is not initialized due to UnavailableShardsException
Will update 'config' with /opt/app-root/src/sgconfig/sg_config.yml
   FAIL: Configuration for 'config' failed because of UnavailableShardsException[[.searchguard.logging-es-data-master-76fy2ctp][0] Not enough active copies to meet write consistenc                            2). Timeout: [1m], request: [index {[.searchguard.logging-es-data-master-76fy2ctp][config][0], source[{"searchguard":{"dynamic":{"http":{"xff":{"enabled":true,"remoteIpHeader":"x-                           s":".*","internalProxies":".*"}},"authc":{"authentication_domain_proxy":{"enabled":true,"order":0,"http_authenticator":{"challenge":false,"type":"proxy","config":{"user_header":"x-                           ication_backend":{"type":"noop"}},"authentication_domain_basic_internal":{"enabled":true,"order":1,"http_authenticator":{"type":"clientcert","challenge":false},"authentication_back

This is possibly caused by a lack of available space on one or more of the Elasticsearch nodes. Refer to Elasticsearch is out of disk
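Elasticsearch stops allocating new shards to a node once disk usage crosses its low watermark (85% by default in this version). A hedged sketch of checking a mount against that threshold; inside a logging-es pod, the data path would be /elasticsearch/persistent rather than the current directory:

```shell
# Sketch: compare disk usage of a mount against Elasticsearch's default
# low watermark (85%). The mount point '.' is a stand-in; inside a
# logging-es pod the data lives under /elasticsearch/persistent.
threshold=85
used_pct=$(df --output=pcent . | tail -n 1 | tr -dc '0-9')
if [ "$used_pct" -ge "$threshold" ]; then
    echo "WARNING: disk usage ${used_pct}% is at or above the ${threshold}% low watermark"
else
    echo "OK: disk usage ${used_pct}% is below the ${threshold}% low watermark"
fi
```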

Kibana related KCS

Fluentd related KCS

Elasticsearch related KCS

Customizing log destinations

Aggregated logging dump tool

There is a script that gathers most of the information relevant for troubleshooting common logging problems. The script can be found in the Aggregated Logging Utilities repository.

An example of use would be:

//OpenShift 3.x
$ wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/release-3.11/hack/logging-dump.sh

$ chmod +x logging-dump.sh
$ oc login -u admin -p <password> https://openshift.example.com:8443
$ ./logging-dump.sh