How to troubleshoot the message "Elasticsearch errors returned, retrying"


Intro

Troubleshooting the OpenShift Logging stack (EFK) can involve a number of different tactics. Frequently, the fluentd logs contain very large messages starting with the line "Elasticsearch errors returned, retrying", which can be hard to parse. This article walks through parsing these messages to find the root cause.

For more general EFK troubleshooting, see the overview troubleshooting articles.

Entering Fluentd debug mode

To capture the relevant log rejection messages, it is best to set fluentd to debug mode.

First, edit the configmap:

# oc edit cm logging-fluentd

Comment out the following line and add:

    #@include configs.d/openshift/system.conf
    <system>
      log_level debug
    </system>

Restart the fluentd pods:

# oc delete pod -l logging-infra=fluentd

Gathering and formatting fluentd log

When fluentd fails to send logs to Elasticsearch, it will often print a long message like:

2019-09-05 19:06:51 +0800 [debug]: Elasticsearch errors returned, retrying:  {"took"=>10096, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2018.08.17", "_type"=>"com.redhat.viaq.common", "_id"=>"M2NlOGRjYzctMTdmMS00OTYxLWJj
MjMtN2M5MjYzN2U5OTky", "status"=>409, "error"=>{"type"=>"document_already_exists_exception", "reason"=>"[com.redhat.viaq.common][M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky]: document already exists", "shard"=>"0", "index"=>".operati
ons.2018.08.17"}}}, {"create"=>{"_index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17", "_type"=>"com.redhat.viaq.common", "_id"=>"NzA4ZGZhYTItZGExYS00NTBhLWE4NTQtZTI2OWIzZjVmMDkx", "status"=>409, "error"=>{"type"=>"d
ocument_already_exists_exception", "reason"=>"[com.redhat.viaq.common][NzA4ZGZhYTItZGExYS00NTBhLWE4NTQtZTI2OWIzZjVmMDkx]: document already exists", "shard"=>"2", "index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17"}}
}, ....
...... "reason"=>"[com.redhat.viaq.common][ZTQ5NGVlN2MtMjllMy00Zjg3LWI5OTAtMjhjMTg0ZjRhMzI4]: document already exists", "shard"=>"1", "index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17"}}}]}

Note that these lines are very long; in this example, the line is over 20,000 characters.

When fluentd prints a very long line like this, starting with "Elasticsearch errors returned", it is dumping the full buffered chunk that it tried to send to Elasticsearch. In later versions of the EFK stack, fluentd does not resend the messages that succeeded and can move on to new log messages. In older versions, fluentd will continuously try to resend the whole chunk.

Save just this line to a file. For example, you could get just the first instance of the error:

# grep "Elasticsearch errors returned" fluentd-example.log | head -n1 > file.log

You can make this message easier to work with by deleting everything before the first { character. For example, in vim, delete the leading characters so that the line starts with {"took"=> ..... Then save the file.
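If you prefer to script this step, sed can delete the prefix for you. This is a sketch: the sample line below is a hypothetical, truncated stand-in for the real debug line, and file.log stands for whatever file you saved the line to.

```shell
# Hypothetical, truncated sample of a fluentd debug line (for illustration only)
printf '%s\n' '2019-09-05 19:06:51 +0800 [debug]: Elasticsearch errors returned, retrying:  {"took"=>10096, "errors"=>true}' > file.log

# Delete everything before the first '{' so the line starts with the payload
sed -i 's/^[^{]*//' file.log
```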

Change the => characters to : using:

# sed -i 's/=>/:/g' file.log

This turns the log into JSON! Now we can pretty-print it:

# cat file.log | jq . | less
{
  "took": 10096,
  "errors": true,
  "items": [
    {
      "create": {
        "_index": ".operations.2018.08.17",
        "_type": "com.redhat.viaq.common",
        "_id": "M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky",
        "status": 409,
        "error": {
          "type": "document_already_exists_exception",
          "reason": "[com.redhat.viaq.common][M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky]: document already exists",
          "shard": "0",
          "index": ".operations.2018.08.17"
        }
      }
    },
. . .

Understanding the logs

Every message in the chunk has an entry under "items". Most of them will have status 409, which means Elasticsearch already received that document; since this is probably a retransmission, Elasticsearch is telling fluentd that it already has the message. The problem is that, in older versions of the EFK stack, if a single log line is corrupt or problematic, fluentd retries the whole chunk, which is why so many entries show status 409.
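To confirm that most entries really are duplicates, you can tally the status codes with jq. This is a sketch: sample.log below is hypothetical stand-in data; run the jq command against your own converted file instead.

```shell
# Hypothetical miniature of the converted bulk-response payload
cat > sample.log <<'EOF'
{"took": 10, "errors": true, "items": [
  {"create": {"_id": "a", "status": 409}},
  {"create": {"_id": "b", "status": 409}},
  {"create": {"_id": "c", "status": 400}}
]}
EOF

# Count how many items returned each status code
jq -r '.items[].create.status' sample.log | sort | uniq -c
```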

The actual root cause we want to look for is any line with a status other than 409.

# cat file.log | jq . | grep status | grep -v 409
        "status": 400,

In this example, there is one problematic log entry, with status 400. Search for it:

# cat file.log | jq . | less

Use the / key to search for the 400 record:

    {
      "create": {
        "_index": "project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17",
        "_type": "com.redhat.viaq.common",
        "_id": "YTIzNzRmOTEtNDNkZi00Zjc1LWFhY2YtYmVjN2U5ZWJhNDIx",
        "status": 400,
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "failed to parse [prevMsg]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "unknown property [bytesParsed]"
          }
        }
      }
    },
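Instead of paging through less, jq can extract the failing entries directly. This is a sketch: sample.log is hypothetical stand-in data; substitute your own converted file.

```shell
# Hypothetical miniature of the converted bulk-response payload
cat > sample.log <<'EOF'
{"took": 10, "errors": true, "items": [
  {"create": {"_id": "a", "status": 409}},
  {"create": {"_id": "b", "status": 400,
   "error": {"type": "mapper_parsing_exception", "reason": "failed to parse [prevMsg]"}}}
]}
EOF

# Print only the items whose status is something other than 409
jq '.items[] | select(.create.status != 409)' sample.log
```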

This is the log entry causing the problem. If it comes from a user application, it is worth investigating the "reason" field to find out why the logs are not being accepted.

Issues with MERGE_JSON_LOG

Commonly, this is caused by MERGE_JSON_LOG, which can lead to key:value type conflicts if the logs are not formatted consistently. For example, suppose your application logs JSON-formatted messages and one message is:

{"name": "employee1", "job": "engineer"}

And another message:

{"name": "employee2", "job": 12345}

Elasticsearch reads the first message, sees that job is the key for a string value, and maps job as a string field. When it gets the second message, it sees an integer, which is not a string, so indexing fails.
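You can check an application's output for this kind of type conflict before it ever reaches Elasticsearch. This is a sketch using jq; app.log and the job field are the hypothetical examples from above.

```shell
# The two hypothetical messages from the example above, one JSON object per line
cat > app.log <<'EOF'
{"name": "employee1", "job": "engineer"}
{"name": "employee2", "job": 12345}
EOF

# Report each distinct JSON type seen for the "job" field; more than one
# line of output means Elasticsearch will reject some of these documents
jq -r '.job | type' app.log | sort -u
```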

That can be addressed at the application level. However, if modifying the app is not possible, or if untrusted users on the cluster could trigger this issue, the administrator-level solution is to disable MERGE_JSON_LOG.

There have been a few known issues on MERGE_JSON_LOG. For example:

  • Constantly changing keys in the key:value pairs, causing an excessive number of mapping updates.
  • Conflicts in key types, i.e. string, int, and other types all used with the same key.

Disabling MERGE_JSON_LOG requires different steps depending on the version.

  • 3.9+
    In later versions, this is set using an environment variable, and it now defaults to false:
  # oc set env ds/logging-fluentd MERGE_JSON_LOG=false --overwrite=true

  • 3.7

See this related solution. Note that it is a workaround and might require some extra effort when upgrading OCP.
