How to troubleshoot the message "Elasticsearch errors returned, retrying"
Intro
Troubleshooting the OpenShift Logging stack (EFK) can involve a number of different tactics. Frequently, the fluentd logs contain very large messages starting with the line "Elasticsearch errors returned, retrying", which can be hard to parse. This article walks through parsing these messages to find the root cause.
For more general EFK troubleshooting, see the overview troubleshooting articles.
Entering Fluentd debug mode
To capture the relevant log rejection message, it is best to put fluentd into debug mode.
First, edit the configmap:
# oc edit cm logging-fluentd
Comment out the following line and add:
#@include configs.d/openshift/system.conf
<system>
log_level debug
</system>
Restart the fluentd pods:
# oc delete pod -l logging-infra=fluentd
Gathering and formatting the fluentd log
When fluentd fails to send logs to Elasticsearch, it will often print a long message like:
2019-09-05 19:06:51 +0800 [debug]: Elasticsearch errors returned, retrying: {"took"=>10096, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2018.08.17", "_type"=>"com.redhat.viaq.common", "_id"=>"M2NlOGRjYzctMTdmMS00OTYxLWJj
MjMtN2M5MjYzN2U5OTky", "status"=>409, "error"=>{"type"=>"document_already_exists_exception", "reason"=>"[com.redhat.viaq.common][M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky]: document already exists", "shard"=>"0", "index"=>".operati
ons.2018.08.17"}}}, {"create"=>{"_index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17", "_type"=>"com.redhat.viaq.common", "_id"=>"NzA4ZGZhYTItZGExYS00NTBhLWE4NTQtZTI2OWIzZjVmMDkx", "status"=>409, "error"=>{"type"=>"d
ocument_already_exists_exception", "reason"=>"[com.redhat.viaq.common][NzA4ZGZhYTItZGExYS00NTBhLWE4NTQtZTI2OWIzZjVmMDkx]: document already exists", "shard"=>"2", "index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17"}}
}, ....
...... "reason"=>"[com.redhat.viaq.common][ZTQ5NGVlN2MtMjllMy00Zjg3LWI5OTAtMjhjMTg0ZjRhMzI4]: document already exists", "shard"=>"1", "index"=>"project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17"}}}]}
Note that these are VERY long lines. In this example, the line is actually 20k+ characters.
When you see very long lines like this in fluentd, starting with "Elasticsearch errors returned", fluentd is printing the full buffered chunk that it tried to send to Elasticsearch. This line shows us exactly what fluentd tried to send to ES. In later versions of the EFK stack, fluentd does not retry the messages that succeeded and is able to move on to new log messages. In older versions of the EFK stack, fluentd will continuously try to resend this chunk.
Save just this line to a file. For example, you could get just the first instance of the error:
# cat fluentd-example.log | grep "Elasticsearch errors returned" | head -n1 > file.log
You can make this message easier to work with by deleting everything before the first { character. For example, in vim, delete the leading characters so that the line starts with {"took"=> ..... Then save the file.
Change the => characters to : using:
# sed -i 's/=>/:/g' file.log
This turns the log into JSON format. Now we can pretty-print it:
# cat file.log | jq . | less
{
  "took": 10096,
  "errors": true,
  "items": [
    {
      "create": {
        "_index": ".operations.2018.08.17",
        "_type": "com.redhat.viaq.common",
        "_id": "M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky",
        "status": 409,
        "error": {
          "type": "document_already_exists_exception",
          "reason": "[com.redhat.viaq.common][M2NlOGRjYzctMTdmMS00OTYxLWJjMjMtN2M5MjYzN2U5OTky]: document already exists",
          "shard": "0",
          "index": ".operations.2018.08.17"
        }
      }
    },
    . . .
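As a shortcut, the manual steps above (grep for the error line, strip the log prefix, convert => to :, pretty-print) can be combined into one pipeline. This is a sketch: it assumes no "=>" sequence appears inside a quoted string value of the chunk, which holds for typical fluentd buffer dumps like the one above.

```shell
# first matching error line -> strip everything before the first '{'
# -> convert Ruby-hash "=>" to JSON ":" -> pretty-print with jq
grep -m1 'Elasticsearch errors returned' fluentd-example.log \
  | sed -e 's/^[^{]*//' -e 's/=>/:/g' \
  | jq .
```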
Understanding the logs
Every message in the chunk has an entry. Most of them will be status 409. This means that Elasticsearch already received them successfully; since this is probably a retransmission, ES is telling fluentd that it already has that log message. The problem is that in older versions of the EFK stack, if a single log line is corrupt or problematic, fluentd still tries to send the whole set again. That is why so many entries show "status 409".
The actual root cause we want to look for is any line with a status other than 409.
# cat file.log | jq . | grep status | grep -v 409
"status": 400,
In this example, we have 1 problematic log, with status 400. So we can search for it:
# cat file.log | jq . | less
Use the / key to search for the 400 record:
{
  "create": {
    "_index": "project.logging.xxxxxxxx-xxxx-xxxx-bbc7-005056aeb63e.2018.08.17",
    "_type": "com.redhat.viaq.common",
    "_id": "YTIzNzRmOTEtNDNkZi00Zjc1LWFhY2YtYmVjN2U5ZWJhNDIx",
    "status": 400,
    "error": {
      "type": "mapper_parsing_exception",
      "reason": "failed to parse [prevMsg]",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "unknown property [bytesParsed]"
      }
    }
  }
},
This is the log message causing the problem. If it comes from a user application, it is worth investigating the "reason" field to find why the logs are not accepted.
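Instead of searching the pretty-printed output by hand, you can extract every non-409 item directly with a jq filter. A sketch, assuming the converted chunk is in file.log as above:

```shell
# print only the bulk items that failed for a reason other than "already exists"
jq '.items[] | select(.create.status != 409)' file.log
```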
Issues with MERGE_JSON_LOG
Commonly, this is caused by using MERGE_JSON_LOG, which can cause key:value type conflicts if the logs are not formatted consistently. For example, suppose your application logs JSON-formatted messages, and one message is:
{"name": "employee1", "job": "engineer"}
And another message:
{"name": "employee2", "job": 12345}
Elasticsearch reads the first message, sees that job holds a string value, and maps job as a string field. When it gets the second message, it sees an integer, which does not match the string mapping, so indexing fails.
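You can see the type conflict directly with jq's type filter, which reports the JSON type of a field in each message:

```shell
# the same field name carries two different JSON types across messages
printf '%s\n' '{"name": "employee1", "job": "engineer"}' \
              '{"name": "employee2", "job": 12345}' \
  | jq -r '.job | type'
# prints "string" then "number" -- Elasticsearch cannot map both to one field type
```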
That can be addressed on the application level. However, if modifying the app is not possible, or if the cluster is at risk of untrusted users causing this issue, the administrator-level solution to solve this is to disable MERGE_JSON_LOG.
There have been a few known issues on MERGE_JSON_LOG. For example:
- Constantly changing keys in the key:value pairs, causing an excessive number of update mappings
- Conflicts in key types, i.e. between string, int, and other types all using the same key
Disabling MERGE_JSON_LOG requires different steps depending on the version.
- 3.9+
In later versions, this is set using an environment variable:
# oc set env ds/logging-fluentd MERGE_JSON_LOG=false --overwrite=true
We now set this to false by default.
- 3.7
See this related solution
This solution is a workaround and might require some extra effort when upgrading OCP.