SearchGuard Not Yet Initialized in OpenShift Logging Stack
Environment
- Red Hat OpenShift Container Platform 3.x
- EFK (Elasticsearch, Fluentd, Kibana) stack 3.4.1 and 3.5.1+
Issue
- Kibana reports "Unable to connect to elasticsearch at https://logging-es:9200"
- SearchGuard "Not yet initialized" messages in the logging-es pod logs:
[INFO ][index.store ] [Jens Meilleur Slap Shot] [project.example.bfc38003-c309-11e6-b098-0050568f5a25.2017.05.28][0] Failed to open / find files while reading metadata snapshot
[ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[WARN ][index.store ] [Jens Meilleur Slap Shot] [project.example.bfc38003-c309-11e6-b098-0050568f5a25.2017.05.26][0] failed to build store metadata. checking segment info integrity (with commit [no])
...
[ERROR][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [Trapper] Timeout java.util.concurrent.TimeoutException: Timeout after 30SECONDS while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups](index=.searchguard.logging-es-06coa0un-6-4znyf)
java.util.concurrent.TimeoutException: Timeout after 30SECONDS while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups](index=.searchguard.logging-es-06coa0un-6-4znyf)
...
[ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
- Messages like
Timeout after 30SECONDS
[2017-06-28 15:03:17,711][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized (you may need to run sgadmin)
- Messages like
Failure No shard available
java.util.concurrent.TimeoutException: Timeout after 30SECONDS while retrieving configuration for [roles, rolesmapping](index=.searchguard)
[ERROR][i.f.e.p.a.ConfigurationLoader] Failure No shard available for [org.elasticsearch.action.get.MultiGetShardRequest@4c256fc] retrieving configuration for [roles, rolesmapping] (index=.searchguard)
Resolution
- Before executing the commands below, set the OpenShift logging namespace:
# Before 3.10
export OC_LOGGING=logging
# since 3.10
export OC_LOGGING=openshift-logging
- Delete the SearchGuard indices belonging to the running Elasticsearch pods. Run the following loop against each Elasticsearch pod (use component=es-ops for the ops cluster):
for pod in `oc -n ${OC_LOGGING} get pods --selector=component=es -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'`
do
echo Deleting SG index for $pod
oc -n ${OC_LOGGING} exec $pod -- curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XDELETE https://localhost:9200/.searchguard.$pod
done
This deletes each pod's SearchGuard index, including stale indices left over from already-terminated pods, allowing the cluster to initialize properly. To list all SearchGuard indices:
# anypod=$(oc -n ${OC_LOGGING} get pods --selector=component=es -o jsonpath='{.items[0].metadata.name}')
# oc -n ${OC_LOGGING} exec $anypod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v | grep searchguard > sg_indices.out
- Scale down ES pods
- Scale up ES pods
- Delete FluentD pods
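The three restart steps above can be scripted. The following is a minimal sketch, assuming the standard component=es and component=fluentd labels and DeploymentConfig-based Elasticsearch deployments (verify the actual DC names with `oc get dc` in your cluster):

```shell
#!/bin/sh
# Sketch only: restart the ES cluster by scaling its DeploymentConfigs
# down and back up, then delete the Fluentd pods so the DaemonSet
# recreates them. Assumes component=es / component=fluentd labels.
command -v oc >/dev/null 2>&1 || { echo "oc not found; run on a host with cluster access"; exit 0; }
OC_LOGGING=${OC_LOGGING:-openshift-logging}

# Scale every ES DeploymentConfig to 0 (use component=es-ops for the ops cluster)
for dc in $(oc -n "${OC_LOGGING}" get dc --selector=component=es -o name); do
  oc -n "${OC_LOGGING}" scale "$dc" --replicas=0
done

# Wait until the ES pods have terminated before scaling back up
while oc -n "${OC_LOGGING}" get pods --selector=component=es 2>/dev/null | grep -q Running; do
  sleep 5
done

for dc in $(oc -n "${OC_LOGGING}" get dc --selector=component=es -o name); do
  oc -n "${OC_LOGGING}" scale "$dc" --replicas=1
done

# Delete the Fluentd pods; the DaemonSet recreates them automatically
oc -n "${OC_LOGGING}" delete pods --selector=component=fluentd
```

Scaling via the DeploymentConfigs (rather than deleting pods) keeps the replica counts declared in the cluster consistent with what is actually running.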
If the above does not work, restarting the SearchGuard initialization can help the ES cluster re-sync.
# oc -n openshift-logging rsh <elasticsearch pod>
# es_seed_acl
In older versions, the above command is not available. In that case, run the following in every Elasticsearch pod simultaneously:
# oc -n openshift-logging rsh <elasticsearch pod>
# /usr/share/java/elasticsearch/plugins/search-guard-2/tools/sgadmin.sh \
-cd ${HOME}/sgconfig \
-i .searchguard.${HOSTNAME} \
-ks /etc/elasticsearch/secret/searchguard.key \
-kst JKS \
-kspass kspass \
-ts /etc/elasticsearch/secret/searchguard.truststore \
-tst JKS \
-tspass tspass \
-nhnv \
-icl
- In version 3.9, sgadmin.sh is located at /usr/share/elasticsearch/plugins/openshift-elasticsearch/sgadmin.sh, so it may be necessary to run 'find' to locate it on your particular OCP version.
- It may be necessary to run this script multiple times.
- The details mentioned above also work in later versions.
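Running the re-initialization in every pod at once can also be scripted. The following is a minimal sketch for versions that ship the es_seed_acl helper, assuming the component=es label (adapt the exec'd command to the sgadmin.sh invocation above on older versions):

```shell
#!/bin/sh
# Sketch only: re-run the SearchGuard seeding in every ES pod in
# parallel, since the re-initialization should happen cluster-wide.
command -v oc >/dev/null 2>&1 || { echo "oc not found; run on a host with cluster access"; exit 0; }
OC_LOGGING=${OC_LOGGING:-openshift-logging}

for pod in $(oc -n "${OC_LOGGING}" get pods --selector=component=es \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
  # '&' backgrounds each exec so all pods re-seed simultaneously
  oc -n "${OC_LOGGING}" exec "$pod" -- es_seed_acl &
done
wait  # block until every background re-seed has finished
```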
Root Cause
The SearchGuard index is part of the infrastructure of the logging cluster; these indices get thrown away and rebuilt when the pod respawns, and do not contain any application data. Although Elasticsearch views this as blocking the cluster from starting up, in reality this is likely just an index that should have been deleted when the pod terminated, but was not.
This is being investigated for ways to prevent this desyncing.
Errors of type Timeout after 30SECONDS mean that the query took too long to complete. When the Kibana Discovery page loads, a query retrieving the user's information is run against the SearchGuard index. A timeout can have many causes, for example:
- Heavy load, for example due to data recovery
- Lack of resources (memory, CPU)
- Network problem
Diagnostic Steps
- Before executing the commands below, set the OpenShift logging namespace:
# Before 3.10
export OC_LOGGING=logging
# since 3.10
export OC_LOGGING=openshift-logging
- Run Elasticsearch cluster information queries:
# oc -n ${OC_LOGGING} exec <ES PODNAME> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v
# oc -n ${OC_LOGGING} exec <ES PODNAME> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
- Check for any red indices
green open .searchguard.logging-es-m50k78tj-6-9oydw 1 2 3 0 42kb 14kb
green open .kibana.968528fe24c0a4b52766850041d6b0722917f563 1 2 9 0 105.5kb 35.1kb
red open .searchguard.logging-es-06coa0un-6-7kc0b 1 2
green open project.management-infra.54281e56-1c5a-11e6-a509-0050568f4c2d.2017.06.05 1 2 53 0 342.3kb 114.1kb
green open project.example.678ce11f-c309-11e6-b098-0050568f5a25.2017.06.09 1 2 184 0 413.5kb 137.3kb
- Note the red index. If the red index is a SearchGuard index, proceed to the Resolution section. If the red index is an application index, deleting it can cause loss of application logs.
- Another possibility is that the SearchGuard indices do not contain all the information required for user authentication. A healthy SearchGuard index usually contains 5 documents. See this example of a healthy, populated index:
green open .searchguard.logging-es-smeuexjr-2-qmc16 1 2 5 0 66kb 66kb
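The document counts can be checked directly, without scanning the full index listing, by restricting both the index pattern and the columns in the _cat API. A minimal sketch (the expected count of 5 corresponds to the config, roles, rolesmapping, internalusers, and actiongroups documents):

```shell
#!/bin/sh
# Sketch only: list health and document count for the SearchGuard
# indices only. A healthy index reports docs.count = 5.
command -v oc >/dev/null 2>&1 || { echo "oc not found; run on a host with cluster access"; exit 0; }
OC_LOGGING=${OC_LOGGING:-openshift-logging}

# Pick any ES pod to issue the query from
anypod=$(oc -n "${OC_LOGGING}" get pods --selector=component=es \
  -o jsonpath='{.items[0].metadata.name}')

# _cat/indices accepts an index pattern and an h= column list
oc -n "${OC_LOGGING}" exec "$anypod" -- curl -s \
  --key /etc/elasticsearch/secret/admin-key \
  --cert /etc/elasticsearch/secret/admin-cert \
  --cacert /etc/elasticsearch/secret/admin-ca \
  'https://localhost:9200/_cat/indices/.searchguard*?v&h=health,index,docs.count'
```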
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.