OpenShift Container Platform Cluster Metrics: Common Issues
Metrics for OpenShift Enterprise and OpenShift Container Platform v3 can fail for a number of different reasons. Use the following headings to find the section that best describes your issue:
- Deployer Issues
- Metrics Cannot Attach and/or Mount the Intended Storage
- Metrics fails to deploy with Persistent Storage when PVC is self-created
- Pods Failing to Start
- Metrics From Some Nodes
- Metrics From Some Nodes option 2
- Checking the version of metrics images
- Do not use the latest tag for the metrics image version
- Metrics Shows Empty Charts After a Few Minutes
- Metrics Charts Show Small Gaps in the Metrics
- Updating From Previous Versions of OpenShift
- Heapster "Cannot Find A Node"
- Metrics public URL is accessible but metrics tab says Metrics are not available and 504 Gateway Timeout
- Heapster - Increase application verbosity
- Checking the health of the Cassandra Cluster
- Cassandra pod failing with corruption errors
- Hawkular-cassandra disk is not getting cleaned up automatically
- Cassandra sees extra nodes
- Hawkular-metrics presenting issues communicating with cassandra
- Hawkular-Metrics has NullPointerException reaching cassandra
Deployer Issues
- Check the logs and the events for the deployer pod to see what information can be collected, as well as the specific error code (if possible).
- The logs can be collected with
oc logs <NAME_OF_POD>
- The events should be collected within 4 hours of running the deployer with
oc describe pod <NAME_OF_POD>
- If the logs indicate that it cannot find the image, then you simply need to add the image specification as shown in that KCS.
- If the events indicate that it failed on mounting the volume you provided for persistent storage, then please follow this KCS, which outlines what might have gone wrong.
Metrics Cannot Attach and/or Mount the Intended Storage
- Metrics needs to have the appropriate permissions/ownership set on the storage provided to the cassandra pod.
- As explained in this KCS, the pod, and the SCC it uses, can dictate the group and/or user IDs, so you might need to chown the storage to a different owner to get Cassandra working.
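As a sketch of that check, the supplemental group the Cassandra pod runs with can be read from the pod spec, and the backing storage chowned to match. The pod name, group ID, and export path below are placeholders for your environment:

```shell
# Find the fsGroup the cassandra pod runs with (set by the project's SCC)
oc get pod hawkular-cassandra-1-<HASH> -n openshift-infra \
  -o jsonpath='{.spec.securityContext.fsGroup}{"\n"}'

# On the storage server, give that group ownership of the backing directory
# (example path; replace with the actual directory exported for the PV)
chown -R root:<FSGROUP> /exports/metrics
chmod -R 0770 /exports/metrics
```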
Metrics fails to deploy with Persistent Storage when PVC is self-created
- The Metrics Deployer creates its own PersistentVolumeClaim when you indicate that you want to use persistent storage.
- This means that if you have manually created your own PVC and let it bind to the PV you created for Metrics, the deployer will not be able to successfully connect the Cassandra pods to the storage, as it will already be bound.
- You can read more about this here
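To verify whether a pre-existing claim is the problem, you can check what the metrics PV is bound to before running the deployer. The claim name below is the one the 3.x deployer creates by default; treat it as an assumption and check your own cluster:

```shell
# List claims in the metrics project and the PVs they are bound to
oc get pvc -n openshift-infra
oc get pv

# The deployer creates its own claim (metrics-cassandra-1 by default in 3.x).
# If your PV is already Bound to a manually created PVC, remove that claim
# so the deployer's claim can bind to the PV instead:
oc delete pvc <MANUALLY_CREATED_PVC> -n openshift-infra
```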
Pods Failing to Start
- Check the logs for each pod (heapster, hawkular-metrics, and hawkular-cassandra) for any specific errors.
- Also check the events for the cluster (or pod, if you are using version 3.2+) for more information about the status of the pods.
# oc project openshift-infra
# oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-<HASH>   1/1       Running   7          33d
hawkular-metrics-<HASH>       1/1       Running   18         33d
heapster-<HASH>               1/1       Running   8          33d
# oc logs heapster-<HASH> &> heapster.logs
# oc logs hawkular-metrics-<HASH> &> hawkular-metrics.logs
# oc logs hawkular-cassandra-<HASH> &> hawkular-cassandra.logs
# oc get events
Note: in version 3.2, events are also tied to individual pods and appear in the oc describe output.
# oc describe pods heapster-<HASH> &> heapster.events
# oc describe pods hawkular-metrics-<HASH> &> hawkular-metrics.events
# oc describe pods hawkular-cassandra-<HASH> &> hawkular-cassandra.events
- Cassandra is the main component for metrics. The other two (Heapster and Hawkular-Metrics) depend on Cassandra getting into a running state.
- With that said, Heapster also depends on Hawkular-Metrics to be running.
- Because of this, if the Cassandra pod is in a failed/pending state, its logs and events are the most important ones to collect, as the other pods depend on it functioning.
Metrics From Some Nodes
- Make sure that the nodes are using an NTP server to synchronize their clocks (and verify that the clocks are the same).
- This is because the nodes themselves only store metrics for around 2 minutes, so if Heapster requests the metrics from a node for time 10:00 - 10:02, but the node's clock is currently set to 10:05, then it will return an empty list of metrics.
- NOTE: if the Heapster container (events and logs) does not show any metrics at all, then this is not the issue (since the clock that Heapster uses is its own node's clock).
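A quick way to spot clock skew is to compare the current time reported by every node. This sketch assumes ssh access from the machine running oc; any node that is minutes off from the others is a candidate for this issue:

```shell
# Print each node's current UTC time next to its name
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$node: "
  ssh "$node" date -u +%H:%M:%S
done
```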
Metrics From Some Nodes option 2
- Make sure that all of the nodes have the required ports open (on the nodes and on any firewall) for IPv4 traffic, as Heapster needs to communicate directly with each node's cAdvisor in order to collect that node's metrics data.
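A minimal connectivity check, assuming the default secure kubelet port 10250 that Heapster scrapes, can be run from the node where the Heapster pod is scheduled:

```shell
# Verify that the kubelet read port is reachable from the Heapster node
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  if timeout 3 bash -c "echo > /dev/tcp/$node/10250" 2>/dev/null; then
    echo "$node: port 10250 open"
  else
    echo "$node: port 10250 BLOCKED"
  fi
done
```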
Checking the version of metrics images
- If you are encountering issues with your existing metrics cluster, you should make sure that you are using the newest available metrics version that relates to the version of OpenShift you are running.
- You can see more details about how to check and correct this in this KCS.
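As a quick sketch of that check, the image tag each metrics component is actually running can be read from the replication controllers in the metrics project:

```shell
# Show the image (and tag) used by each metrics replication controller;
# the tags should match the OpenShift version you are running
oc get rc -n openshift-infra \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
```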
Do not use the latest tag for the metrics image version
- With regard to the Cluster Metrics images, the latest tag pulls the latest version available of these images, rather than the latest version for the OpenShift version you are running.
- This can cause issues, as these images are not tested against older versions of OpenShift Container Platform.
- You can read more about this here.
Metrics Shows Empty Charts After a Few Minutes
- Heapster is collecting data too rapidly, with a process that takes too long. The solution is to increase the time between collections by increasing the metrics_resolution or stats_resolution value in the ReplicationController for Heapster.
- metrics_resolution is for versions 3.2.1 and beyond, while stats_resolution is for versions up to, and including, 3.2.0.
- You can refer to this solution for specific details.
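For example, on 3.2.1+ the setting appears as the --metric_resolution flag in the Heapster container command; the 60s value and the label selector below are illustrative, the latter being the label the standard metrics templates apply:

```shell
# Open the Heapster replication controller for editing
oc edit rc heapster -n openshift-infra

# ...and raise the resolution in the container command, e.g.:
#   - --metric_resolution=30s   ->   - --metric_resolution=60s

# Delete the running pod so the RC recreates it with the new setting
oc delete pod -l metrics-infra=heapster -n openshift-infra
```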
Metrics Charts Show Small Gaps in the Metrics
- The issue is actually the inverse of "Metrics Shows Empty Charts After a Few Minutes", and therefore the fix is to decrease the metrics_resolution or stats_resolution value by following the same process mentioned in the section above.
- metrics_resolution is for versions 3.2.1 and beyond, while stats_resolution is for versions up to, and including, 3.2.0.
- You can refer to this solution for specific details.
Updating From Previous Versions of OpenShift
- When updating between versions, e.g. 3.1.1 to 3.2.0 or 3.2.0 to 3.2.1, make sure to follow the docs.
- Failure to follow the docs can result in strange failures, such as pods failing to start due to missing files, or other generally strange behavior. Docker images can be tightly coupled to a specific template, and without updating both in tandem you can run into problems.
Heapster "Cannot Find A Node"
- Please confirm that the error message is the same as (or very similar to) the one outlined in our solution about this. If it is, you can rest easy: this is a false alarm caused by an individual node metric not being reported, not an actual issue.
Metrics public URL is accessible but metrics tab says Metrics are not available and 504 Gateway Timeout
- Make sure the view role is added to the hawkular service account:
$ oc get rolebinding -n openshift-infra
$ oc policy add-role-to-user view system:serviceaccount:openshift-infra:hawkular -n openshift-infra
Heapster - Increase application verbosity
- In some cases it might be necessary to increase the application verbosity. To do so, the following change needs to be made in the replicationcontroller/heapster:
$ oc edit rc heapster
- Then add the final two lines shown below (--logtostderr=true and --vmodule=*=4) after the --sink attribute, so the command section looks like this:
- command:
- heapster-wrapper.sh
- --wrapper.allowed_users_file=/secrets/heapster.allowed-users
- --source=kubernetes.summary_api:${MASTER_URL}?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
- --tls_cert=/secrets/heapster.cert
- --tls_key=/secrets/heapster.key
- --tls_client_ca=/secrets/heapster.client-ca
- --allowed_users=%allowed_users%
- --metric_resolution=30s
- --wrapper.username_file=/hawkular-account/hawkular-metrics.username
- --wrapper.password_file=/hawkular-account/hawkular-metrics.password
- --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
- --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
- --logtostderr=true
- --vmodule=*=4
env:
- After saving the changes, delete the current heapster pod so that the new changes take effect.
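That restart step can be done as follows; the label selector is the one applied by the standard metrics templates, so verify it matches your pod's labels:

```shell
# Remove the running pod; the replication controller recreates it
# with the edited command, including the new verbosity flags
oc delete pod -l metrics-infra=heapster -n openshift-infra

# Confirm the replacement pod is up before checking its more verbose logs
oc get pods -n openshift-infra
```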
Checking the health of the Cassandra Cluster
- Cassandra is the component of the Metrics cluster that is used to actually store the data for extended periods of time.
- This means if Cassandra is not working properly, the data cannot be written or read, and your metrics will not be reported properly.
- To check on the overall health of your Cassandra Cluster (whether that is a single pod or several) you can follow the details in this KCS that outlines a command you can run against the cluster and helps explain the output a bit.
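The health check in question is typically Cassandra's own nodetool status report, run inside one of the Cassandra pods (the pod name is a placeholder):

```shell
# Every node in the output should show "UN" (Up/Normal) in the first column;
# "DN" (Down/Normal) or unexpected extra entries indicate a problem
oc exec hawkular-cassandra-1-<HASH> -n openshift-infra -- nodetool status
```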
Cassandra pod failing with corruption errors
- Occasionally, data in Cassandra can get corrupted. When this occurs, Cassandra will likely experience issues getting started back up when it detects this.
- If this occurs, you should work to clean up the corrupted data. You can see how to do this in this KCS.
Hawkular-Metrics OOMEs around DeleteExpiredMetrics
- The hawkular-metrics pod runs a background job to delete expired metrics, and this job can trigger OutOfMemoryErrors; see the related solution for details.
Hawkular-cassandra disk is not getting cleaned up automatically
- On OCP 3.7.46 and earlier versions, a bug was identified and a workaround was provided to clean up snapshots. Follow this Solution.
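The manual cleanup generally amounts to clearing old snapshots with nodetool from inside the Cassandra pod (a sketch; the pod name is a placeholder):

```shell
# Check how much space snapshots are consuming inside the cassandra pod
oc exec hawkular-cassandra-1-<HASH> -n openshift-infra -- nodetool listsnapshots

# Remove all snapshots to reclaim the disk space
oc exec hawkular-cassandra-1-<HASH> -n openshift-infra -- nodetool clearsnapshot
```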
Cassandra sees extra nodes
- Cassandra occasionally gets into a state where it thinks there are more nodes than there should be. This is a known issue with bugs filed against it as well as a known workaround.
Hawkular-metrics presenting issues communicating with cassandra
- Cassandra seems to have issues handling the traffic; this can be resolved by scaling up Cassandra or reinstalling metrics. Read here for more details.
Hawkular-Metrics has NullPointerException reaching cassandra
- Cassandra has some broken data in the database, which causes hawkular-metrics to produce NullPointerException errors. Read here for more details.