OpenShift Container Platform Cluster Metrics: Common Issues

Metrics for OpenShift Enterprise and OpenShift Container Platform v3 can fail for a number of different reasons. Use the following headings to find the one that best describes your issue:

Deployer Issues

  • Check the logs and the events for the deployer pod to see what information can be collected, including the specific error code (if possible).
    • The logs can be collected using oc logs <NAME_OF_POD>
    • The events should be collected within 4 hours of running the deployer, using oc describe pod <NAME_OF_POD>
  • If the logs indicate that it cannot find the image, then you simply need to add the image specification as shown in that KCS.
  • If the events indicate that it failed to mount the volume you provided for persistent storage, then please follow this KCS, which outlines what might have gone wrong.
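
The two collection steps above can be wrapped in a small helper. This is only a minimal sketch: the function name collect_pod_diagnostics and the output file names are assumptions made for illustration, and only the oc logs and oc describe pod commands come from this article.

```shell
# Hypothetical helper: save the logs and the events of one pod to files.
# The function name and the file names are illustrative assumptions.
collect_pod_diagnostics() {
    pod="$1"
    if [ -z "$pod" ]; then
        echo "usage: collect_pod_diagnostics <NAME_OF_POD>" >&2
        return 2
    fi
    # Container output, including any error printed by the deployer.
    oc logs "$pod" > "${pod}.logs" 2>&1
    # The describe output, which ends with the events for the pod.
    oc describe pod "$pod" > "${pod}.events" 2>&1
}
```

Calling collect_pod_diagnostics <NAME_OF_POD> from a logged-in session should leave <NAME_OF_POD>.logs and <NAME_OF_POD>.events in the current directory, ready to attach to a support case.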

Metrics Cannot Attach and/or Mount the Intended Storage

  • Metrics needs to have the appropriate permissions/ownership set on the storage provided to the Cassandra pod.
  • As explained in this KCS, the pod, and the SCC it uses, can dictate the group and/or user IDs, and therefore you might need to chown the storage to a different user or group to get Cassandra working.

Metrics fails to deploy with Persistent Storage when PVC is self-created

  • The Metrics Deployer creates its own PersistentVolumeClaim when you indicate that you want to use persistent storage.
    • This means that if you have manually created your own PVC and let it bind to the PV you created for Metrics, the deployer will not be able to successfully connect the Cassandra pods to the storage, as it will already be bound.
  • You can read more about this here

Pods Failing to Start

  • Check the logs for each pod (heapster, hawkular-metrics, and hawkular-cassandra) for any specific errors.
  • Also check the events for the cluster (or pod, if you are using version 3.2+) for more information about the status of the pods.
# oc project openshift-infra
# oc get pods 
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-<HASH>   1/1       Running   7          33d
hawkular-metrics-<HASH>       1/1       Running   18         33d
heapster-<HASH>               1/1       Running   8          33d
# oc logs heapster-<HASH> &> heapster.logs
# oc logs hawkular-metrics-<HASH> &> hawkular-metrics.logs
# oc logs hawkular-cassandra-1-<HASH> &> hawkular-cassandra.logs
# oc get events 
***** In version 3.2 we introduced a feature that also ties events to pods in the output of oc describe *****
# oc describe pods heapster-<HASH> &> heapster.events
# oc describe pods hawkular-metrics-<HASH> &> hawkular-metrics.events
# oc describe pods hawkular-cassandra-1-<HASH> &> hawkular-cassandra.events
  • Cassandra is the main component for metrics. The other two (Heapster and Hawkular-Metrics) depend on Cassandra getting into a running state.
    • With that said, Heapster also depends on Hawkular-Metrics to be running.
  • Because of this, if the Cassandra pod is in a failed/pending state, then its logs and events are going to be the most important, as the other pods depend on that one functioning.
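
To see quickly which pods need attention first, the oc get pods table shown above can be filtered on its STATUS column. The following is a minimal sketch: the function name not_running_pods is a hypothetical helper, and it assumes the column layout shown in the listing above (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical filter: read `oc get pods` output on stdin and print the
# names of pods whose STATUS column (third field) is not "Running".
# Assumes the five-column layout shown in the listing above.
not_running_pods() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Usage (needs a logged-in oc client):
#   oc get pods -n openshift-infra | not_running_pods
```

If hawkular-cassandra appears in the output, start with its logs and events before looking at the other two pods.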

Metrics From Some Nodes

  • Make sure that the nodes are using an NTP server to synchronize their clocks (and verify that the clocks are the same).
    • This is because the nodes themselves only store metrics for around 2 minutes, so if Heapster requests the metrics from a node for the time 10:00 - 10:02, but the node's clock is currently set to 10:05, then the node will return an empty list of metrics.
    • NOTE: If the Heapster container (events and logs) does not show any metrics, then this is not the issue (since the clock that Heapster uses is its node's clock).
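
The clock comparison can be made concrete by comparing Unix timestamps taken on two nodes. This is a rough sketch under stated assumptions: the function names are hypothetical, and the 120-second default threshold is only an approximation of the "around 2 minutes" retention window mentioned above.

```shell
# Hypothetical helper: absolute difference, in seconds, between two Unix
# timestamps (for example the output of `date +%s` on two nodes).
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Hypothetical check: succeed only if the skew is below a limit in seconds.
# The default of 120 is an assumption based on the 2-minute window above.
clock_skew_ok() {
    limit="${3:-120}"
    [ "$(clock_skew_seconds "$1" "$2")" -lt "$limit" ]
}
```

In the example above, a node at 10:05 answering a request for 10:00 - 10:02 is 300 seconds ahead, so clock_skew_ok would fail. The timestamps could be gathered with something like ssh <NODE> date +%s, which adds a small amount of network delay to the measured value.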

Metrics From Some Nodes option 2

  • Make sure that all of the nodes (and any firewall) have the required ports open for IPv4 traffic, as Heapster needs to be able to communicate directly with each system's cAdvisor in order to collect the metrics data for each node.

Checking the version of metrics images

  • If you are encountering issues with your existing metrics cluster, you should make sure that you are using the newest available metrics version that relates to the version of OpenShift you are running.
    • You can see more details about how to check and correct this in this KCS.

Do not use the latest tag for the metrics image version

  • With regard to the Cluster Metrics images, the latest tag pulls the latest version available of these images, rather than the latest version for the OpenShift version you are running.
    • This can cause issues as these images are not tested for older versions of OpenShift Container Platform.
  • You can read more about this here.

Metrics Shows Empty Charts After a Few Minutes

  • Heapster is collecting data too rapidly, with a process that takes too long. The solution is to increase the interval between collections by increasing the metric_resolution or stats_resolution value in the ReplicationController for Heapster.
    • metric_resolution is for versions 3.2.1 and beyond, while stats_resolution is for versions up to, and including, 3.2.0
    • You can refer to this solution for specific details.
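
As an illustration of the change, the following sketch rewrites the resolution value in a ReplicationController definition read from standard input. The function name set_metric_resolution is hypothetical, the flag spelling --metric_resolution is taken from the Heapster command listing later in this article, and the 60s value in the usage line is only an assumed example, not a recommendation.

```shell
# Hypothetical filter: rewrite the value of --metric_resolution in a
# ReplicationController definition read from stdin and print the result.
# For versions up to 3.2.0 the flag name would be stats_resolution instead.
set_metric_resolution() {
    new="$1"
    if [ -z "$new" ]; then
        echo "usage: set_metric_resolution <DURATION>" >&2
        return 2
    fi
    # Replace everything after "=" up to the next whitespace.
    sed -E "s/(--metric_resolution=)[^[:space:]]+/\1${new}/"
}

# Preview the change before applying it with `oc edit rc heapster`:
#   oc get rc heapster -o yaml | set_metric_resolution 60s | grep metric_resolution
```

The filter only prints the modified definition; it does not change the cluster, so the value still has to be saved in the ReplicationController by following the linked solution.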

Metrics Charts Show Small Gaps in the Metrics

  • This issue is the inverse of Metrics Shows Empty Charts After a Few Minutes, so the fix is to decrease the metric_resolution or stats_resolution value by following the same process described in the section above.
    • metric_resolution is for versions 3.2.1 and beyond, while stats_resolution is for versions up to, and including, 3.2.0
    • You can refer to this solution for specific details.

Updating From Previous Versions of OpenShift

Heapster "Cannot Find A Node"

  • Please confirm that the error message is the same as (or very similar to) the one outlined in our solution about this. If it is, this is a false alarm caused by an individual node metric not being reported, and therefore not an actual issue.

Metrics public URL is accessible but metrics tab says Metrics are not available and 504 Gateway Timeout

  • Make sure the view role is added to the hawkular service account.

     $ oc get rolebinding -n openshift-infra
     $ oc policy add-role-to-user view system:serviceaccount:openshift-infra:hawkular -n openshift-infra
    

Heapster - Increase application verbosity

  • In some cases it might be necessary to increase the application verbosity. To do so, edit the replicationcontroller/heapster:
$ oc edit rc heapster
  • Then add the following two lines (--logtostderr=true and --vmodule=*=4) after the --sink attribute, as shown at the end of this command section:
      - command:
        - heapster-wrapper.sh
        - --wrapper.allowed_users_file=/secrets/heapster.allowed-users
        - --source=kubernetes.summary_api:${MASTER_URL}?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
        - --tls_cert=/secrets/heapster.cert
        - --tls_key=/secrets/heapster.key
        - --tls_client_ca=/secrets/heapster.client-ca
        - --allowed_users=%allowed_users%
        - --metric_resolution=30s
        - --wrapper.username_file=/hawkular-account/hawkular-metrics.username
        - --wrapper.password_file=/hawkular-account/hawkular-metrics.password
        - --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
        - --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
        - --logtostderr=true
        - --vmodule=*=4
        env:
  • After saving the changes, delete the current heapster pod so that the new changes take effect.
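
Before deleting the pod, it can help to confirm that both flags were actually saved. The following is a minimal sketch: the function name has_verbose_flags is a hypothetical helper, and it only looks for the two flag strings from the listing above anywhere in the definition.

```shell
# Hypothetical check: read a ReplicationController definition on stdin and
# succeed only if both verbosity flags from the listing above are present.
has_verbose_flags() {
    def=$(cat)
    # -F matches the strings literally, so the "*" is not a pattern.
    printf '%s\n' "$def" | grep -qF -- '--logtostderr=true' &&
    printf '%s\n' "$def" | grep -qF -- '--vmodule=*=4'
}

# Usage (needs a logged-in oc client):
#   oc get rc heapster -o yaml | has_verbose_flags && echo "flags present"
```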

Checking the health of the Cassandra Cluster

  • Cassandra is the component of the Metrics cluster that is used to actually store the data for extended periods of time.
    • This means that if Cassandra is not working properly, the data cannot be written or read back, and your metrics will not be reported properly.
  • To check on the overall health of your Cassandra cluster (whether that is a single pod or several), follow the details in this KCS, which outlines a command you can run against the cluster and helps explain its output.
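
The command from the linked KCS is not reproduced in this article. As an assumption, the sketch below uses Cassandra's standard nodetool status command and filters its output for nodes that are not in the Up/Normal (UN) state. The function name unhealthy_cassandra_nodes is hypothetical, and whether nodetool is available on the PATH inside the hawkular-cassandra container is also an assumption.

```shell
# Hypothetical filter: read `nodetool status` output on stdin and print the
# state code and address of every node that is not Up/Normal (UN).
# Node lines start with a two-letter code: U or D (Up/Down) followed by
# N, L, J or M (Normal/Leaving/Joining/Moving).
unhealthy_cassandra_nodes() {
    awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}

# Usage (assumes nodetool is available inside the Cassandra pod):
#   oc exec hawkular-cassandra-1-<HASH> -- nodetool status | unhealthy_cassandra_nodes
```

No output means every listed node reported UN. Comparing the number of node lines with the number of Cassandra pods is also a quick way to notice when Cassandra believes it has more nodes than it should.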

Cassandra pod failing with corruption errors

  • Occasionally, data in Cassandra can get corrupted. When Cassandra detects this, it will likely have trouble starting back up.
    • If this occurs, you should clean up the corrupted data. You can see how to do this in this KCS.

Hawkular-Metrics OOMEs around DeleteExpiredMetrics

  • The hawkular-metrics pod had a background

Hawkular-cassandra disk is not getting cleaned up automatically

  • A bug was identified in OCP 3.7.46 and earlier versions, and a workaround to clean up snapshots is provided. Follow the solution.

Cassandra sees extra nodes

  • Cassandra occasionally gets into a state where it thinks there are more nodes than there should be. This is a known issue with bugs filed against it as well as a known workaround.

Hawkular-metrics presenting issues communicating with cassandra

  • Cassandra seems to be having issues handling the traffic. This can be resolved by scaling up Cassandra or reinstalling metrics. Read here for more details.

Hawkular-Metrics has NullPointerException reaching cassandra

  • Cassandra has some broken data in the database, which causes hawkular-metrics to produce NullPointerException errors. Read here for more details.