Troubleshooting OpenShift Container Platform 3.x: Cluster Metrics
Table of Contents
- Review of components
- Viewing Graphs
- Diagnostics
- Debug Info
- Cassandra and Garbage Collection
- hawkular-metrics Liveness Probe Fails
- hawkular-metrics OOM killed without OOME
- Common Issues
Starting off, a review of the main components involved:
- Collecting and Storing Metric Data
- Heapster periodically polls cAdvisor on every node in the cluster to gather pod-level metrics. Heapster sends those metrics to Hawkular Metrics by way of an HTTP request.
- Hawkular Metrics performs some transformations on the data in the request and then sends multiple write requests to Cassandra to persist the metrics data.
- It is important to note that collecting and storing metrics is an automated process that happens in the background.
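As a concrete illustration of the write path above, the following sketch builds the kind of JSON body Heapster POSTs to Hawkular Metrics. The metric id, tenant, timestamp, and value are made-up examples; the `/gauges/raw` endpoint and `Hawkular-Tenant` header come from the Hawkular Metrics REST API.

```shell
# Sketch of a Hawkular Metrics gauge write payload (values are examples).
payload='[{"id": "pod/cpu/usage_rate", "data": [{"timestamp": 1504000000000, "value": 0.42}]}]'
echo "$payload"

# The request itself (not executed here) would look like:
# curl -X POST -H "Content-Type: application/json" -H "Hawkular-Tenant: myproject" \
#      -d "$payload" https://hawkular-metrics.example.com/hawkular/metrics/gauges/raw
```

The host name above is a placeholder; in a real cluster the route for the hawkular-metrics service would be used instead.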
Viewing Graphs
- The web console makes a REST API call to Hawkular Metrics, which in turn submits one or more queries to Cassandra.
- Cassandra returns data points to Hawkular Metrics, which then applies transformations, notably converting the data to JSON, and includes the data in the HTTP response.
Diagnostics
- To investigate issues with assistance from Red Hat, please collect logs, pod templates, the total number of pods in the cluster, and specific details from Cassandra:

```
# Logs:
oc logs -n openshift-infra <hawkular-cassandra_pods>
oc logs -n openshift-infra <hawkular-metrics_pod>
oc logs -n openshift-infra <heapster_pod>

# If the pods have recently restarted due to any sort of issue,
# add the option -p (for previous) to the above commands:
oc logs -n openshift-infra -p <hawkular-cassandra_pods>
oc logs -n openshift-infra -p <hawkular-metrics_pod>
oc logs -n openshift-infra -p <heapster_pod>

# Pod templates:
oc -n openshift-infra get pods -o yaml

# Number of pods in the cluster:
oc get pods --all-namespaces | wc -l

# Cassandra information (needs to be run against each Cassandra pod):
oc -n openshift-infra exec <cassandra-pod> -- nodetool status
oc -n openshift-infra exec <cassandra-pod> -- nodetool tpstats
oc -n openshift-infra exec <cassandra-pod> -- nodetool describecluster
oc -n openshift-infra exec <cassandra-pod> -- nodetool proxyhistograms
oc -n openshift-infra exec <cassandra-pod> -- nodetool tablestats hawkular_metrics
oc -n openshift-infra exec <cassandra-pod> -- nodetool tablehistograms hawkular_metrics data
oc -n openshift-infra exec <cassandra-pod> -- nodetool tablehistograms hawkular_metrics metrics_tags_idx
```
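The Cassandra portion of the collection above can be scripted. This is a minimal sketch that prints the full set of `nodetool` diagnostics for one Cassandra pod; the pod name used below is an example, and piping the output through `sh` to execute it is optional.

```shell
# Print the nodetool diagnostic commands for a given Cassandra pod.
cassandra_diag_cmds() {
  pod="$1"
  for cmd in "status" "tpstats" "describecluster" "proxyhistograms" \
             "tablestats hawkular_metrics" \
             "tablehistograms hawkular_metrics data" \
             "tablehistograms hawkular_metrics metrics_tags_idx"; do
    echo "oc -n openshift-infra exec $pod -- nodetool $cmd"
  done
}

# Example pod name (an assumption; substitute your own pods):
cassandra_diag_cmds hawkular-cassandra-1
# Against a live cluster:  cassandra_diag_cmds hawkular-cassandra-1 | sh
```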
- These commands have also been built into the OpenShift SOS Plugin; one simply needs to run it against the `openshift-infra` project and it will automatically collect this information.
- If more debug-level output is needed, change the Hawkular Metrics logging level by setting the `ADDITIONAL_LOGGING` environment variable in the following manner:

```
$ oc set env rc/hawkular-metrics ADDITIONAL_LOGGING="org.hawkular.metrics=DEBUG"
```
Debug Info
- To enable debug logs for Heapster, please refer to this solution, which outlines how to modify the verbosity.
- To enable debug logs for Hawkular Metrics, please refer to this solution, which provides exact steps to do so.
- Cassandra already provides a significant amount of information and typically does not require additional verbosity.
Cassandra and Garbage Collection
- It is very helpful to have a little background on this topic, because garbage collection is usually one of the first things to look at when debugging and tuning these environments.
- The Java Heap is divided into two sections, called the young, or new, generation and the old, or tenured, generation.
- Newly created and short-lived objects reside in the young generation.
- Longer living objects eventually get moved from the young generation into the old generation.
- Cassandra uses the parallel collector for minor collections in the young generation, and it uses the ConcurrentMarkSweep (CMS) collector for major collections in the old generation.
- The parallel collector is a multi-threaded collector. It trades CPU for speed. Parallel collections, particularly in the young generation, should be fast, even sub-millisecond.
- This is important because parallel collections are Stop The World (STW) collections, meaning all application threads are stopped while the collection runs.
- CMS collections are single threaded and involve a number of steps. CMS collections run alongside application threads; however, there are a couple of steps in which STW pauses happen.
- By default, Cassandra logs an `INFO` message for any garbage collection (GC) that takes longer than 200 ms and a `WARN` message for any GC that takes longer than 1000 ms. Here are a couple of examples to illustrate:

```
INFO [Service Thread] 2017-09-13 08:06:47,234 GCInspector.java:284 - ParNew GC in 227ms. CMS Old Gen: 384823424 -> 412967784; Par Eden Space: 167772160 -> 0;
WARN [Service Thread] 2018-01-31 09:38:52,297 GCInspector.java:282 - ConcurrentMarkSweep GC in 2815ms. CMS Old Gen: 814992680 -> 395257864; Code Cache: 32195200 -> 32327424; Par Eden Space: 20834528 -> 405712136;
```

- The `INFO` message tells us that there was a ParNew (parallel new generation) collection that took 227 ms.
- The `WARN` message tells us that there was a CMS collection that took 2815 ms.
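Scanning for pauses like these can be automated. The following is a minimal sketch that flags GC pauses above the 1000 ms WARN threshold; the sample input below reuses the GCInspector lines shown above, and against a live cluster you would pipe in the hawkular-cassandra pod logs instead.

```shell
# Flag GC pauses over 1000 ms in Cassandra GCInspector log lines.
flag_long_gc() {
  awk '/GCInspector/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+ms\.$/) {          # fields like "2815ms."
        ms = $i
        gsub(/[^0-9]/, "", ms)            # strip everything but digits
        if (ms + 0 > 1000) print "GC pause over threshold: " ms " ms"
      }
  }'
}

flag_long_gc <<'EOF'
INFO [Service Thread] 2017-09-13 08:06:47,234 GCInspector.java:284 - ParNew GC in 227ms. CMS Old Gen: 384823424 -> 412967784; Par Eden Space: 167772160 -> 0;
WARN [Service Thread] 2018-01-31 09:38:52,297 GCInspector.java:282 - ConcurrentMarkSweep GC in 2815ms. CMS Old Gen: 814992680 -> 395257864; Code Cache: 32195200 -> 32327424; Par Eden Space: 20834528 -> 405712136;
EOF
# prints: GC pause over threshold: 2815 ms
```

Real usage would look like `oc logs -n openshift-infra <hawkular-cassandra_pod> | flag_long_gc`.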
hawkular-metrics Liveness Probe Fails
- The liveness probe executes a GET against a status endpoint that does not perform any real work.
- The endpoint does not query Cassandra or perform any resource-intensive computations.
- When the liveness probe fails, it is almost always because of GC pauses, which are followed by the JVM throwing an OutOfMemoryError.
- Usually, hawkular-metrics will not produce OutOfMemoryErrors.
- If it does happen, please open a support ticket and provide the diagnostics information.
- Unfortunately, we are limited in the diagnostic information we can get when the hawkular-metrics JVM crashes with an OutOfMemoryError.
- The JVM will write a heap dump file on exit, but the dump is written to the container file system and is lost when the pod restarts, which makes it hard to collect. Writing the heap dump file to a persistent volume is something that should be done only temporarily, on a case-by-case basis, if the problem persists and requires more in-depth debugging.
hawkular-metrics OOM killed without OOME
Note: this header specifically says without an OOMError; if an OOMError does occur, please open a support ticket and provide the diagnostics information.
- This behavior indicates that the pod has hit its memory limit or that the node it was running on is overcommitted.
- There are two solutions for this sort of error:
  - scaling the hawkular-metrics rc up, or
  - adjusting the memory settings.
- If choosing to adjust the memory settings, use the table below as a rough guide.
NOTE: These numbers are not an official recommendation for a number of reasons, one specifically being that scaling up hawkular-metrics can also alleviate OOM issues
| Cluster Size | Memory available for hawkular-metrics pod |
|---|---|
| < 500 pods | 1 GB |
| 500 - 2000 pods | 2 GB |
| > 2000 pods | 4 GB |
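The table above can be expressed as a small helper that maps the pod-count figure from the Diagnostics section to the rough memory guide. This is a sketch only; the thresholds mirror the table, and, as noted, scaling hawkular-metrics up is an alternative to raising memory limits.

```shell
# Map cluster pod count to the rough memory guide from the table above.
suggested_hawkular_memory() {
  pods="$1"
  if [ "$pods" -lt 500 ]; then
    echo "1 GB"
  elif [ "$pods" -le 2000 ]; then
    echo "2 GB"
  else
    echo "4 GB"
  fi
}

# Example with a hypothetical count; on a live cluster you would use:
#   suggested_hawkular_memory "$(oc get pods --all-namespaces --no-headers | wc -l)"
suggested_hawkular_memory 1200   # prints: 2 GB
```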
- There are a number of different explanations for hawkular-metrics encountering an OOME:
- High read load due to user-facing requests. This is not very likely, given how little data the hawkular-metrics pod actually collects for these queries.
- Background jobs doing reads can fetch a large amount of data, causing heap pressure. Please see the Common Issues section for information about some known jobs.
- More to come...
Common Issues
- There is a separate article regarding Common Issues with OpenShift Container Platform Cluster Metrics.
- Please visit this article for further assistance.