Dedicated Service Monitors - Questions and Answers

Solution Unverified - Updated

Environment

Red Hat OpenShift Container Platform 4

Issue

Red Hat OpenShift Container Platform's Cluster Monitoring Operator can be configured with the DedicatedServiceMonitors feature.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    k8sPrometheusAdapter:
      dedicatedServiceMonitors:
        enabled: true

When enabled is set to true, the Cluster Monitoring Operator (CMO) deploys a dedicated Service Monitor that exposes the kubelet /metrics/resource endpoint. This Service Monitor sets honorTimestamps: true and only keeps metrics that are relevant for the pod resource queries of Prometheus Adapter. Additionally, Prometheus Adapter is configured to use these dedicated metrics. Overall, this feature improves the consistency of Prometheus Adapter-based CPU usage measurements used by, for example, the oc adm top pod command or the Horizontal Pod Autoscaler.
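Conceptually, the dedicated Service Monitor that CMO deploys looks like the following sketch. This is an illustration only, not the exact manifest CMO generates; the name, port, and kept-metric list are assumptions:

```yaml
# Illustrative sketch -- not the exact manifest generated by CMO.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet-resource-metrics   # hypothetical name
  namespace: openshift-monitoring
spec:
  endpoints:
  - path: /metrics/resource        # kubelet resource metrics endpoint
    port: https-metrics
    scheme: https
    honorTimestamps: true          # keep the kubelet's explicit timestamps
    metricRelabelings:
    # Keep only the series that Prometheus Adapter's pod resource
    # queries need; drop everything else at scrape time.
    - sourceLabels: [__name__]
      regex: (container_cpu_usage_seconds_total|container_memory_working_set_bytes)
      action: keep
  selector:
    matchLabels:
      k8s-app: kubelet
```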

This article explains the reasoning behind the feature, its design implications, and planned future improvements to the monitoring stack.

Resolution

cAdvisor metrics (which is what the kubelet exposes) have always been exposed in a suboptimal way as a performance hack: the metrics carry explicit timestamps, whereas Prometheus prefers exporters to produce fresh values at scrape time so that the scrape timestamp can be used. This has far-reaching consequences (detailed below). The Kubernetes community and Red Hat are aware of this long-standing issue, and there is an effort underway to retire cAdvisor. Given the complex nature of container runtimes, however, this is a long process, and Red Hat engineering wanted to provide a fix for the consequences before the community is ready to retire cAdvisor.

Q: Why is honorTimestamps not used by the original kubelet ServiceMonitor?

In 2019, cAdvisor v0.33.0 started to include explicit timestamps in its metrics. This change was included in Kubernetes 1.18, where the kubelet started to do the same on the /metrics/cadvisor endpoint.
In 2020, it was discovered that this can lead to stale data being ingested, as can be observed, for example, when a container is OOMKilled or otherwise restarted. Some of the example graphs show stale data for minutes after a pod restarts. This was fixed in https://github.com/prometheus-operator/kube-prometheus/pull/695 by setting honorTimestamps: false.

Q: Why does using honorTimestamps=true lead to stale data, whereas honorTimestamps=false does not?

Since version 2.0, Prometheus has an improved notion of a stale time series: staleness is now based on when the last sample of a time series was seen, instead of the hard 5-minute timeout used in Prometheus 1.x. This is implemented via stale markers, which are ingested into the time series database (TSDB) when Prometheus considers a time series stale.
This staleness notion, however, cannot apply to time series that are ingested with honorTimestamps=true: in that case, event interleaving could be such that Prometheus considers a time series stale (and ingests a stale marker) but then receives a sample for that series with a timestamp that lies before the stale marker (since the exporter attaches a timestamp in the past). For example, if Prometheus writes a stale marker at time T, a later scrape could still deliver a sample the exporter stamped at T minus a few seconds. Ingesting such a sample would break the fundamental append-only invariant of Prometheus' TSDB. This is why exporters that expose explicit timestamps fall under the old staleness notion (stale after 5 minutes).

Q: Why does the Cluster Monitoring Operator care about stale data?

The old staleness notion impacts alerting negatively. Red Hat OpenShift Container Platform has several alerts that aggregate, for example, pod resource requests. With the old staleness notion, a recently restarted pod is accounted for twice: both its old (stale) series and its new series contribute to the aggregation. For alerts this means either false positives (during the 5-minute staleness timeout) or alerts that have to be delayed to take these 5 minutes into account.
The upstream community (Red Hat included) has given alerting the preference here. That is why Red Hat OpenShift Container Platform uses honorTimestamps: false in order to benefit from the improved staleness handling.
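As an illustration of the double-counting problem, consider a hypothetical alerting rule (not one of the actual OCP alerts) that aggregates pod CPU requests against allocatable capacity:

```yaml
# Hypothetical example -- not an actual OCP alerting rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-overcommit
  namespace: openshift-monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: CPURequestsOvercommitted
      # With the old staleness notion, a restarted pod's old series keeps
      # contributing to this sum for up to 5 minutes, so the pod is
      # counted twice and the alert can fire spuriously -- unless the
      # "for" duration is stretched to cover the staleness window.
      expr: |
        sum(kube_pod_container_resource_requests{resource="cpu"})
          > sum(kube_node_status_allocatable{resource="cpu"})
      for: 10m
```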

Q: Why was DedicatedServiceMonitors introduced as an additional configuration?

In 2022, in https://issues.redhat.com/browse/MON-1949, it was discovered that setting honorTimestamps=false leads to spiky and inaccurate data collection. As a solution, the DedicatedServiceMonitors feature was introduced with https://issues.redhat.com/browse/OCPBUGS-1364 and https://github.com/openshift/cluster-monitoring-operator/pull/1752.

You can now configure an optional kubelet service monitor for Prometheus Adapter (PA) that improves data consistency across multiple
autoscaling requests.
Enabling this service monitor eliminates the possibility that two queries sent at the same time to PA might yield different results
because the underlying PromQL queries executed by PA might be on different Prometheus servers.

(https://github.com/openshift/openshift-docs/pull/52696/commits/361214692fda573e6a1aa0c5431ca6ecca8efbc0)

Configuration for prometheus-adapter dedicated Service Monitors. When Enabled is set to true, the Cluster Monitoring Operator (CMO)
will deploy and scrape a dedicated Service Monitor, that exposes the kubelet /metrics/resource endpoint. This Service Monitor sets
honorTimestamps: true and only keeps metrics that are relevant for the pod resource queries of prometheus-adapter. Additionally
prometheus-adapter is configured to use these dedicated metrics. Overall this will improve the consistency of prometheus-adapter
based CPU usage measurements used by for example the oc adm top pod command or the Horizontal Pod Autoscaler.

(https://github.com/openshift/cluster-monitoring-operator/blob/9d8b9027a2c9a2147e60c8f1218ee3614ed6dff6/pkg/manifests/types.go#L93-L101)

Q: Is there additional overhead due to Dedicated Service Monitors?

One caveat is that this setting causes additional overhead: the stack ingests two additional time series per pod running in the cluster (https://issues.redhat.com/browse/OBSDOCS-64).
However, this additional overhead was shown to be minimal (https://issues.redhat.com/browse/MON-1949).

Q: Why wasn't the solution simply a toggle to change honorTimestamps on the kubelet ServiceMonitor? Why is DedicatedServiceMonitors not the default setting if it is more exact? Why was it preferable to make the default kubelet ServiceMonitor less exact (honorTimestamps=false) but free of stale data? An alternative design could have made the kubelet ServiceMonitor more exact (honorTimestamps=true) and offered a dedicated Service Monitor with honorTimestamps=false for customers who prefer no stale data at the cost of accuracy.

Switching on dedicated service monitors incurs a small overhead of ingesting two additional time series per pod in the cluster.
Red Hat engineering decided to add this as an additional configuration because they wanted to preserve the existing alerting behavior for all users: the alerts are expected to work well out of the box for everyone. Red Hat did not enable the setting by default because the number of users relying on pod autoscaling based on CPU usage is comparatively small next to the number of users not using autoscaling at all.
In order to avoid the small overhead for users that wouldn't benefit from it, Red Hat engineering decided to make this behavior opt-in. Users that do opt-in benefit from accurate autoscaling and accurate and fast alerting.

Q: Is DedicatedServiceMonitors more exact / does it provide better metrics than the original kubelet ServiceMonitor with honorTimestamps=true (i.e. the situation before https://github.com/prometheus-operator/kube-prometheus/pull/695)? If so, why?

The ingested data samples are actually the same; the only thing that changes is the timestamp associated with each value. Accuracy changes when the data is aggregated over time, because there the timestamps do play a role: functions like sum and avg become more accurate. For queries that rely only on the latest data, both metrics yield the same result.

Q: Does DedicatedServiceMonitors suffer from the same stale data issue that the default configuration was suffering from with honorTimestamps=true? If so, doesn't this mean that each configuration has a different trade-off? The default configuration would be less exact but with less overhead and no stale data, whereas DedicatedServiceMonitors introduces exact metrics but at the cost of a slight overhead and stale data?

The metrics obtained through DedicatedServiceMonitors do indeed suffer from the old staleness notion. Red Hat OCP however uses them only for prometheus-adapter queries to implement the metrics API. These queries target specific pods, which are retrieved from the k8s API. In effect the stale metrics have no negative impact.

Q: Is Red Hat working on a solution without stale data, but with high accuracy?

Red Hat and the upstream community are indeed working on such a solution. The relevant discussion happens in https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2371-cri-pod-container-stats/README.md.
The goal is to replace cAdvisor. Several people have looked into fixing the performance issues in cAdvisor; these performance issues are ultimately the reason why it exposes metrics with explicit timestamps. This is, however, a large effort that will likely take multiple Kubernetes releases. Since container metrics exposure is being moved into the Container Runtime Interface (CRI), all existing and supported container runtimes need to implement this new feature. This is a lengthy process, and it motivated Red Hat engineering to deliver a workaround that works well and incurs minimal overhead.

Q: Why does DedicatedServiceMonitors pull data from /metrics/resource instead of /metrics/cadvisor?

The answer to this can be found in the discussion here: https://github.com/openshift/cluster-monitoring-operator/pull/1752
According to the discussion, targeting /metrics/resource is more efficient on the Prometheus side because no metrics need to be dropped. It might also be less demanding on the kubelet side.
Looking at the kubelet code base, the /metrics/resource handler seems to be more lightweight than cAdvisor in terms of resource usage.
The other advantage is that one can scope the prometheus-adapter's queries using the metrics_path label (e.g. container_cpu_usage_seconds_total{metrics_path="/metrics/resource",...})
https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/
https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#metrics-server

The data and the data source remain the same; the kubelet endpoint only adds a small performance improvement.
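The metrics_path scoping described above can be sketched in prometheus-adapter's configuration. The fragment below is a simplified illustration of a resource rule, not the exact configuration CMO renders; the memory rule and node queries are omitted:

```yaml
# Simplified sketch of a prometheus-adapter resource rule -- not the
# exact configuration rendered by CMO.
resourceRules:
  cpu:
    # The metrics_path matcher scopes the query to the series coming
    # from the dedicated /metrics/resource scrape, so the duplicate
    # cAdvisor series are never considered.
    containerQuery: |
      sum(rate(container_cpu_usage_seconds_total{metrics_path="/metrics/resource",<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    containerLabel: container
```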

Q: Why is the DedicatedServiceMonitors feature using a different prefix?

The conversation in https://github.com/openshift/cluster-monitoring-operator/pull/1752 clarifies this:
Among other things, the name prefix makes it clearer that the new metric really doesn't behave the same way. The different timestamps and staleness handling could otherwise lead to some surprising behavior.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.