Alerts in the Telemetry Operator
You create alert rules in Prometheus. The Prometheus servers send the alerts to an Alertmanager instance that manages the alerts. You create alert routes in Alertmanager to silence, inhibit, or aggregate alerts, and send notifications by using email, on-call notification systems, or chat platforms.
To create an alert, complete the following tasks:
- Enable Alertmanager for the Telemetry Operator.
- Create an alert rule in Prometheus.
- Create an alert route in Alertmanager.
Prerequisites
- The Red Hat OpenStack Services on OpenShift (RHOSO) environment is deployed on a Red Hat OpenShift Container Platform (RHOCP) cluster. For more information, see Deploying Red Hat OpenStack Services on OpenShift.
- You are logged on to a workstation that has access to the RHOCP cluster as a user with `cluster-admin` privileges.
- The Telemetry service is enabled and configured on the control plane. For more information, see the `telemetry` configuration in Creating the control plane.
Creating an alert rule in Prometheus
Prometheus evaluates alert rules to trigger notifications. If a rule condition returns an empty result set, the condition is false and no alert notification is triggered. If a rule condition returns a set of results, the condition is true and Prometheus triggers an alert notification.
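This empty-versus-non-empty evaluation logic can be sketched in a few lines of Python. This is a conceptual illustration only, not Prometheus code; the `evaluate_rule` function and sample data are assumptions made for the example:

```python
# Conceptual sketch: a Prometheus alert rule fires only when its
# expression returns a non-empty result set.

def evaluate_rule(query_results):
    """Return True (alert fires) if the rule expression matched any series."""
    # An empty result set means the condition is false: no alert.
    # Any returned series means the condition is true: an alert fires.
    return len(query_results) > 0

# An `up == 0` style query: only targets that are down appear in the result.
targets = {"node-exporter": 1, "ceilometer": 0}
down = [name for name, up in targets.items() if up == 0]

assert evaluate_rule([]) is False   # all targets healthy: no alert
assert evaluate_rule(down) is True  # "ceilometer" is down: alert fires
```

This is also why the example rules below filter with matchers such as `phase!="Running"`: healthy pods drop out of the result set, so the rule only fires when something matches.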
Procedure
- Create a file on your workstation that defines a `PrometheusRule` CR that contains the alert rules, for example, `openstack-observability-services-alerts.yaml`, and define the alert rules required for your environment. The following example defines four sample rules that you can use to get an alert if a component is down in your deployment:

  ```yaml
  apiVersion: monitoring.rhobs/v1
  kind: PrometheusRule
  metadata:
    labels:
      service: metricStorage
    name: openstack-observability-services-alerts
    namespace: openstack
  spec:
    groups:
    - name: openstack-observability.services.status
      rules:
      - alert: OpenStackServicesDownWarning
        expr: |
          (
            kube_pod_info{created_by_kind=~"ReplicaSet|StatefulSet|OpenStackClient"}
            * on(uid) group_left(phase)
            (kube_pod_status_phase{phase!="Running"} == 1)
          )
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: "OpenStack Service pod not running (warning)"
          description: |
            Pod {{ $labels.pod }} (controlled by {{ $labels.created_by_name }})
            in namespace {{ $labels.namespace }} is not Running.
            Current state: {{ $labels.phase }}
            This has been the case for more than 10 seconds.
      - alert: OpenStackServicesDownCritical
        expr: |
          (
            kube_pod_info{created_by_kind=~"ReplicaSet|StatefulSet|OpenStackClient"}
            * on(uid) group_left(phase)
            (kube_pod_status_phase{phase!="Running"} == 1)
          )
        for: 60s
        labels:
          severity: critical
        annotations:
          summary: "OpenStack Service pod not running (critical)"
          description: |
            Pod {{ $labels.pod }} (controlled by {{ $labels.created_by_name }})
            in namespace {{ $labels.namespace }} is not Running.
            Current state: {{ $labels.phase }}
            This has been the case for more than 1 minute.
    - name: openstack-observability.scrapeconfig.status
      rules:
      - alert: OpenStackObservabilityDownWarning
        expr: |
          (
            up{job=~"scrapeConfig/openstack/telemetry-(ceilometer|podman-exporter|node-exporter)"} == 0
            or
            up{service="metric-storage-prometheus"} == 0
          )
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: "Telemetry component down (warning)"
          description: |
            One of the RHOSO observability scrapeconfigs is down for more than 10 seconds.
            Instance: {{ $labels.instance }}
            Job: {{ $labels.job }}
      - alert: OpenStackObservabilityDownCritical
        expr: |
          (
            up{job=~"scrapeConfig/openstack/telemetry-(ceilometer|podman-exporter|node-exporter)"} == 0
            or
            up{service="metric-storage-prometheus"} == 0
          )
        for: 60s
        labels:
          severity: critical
        annotations:
          summary: "Telemetry component down (critical)"
          description: |
            One of the RHOSO observability scrapeconfigs is down for more than 1 minute.
            Instance: {{ $labels.instance }}
            Job: {{ $labels.job }}
  ```

  `metadata.labels.service`: This field must be set to `metricStorage` to ensure that the Prometheus instance managed by `telemetry-operator` loads the rule and that the rule becomes visible in the Prometheus dashboard.
  For more information about how to configure alerting rules, see Alerting rules in the Prometheus documentation.
- Create the `PrometheusRule` object:

  ```
  $ oc create -f openstack-observability-services-alerts.yaml
  ```

  The Cluster Observability Operator (COO) loads the rule into Prometheus.
- Verify that the COO loaded the rules into Prometheus:

  ```
  $ oc get prometheusrules.monitoring.rhobs -n openstack
  ```

  NOTE: You must pass the entire CRD name, `prometheusrules.monitoring.rhobs`, because a different `PrometheusRule` CRD, `prometheusrules.monitoring.coreos.com`, provides the rules for the RHOCP Monitoring API.
- Optional: Expose the Prometheus and Alertmanager services to access the Prometheus and Alertmanager dashboards:

  ```
  $ oc expose svc metric-storage-prometheus
  $ oc expose svc metric-storage-alertmanager
  ```

  NOTE: There is no access control in front of the Prometheus and Alertmanager dashboards. Exposing the services allows anybody who has the route to access your Prometheus and Alertmanager dashboards.
Creating an alert route in Alertmanager
You can configure the Telemetry Operator Alertmanager instance to deliver alert notifications to an external system, such as email, IRC, or another notification channel. The Telemetry Operator does not configure any external notifications by default. To send alert notifications to an external system, you create a Red Hat OpenShift Container Platform (RHOCP) secret that contains the configuration for Alertmanager to use. You can also build the Alertmanager configuration from locally stored templates that are referenced from the Secret alongside the native Alertmanager configuration files.
Procedure
- Create a file on your workstation named `alertmanager.yaml` that contains the native Alertmanager configuration that you want to use. For example, the following configuration sends notifications to a webhook service:

  ```yaml
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhook_configs:
    - url: 'http://example.com/'
  ```

  For information about how to configure Alertmanager, see Configuration in the Alertmanager documentation.
- Create a file on your workstation named `alertmanager-metric-storage-secret.yaml` to define the Alertmanager secret:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: alertmanager-metric-storage
    namespace: openstack
  type: Opaque
  data:
    alertmanager.yaml: {BASE64_CONFIG}
  ```

  `metadata.name`: The `Secret` name must follow the naming convention format, `alertmanager-<name_of_Alertmanager_resource>`, in order for Alertmanager to pick up the configuration.

  Replace `{BASE64_CONFIG}` with the base64-encoded Alertmanager configuration. You can use the following command to generate a base64-encoded configuration:

  ```
  $ cat alertmanager.yaml | base64 -w 0
  ```
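The encoding step above is a plain base64 of the file contents on a single line. A quick Python equivalent, useful on platforms where `base64 -w 0` is unavailable, might look like this (the inline `config` string stands in for your real `alertmanager.yaml`):

```python
import base64

# Base64-encode the Alertmanager configuration for the Secret's
# data.alertmanager.yaml field (single line, no wrapping).
config = "route:\n  receiver: 'webhook'\n"  # stands in for alertmanager.yaml
encoded = base64.b64encode(config.encode()).decode()

# Decoding must round-trip back to the original configuration,
# and the encoded form must be a single unwrapped line.
assert base64.b64decode(encoded).decode() == config
assert "\n" not in encoded
print(encoded)
```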
- Optional: Create a local template that configures the layout and format of the alert notification. For example, you can use a template to add a title and text to the alert. Each template must be stored in the Secret as a base64-encoded entry:

  ```yaml
  data:
    alertmanager.yaml: {BASE64_CONFIG}
    template_1.tmpl: {BASE64_TEMPLATE_1}
    template_2.tmpl: {BASE64_TEMPLATE_2}
  ```

  For information about how to create Alertmanager templates, see the Go template documentation and Defining reusable templates in the Alertmanager documentation.
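As an illustration, a template file referenced as `template_1.tmpl` might define a notification title like the following. The template name `mytitle` and the fields used are example assumptions, not product defaults; `.Status` and `.CommonLabels` are standard Alertmanager notification data, and `toUpper` is a standard Alertmanager template function:

```
{{ define "mytitle" }}[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}{{ end }}
```

A receiver in `alertmanager.yaml` could then reference the template by name, for example with `title: '{{ template "mytitle" . }}'` in a `slack_configs` entry.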
- Create the `Secret` CR to apply the Alertmanager configuration to the Alertmanager instance managed by the `telemetry-operator`:

  ```
  $ oc apply -f alertmanager-metric-storage-secret.yaml -n openstack
  ```
- Create a file on your workstation named `alertmanager-metric-storage.yaml` to define the `Alertmanager` CR:

  ```yaml
  apiVersion: monitoring.rhobs/v1
  kind: Alertmanager
  metadata:
    name: metric-storage
    namespace: openstack
  spec:
    configSecret: alertmanager-metric-storage
  ```

  `metadata.name`: Specifies the name of the `metricStorage` Alertmanager instance.

  `spec.configSecret`: Specifies the name of the `Secret` CR that contains the alert route configuration.
- Apply the modified `Alertmanager` CR to the control plane:

  ```
  $ oc apply -f alertmanager-metric-storage.yaml -n openstack --server-side
  ```

  NOTE: You must include the `--server-side` flag to apply the Alertmanager configuration with Server-Side Apply (SSA) because the Alertmanager resource is managed by the Telemetry Operator. For more information about SSA with the Cluster Observability Operator, see Using Server-Side Apply to customize Prometheus resources.
- Verify that the `Alertmanager` configuration is applied:

  ```
  $ oc get alertmanager.monitoring.rhobs metric-storage -n openstack -o yaml --show-managed-fields | grep configSecret
        f:configSecret: {}
    configSecret: alertmanager-metric-storage
  ```
Additional resources
- Prometheus user guide on alerting
- Alerting overview
- Alerting Routes