Logging alerts

Red Hat OpenShift Logging 6.5

Configuring logging alerts.

Red Hat OpenShift Documentation Team

Abstract

This document provides information about configuring logging alerts.

Chapter 1. Default logging alerts

Logging alerts are installed as part of the Red Hat OpenShift Logging Operator installation. Alerts depend on metrics exported by the log collection and log storage backends. These metrics are enabled if you selected the option to Enable Operator recommended cluster monitoring on this namespace when installing the Red Hat OpenShift Logging Operator.
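
If you did not select that option during installation, you can opt the Operator namespace in to cluster monitoring afterwards. The following command is a minimal sketch, assuming the Operator is installed in the openshift-logging namespace and that the installer option corresponds to the openshift.io/cluster-monitoring namespace label:

$ oc label namespace openshift-logging openshift.io/cluster-monitoring="true"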

Default logging alerts are sent to the OpenShift Container Platform monitoring stack Alertmanager in the openshift-monitoring namespace, unless you have disabled the local Alertmanager instance.

1.1. Accessing the Alerting UI from the Administrator perspective

You can access the Alerting user interface (UI) through the Administrator perspective of the OpenShift Container Platform web console.

Prerequisites

  • You have administrator permissions.
  • You have access to the OpenShift Container Platform web console.

Procedure

  • From the Administrator perspective, go to Observe → Alerting. The three main pages in the Alerting UI in this perspective are the Alerts, Silences, and Alerting rules pages.

1.2. Red Hat OpenShift Logging Operator alerts

The following alerts are generated by the Vector collector. You can view these alerts in the OpenShift Container Platform web console.

Table 1.1. Vector collector alerts

Alert: CollectorNodeDown
Message: Prometheus could not scrape vector <instance> for more than 10m.
Description: Vector is reporting that Prometheus could not scrape a specific Vector instance.
Severity: critical

Alert: DiskBufferUsage
Message: Collectors potentially consuming too much node disk, <value>
Description: Collectors are consuming too much node disk on the host.
Severity: warning

Alert: CollectorHigh403ForbiddenResponseRate
Message: High rate of "HTTP 403 Forbidden" responses detected for collector <instance> in namespace <namespace> for output <label>. The rate of 403 responses is <rate> over the last 2 minutes, persisting for more than 5 minutes. This could indicate an authorization issue.
Description: At least 10% of sent requests responded with "HTTP 403 Forbidden" for collector <instance> in namespace <namespace> for the output <output>.
Severity: critical

1.3. Loki Operator alerts

The following alerts are generated by the Loki Operator. You can view these alerts in the OpenShift Container Platform web console.

Table 1.2. Loki Operator alerts

Alert: LokiIngesterFlushFailureRateCritical
Message: Loki ingester {{ $labels.pod }} in the namespace {{ $labels.namespace }} has a critical flush failure rate of {{ $value | humanizePercentage }} over the last 5 minutes. This requires immediate attention as data is not being flushed to the storage. Validate if the storage configuration is still valid and if the storage is still reachable. Current failure rate: {{ $value | humanizePercentage }} Threshold: 20%
Description: One or more Loki ingesters are failing to flush at least 20% of their chunks to backend storage over a 5-minute period. This indicates issues with storage connectivity, authentication, or storage capacity that require immediate intervention.
Severity: critical

Alert: LokiRequestErrors
Message: {{ $labels.job }} {{ $labels.route }} is experiencing <value>% errors.
Description: At least 10% of requests result in 5xx server errors.
Severity: critical

Alert: LokiStackWriteRequestErrors
Message: <value>% of write requests from {{ $labels.job }} in <namespace> are returned with server errors.
Description: At least 10% of write requests to the lokistack-gateway result in 5xx server errors.
Severity: critical

Alert: LokiStackReadRequestErrors
Message: <value>% of query requests from {{ $labels.job }} in <namespace> are returned with server errors.
Description: At least 10% of query requests to the lokistack-gateway result in 5xx server errors.
Severity: critical

Alert: LokiRequestPanics
Message: {{ $labels.job }} is experiencing an increase of <value> panics.
Description: A panic was triggered.
Severity: critical

Alert: LokiRequestLatency
Message: {{ $labels.job }} {{ $labels.route }} is experiencing <value>s 99th percentile latency.
Description: The 99th percentile latency is higher than 1 second.
Severity: critical

Alert: LokiTenantRateLimit
Message: {{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
Description: At least 10% of requests receive the rate limit error code (429).
Severity: warning

Alert: LokiStorageSlowWrite
Message: The storage path is experiencing slow write response rates.
Description: The storage path is experiencing slow write response rates.
Severity: warning

Alert: LokiWritePathHighLoad
Message: The write path is experiencing high load.
Description: The write path is experiencing high load, causing backpressure storage flushing.
Severity: warning

Alert: LokiReadPathHighLoad
Message: The read path is experiencing high load.
Description: The read path has a high volume of queries, causing longer response times.
Severity: warning

Alert: LokiDiscardedSamplesWarning
Message: Loki in namespace "<namespace>" is discarding samples in the "<tenant>" tenant during ingestion. Samples are discarded because of "<reason>" at a rate of <value> samples per second.
Description: Loki is discarding samples during ingestion because they fail validation.
Severity: warning

Alert: LokistackComponentsNotReadyWarning
Message: The LokiStack "{{ $labels.stack_name }}" in namespace "{{ $labels.namespace }}" has components that are not ready.
Description: The LokiStack resource is reporting that some components have not reached the Ready state. This might be related to OpenShift resources such as pods or deployments, configuration, or external dependencies.
Severity: warning

Alert: LokistackSchemaUpgradesRequired
Message: The LokiStack "{{ $labels.stack_name }}" in namespace "<namespace>" is using a storage schema configuration that does not contain the latest schema version. It is recommended to update the schema configuration to the latest schema version.
Description: One or more of the deployed LokiStacks contains an outdated storage schema configuration.
Severity: warning

Chapter 2. Custom logging alerts

You can configure the LokiStack deployment to produce customized alerts and recorded metrics. If you want to use customized alerting and recording rules, you must enable the LokiStack ruler component.

LokiStack log-based alerts and recorded metrics are triggered by providing LogQL expressions (see the Grafana documentation) to the ruler component.

To provide these expressions, you must create an AlertingRule custom resource (CR) containing Prometheus-compatible alerting rules, or a RecordingRule CR containing Prometheus-compatible recording rules (see the Prometheus documentation).
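
For reference, the following RecordingRule CR is a minimal sketch, assuming an application tenant and a hypothetical app-ns namespace; the record and expr fields follow the Prometheus recording-rule convention, with expr holding a LogQL expression:

    apiVersion: loki.grafana.com/v1
    kind: RecordingRule
    metadata:
      name: app-recording-rules
      namespace: app-ns
      labels:
        openshift.io/cluster-monitoring: "true" # assumed to match the LokiStack spec.rules.selector
    spec:
      tenantID: application
      groups:
        - name: app-error-metrics
          interval: 1m
          rules:
            - record: app:error_logs:rate1m # hypothetical recorded metric name
              expr: |
                sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job)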

Administrators can configure log-based alerts or recorded metrics for application, audit, or infrastructure tenants. Users without administrator permissions can configure log-based alerts or recorded metrics for application tenants of the applications that they have access to.

Application, audit, and infrastructure alerts are sent by default to the OpenShift Container Platform monitoring stack Alertmanager in the openshift-monitoring namespace, unless you have disabled the local Alertmanager instance. If the Alertmanager that is used to monitor user-defined projects in the openshift-user-workload-monitoring namespace is enabled, application alerts are sent to the Alertmanager in this namespace by default.
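
User-defined project monitoring is enabled through the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. The following is a minimal sketch, assuming an otherwise default monitoring stack configuration:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true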

2.1. Configuring the ruler

When the LokiStack ruler component is enabled, users can define a group of LogQL expressions (see the Grafana documentation) that trigger logging alerts or recorded metrics.

Administrators can enable the ruler by modifying the LokiStack custom resource (CR).

Prerequisites

  • You have installed the Red Hat OpenShift Logging Operator and the Loki Operator.
  • You have created a LokiStack CR.
  • You have administrator permissions.

Procedure

  • Enable the ruler by ensuring that the LokiStack CR has the following spec configuration:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: <name>
      namespace: <namespace>
    spec:
    # ...
      rules:
        enabled: true 1
        selector: 2
          matchLabels:
            <label_name>: "true" 3
        namespaceSelector: 4
          matchLabels:
            <label_name>: "true" 5
    1
    Enable Loki alerting and recording rules in your cluster.
    2
    Specify the selector for the alerting and recording resources.
    3
    Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
    4
    Specify the namespaces in which the alerting and recording rules are defined for the Loki Operator. If undefined, only the rules defined in the same namespace as the LokiStack are used.
    5
    Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
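
As a usage sketch, assuming you chose openshift.io/cluster-monitoring as the custom label name, as the AlertingRule examples later in this document do, you can opt a rule namespace in by labeling it:

$ oc label namespace app-ns openshift.io/cluster-monitoring="true"

The ruler then evaluates rules from that namespace, provided the AlertingRule or RecordingRule resources themselves also carry a label that matches spec.rules.selector. The namespace app-ns is hypothetical.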

2.2. Authorizing LokiStack rules RBAC permissions

Administrators can allow users to create and manage their own alerting and recording rules by binding cluster roles to usernames. Cluster roles are defined as ClusterRole objects that contain necessary role-based access control (RBAC) permissions for users.

The following cluster roles for alerting and recording rules are available for LokiStack:

Table 2.1. LokiStack alerting and recording rule cluster roles

Cluster role: alertingrules.loki.grafana.com-v1-admin
Description: Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch AlertingRule resources within the loki.grafana.com/v1 API group.

Cluster role: alertingrules.loki.grafana.com-v1-crdview
Description: Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to AlertingRule resources within the loki.grafana.com/v1 API group, but do not have permissions to modify or manage these resources.

Cluster role: alertingrules.loki.grafana.com-v1-edit
Description: Users with this role have permission to create, update, and delete AlertingRule resources.

Cluster role: alertingrules.loki.grafana.com-v1-view
Description: Users with this role can read AlertingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

Cluster role: recordingrules.loki.grafana.com-v1-admin
Description: Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch RecordingRule resources within the loki.grafana.com/v1 API group.

Cluster role: recordingrules.loki.grafana.com-v1-crdview
Description: Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to RecordingRule resources within the loki.grafana.com/v1 API group, but do not have permissions to modify or manage these resources.

Cluster role: recordingrules.loki.grafana.com-v1-edit
Description: Users with this role have permission to create, update, and delete RecordingRule resources.

Cluster role: recordingrules.loki.grafana.com-v1-view
Description: Users with this role can read RecordingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing recording rules but cannot make any modifications to them.

2.2.1. Examples

To apply cluster roles for a user, you must bind an existing cluster role to a specific username.

Cluster roles can be cluster or namespace scoped, depending on which type of role binding you use. When a RoleBinding object is used, as when using the oc adm policy add-role-to-user command, the cluster role only applies to the specified namespace. When a ClusterRoleBinding object is used, as when using the oc adm policy add-cluster-role-to-user command, the cluster role applies to all namespaces in the cluster.

The following example command gives the specified user create, read, update and delete (CRUD) permissions for alerting rules in a specific namespace in the cluster:

Example cluster role binding command for alerting rule CRUD permissions in a specific namespace

$ oc adm policy add-role-to-user alertingrules.loki.grafana.com-v1-admin -n <namespace> <username>

The following command gives the specified user administrator permissions for alerting rules in all namespaces:

Example cluster role binding command for administrator permissions

$ oc adm policy add-cluster-role-to-user alertingrules.loki.grafana.com-v1-admin <username>
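
Behind the scenes, these commands create RoleBinding or ClusterRoleBinding objects that reference the cluster role. The following RoleBinding is a hedged sketch of what the namespaced variant produces, assuming the hypothetical namespace app-ns and username developer; the binding name that oc generates may differ:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: alertingrules-v1-admin # hypothetical; oc generates its own name
      namespace: app-ns
    subjects:
      - kind: User
        name: developer
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: alertingrules.loki.grafana.com-v1-admin
      apiGroup: rbac.authorization.k8s.io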

2.3. Creating a log-based alerting rule with Loki

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

  • If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.
  • If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
  • If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
  • If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
  • If none of the above applies, an alerting rule is considered valid.

AlertingRule CRs are valid only in certain namespaces, depending on the tenant type:

Table 2.2. Valid namespaces for AlertingRule CRs by tenant type

Tenant type      Valid namespaces for AlertingRule CRs
audit            openshift-logging
infrastructure   openshift-*, kube-*, default
application      All other namespaces

Prerequisites

  • You have installed Red Hat OpenShift Logging Operator 5.7 or later.
  • You are using OpenShift Container Platform 4.13 or later.

Procedure

  1. Create an AlertingRule custom resource (CR):

    Example infrastructure AlertingRule CR

      apiVersion: loki.grafana.com/v1
      kind: AlertingRule
      metadata:
        name: loki-operator-alerts
        namespace: openshift-operators-redhat 1
        labels: 2
          openshift.io/cluster-monitoring: "true"
      spec:
        tenantID: infrastructure 3
        groups:
          - name: LokiOperatorHighReconciliationError
            rules:
              - alert: HighPercentageError
                expr: | 4
                  sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                    /
                  sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                    > 0.01
                for: 10s
                labels:
                  severity: critical 5
                annotations:
                  summary: High Loki Operator Reconciliation Errors 6
                  description: High Loki Operator Reconciliation Errors 7

    1
    The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    2
    The labels block must match the LokiStack spec.rules.selector definition.
    3
    AlertingRule CRs for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces.
    4
    The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    5
    The value of this mandatory field must be critical, warning, or info.
    6
    This field is mandatory.
    7
    This field is mandatory.

    Example application AlertingRule CR

      apiVersion: loki.grafana.com/v1
      kind: AlertingRule
      metadata:
        name: app-user-workload
        namespace: app-ns 1
        labels: 2
          openshift.io/cluster-monitoring: "true"
      spec:
        tenantID: application
        groups:
          - name: AppUserWorkloadHighError
            rules:
              - alert: AppUserWorkloadHighError
                expr: | 3
                  sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
                for: 10s
                labels:
                  severity: critical 4
                annotations:
                  summary: This is an example summary. 5
                  description: This is an example description. 6

    1
    The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    2
    The labels block must match the LokiStack spec.rules.selector definition.
    3
    Value for kubernetes_namespace_name: must match the value for metadata.namespace.
    4
    The value of this mandatory field must be critical, warning, or info.
    5
    The value of this mandatory field is a summary of the rule.
    6
    The value of this mandatory field is a detailed description of the rule.
  2. Apply the AlertingRule CR:

    $ oc apply -f <filename>.yaml
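
Verification

  • Confirm that the rule was accepted by listing AlertingRule resources in the target namespace. Because the validating webhook rejects invalid CRs at apply time, a successfully applied rule appears in this list:

    $ oc get alertingrules.loki.grafana.com -n <namespace>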

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.