Configuring observability

Red Hat OpenStack Services on OpenShift 18.0

Configuring the Telemetry service (ceilometer, prometheus) to manage observability metrics for a Red Hat OpenStack Services on OpenShift deployment

Abstract

Configure and manage the Telemetry service (ceilometer, prometheus) to collect observability metrics for a Red Hat OpenStack Services on OpenShift deployment.

Providing feedback on Red Hat documentation

We appreciate your feedback. Tell us how we can improve the documentation.

To provide documentation feedback for Red Hat OpenStack Services on OpenShift (RHOSO), create a Jira issue in the OSPRH Jira project.

Procedure

  1. Log in to the Red Hat Atlassian Jira.
  2. Click the following link to open a Create Issue page: Create issue.
  3. Complete the Summary and Description fields. In the Description field, include the documentation URL, chapter or section number, and a detailed description of the issue.
  4. Click Create.
  5. Review the details of the bug you created.

Chapter 1. Configuring observability

Use observability to gain insights into the metrics, logs, and alerts from your Red Hat OpenStack Services on OpenShift (RHOSO) deployment. You can configure observability by editing the default Telemetry service (ceilometer, prometheus) in your OpenStackControlPlane custom resource (CR) file.

1.1. RHOSO observability architecture

The observability architecture in Red Hat OpenStack Services on OpenShift (RHOSO) is composed of services within Red Hat OpenShift Container Platform (RHOCP), as well as services on your Compute nodes that provide metrics, logs, and alerts. You can use Red Hat OpenShift Observability for insight into your RHOSO environment and for collecting, storing, and searching through logs.

Important

The observability platform available with RHOSO does not guarantee the delivery of metrics. Metrics are exposed for scraping, but they are not cached. If data is dropped, there is no way to retrospectively fill in the gaps, which might result in incomplete metrics.

1.2. Configuring observability on the control plane

The Telemetry service (ceilometer, prometheus) is enabled by default in a Red Hat OpenStack Services on OpenShift (RHOSO) deployment. You can configure observability by editing the Telemetry service in your OpenStackControlPlane custom resource (CR) file.

Prerequisites

  • The control plane includes initial configuration of the Telemetry service. For more information, see the telemetry configuration in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift.

Procedure

  1. Open your OpenStackControlPlane CR file, openstack_control_plane.yaml, on your workstation.
  2. Configure the Telemetry service, telemetry, as required for your environment:

      telemetry:
        enabled: true
        template:
          metricStorage:
            enabled: true
            dashboardsEnabled: true
            dataplaneNetwork: ctlplane
            networkAttachments:
              - ctlplane
            monitoringStack:
              alertingEnabled: true
              scrapeInterval: 30s
              storage:
                strategy: persistent
                retention: 24h
                persistent:
                  pvcStorageRequest: 20G
          autoscaling:
            enabled: false
            aodh:
              notificationsBus:
                cluster: rabbitmq-notification
              databaseAccount: aodh
              databaseInstance: openstack
              secret: osp-secret
            heatInstance: heat
          ceilometer:
            enabled: true
            notificationsBus:
              cluster: rabbitmq-notification
            secret: osp-secret
          logging:
            enabled: false
            annotations:
              metallb.universe.tf/address-pool: internalapi
              metallb.universe.tf/allow-shared-ip: internalapi
              metallb.universe.tf/loadBalancerIPs: 172.17.0.80
          cloudkitty:
            enabled: false
            messagingBus:
              cluster: rabbitmq
            s3StorageConfig:
              schemas:
              - effectiveDate: "2024-11-18"
                version: v13
              secret:
                name: cloudkitty-loki-s3
                type: s3
    • metricStorage.monitoringStack.scrapeInterval: Specifies the interval at which new metrics are gathered. Changing this interval can affect performance.
    • metricStorage.monitoringStack.storage.retention: Specifies the length of time that telemetry metrics are stored. The duration affects the amount of storage required.
    • metricStorage.monitoringStack.storage.persistent.pvcStorageRequest: Specifies the amount of storage to allocate to the Prometheus time series database.
    • autoscaling.enabled: Set to true to enable autoscaling. The autoscaling field must be present even when autoscaling is disabled. For more information about autoscaling, see Autoscaling for Instances.
    • ceilometer.enabled: Set to false to disable the ceilometer service. If you do not disable ceilometer, a Prometheus metrics exporter is created and exposed from inside the cluster at the following URL: http://ceilometer-internal.openstack.svc:3000/metrics
    • logging.enabled: Set to true to enable observability logging. For more information about configuring observability logging, see Enabling RHOSO observability logging.
    • cloudkitty.enabled: Set to true to enable the Rating service (cloudkitty). For more information about configuring chargeback and rating capabilities, see Enabling the Rating service on the control plane.
    • aodh.notificationsBus.cluster and ceilometer.notificationsBus.cluster: Set to a dedicated RabbitMQ cluster for notifications, as in this example, or to a combined RabbitMQ cluster for both RPC and notifications, as in the default RHOSO deployment that uses the combined rabbitmq cluster.
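The interplay between scrapeInterval, retention, and pvcStorageRequest can be reasoned about with simple arithmetic. The following sketch estimates the Prometheus TSDB footprint; the active series count and bytes-per-sample figures are illustrative assumptions, not sizing guidance from this product:

```python
# Rough estimate of Prometheus TSDB storage from the metricStorage settings.
# The series count and bytes-per-sample figures below are assumptions.

def estimate_tsdb_bytes(series: int, scrape_interval_s: int,
                        retention_s: int, bytes_per_sample: float = 2.0) -> float:
    """Approximate on-disk size in bytes for the configured retention."""
    samples_per_series = retention_s / scrape_interval_s
    return series * samples_per_series * bytes_per_sample

# scrapeInterval: 30s and retention: 24h, with an assumed 50,000 active series
size = estimate_tsdb_bytes(series=50_000, scrape_interval_s=30,
                           retention_s=24 * 3600)
print(f"estimated TSDB size: {size / 2**20:.0f} MiB")
```

Longer retention or a shorter scrape interval scales the estimate linearly, which is why changing either value affects the storage that you request in pvcStorageRequest.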
  3. Update the control plane:

    $ oc apply -f openstack_control_plane.yaml -n openstack
  4. Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

    $ oc get openstackcontrolplane -n openstack
    NAME                      STATUS    MESSAGE
    openstack-control-plane   Unknown   Setup started

    The OpenStackControlPlane resources are created when the status is "Setup complete".

    Tip

    Append the -w option to the end of the get command to track deployment progress.

  5. Optional: Confirm that the control plane is deployed by reviewing the pods in the openstack namespace for each of your cells:

    $ oc get pods -n openstack

    The control plane is deployed when all the pods are either completed or running.

Verification

  1. Access the remote shell for the OpenStackClient pod from your workstation:

    $ oc rsh -n openstack openstackclient
  2. Confirm that you can query prometheus and that the scrape endpoints are active:

    $ openstack metric query up --disable-rbac -c container -c instance -c value

    Example output:

    +-----------------+------------------------+-------+
    | container       | instance               | value |
    +-----------------+------------------------+-------+
    | alertmanager    | 10.217.1.112:9093      | 1     |
    | prometheus      | 10.217.1.63:9090       | 0     |
    | proxy-httpd     | 10.217.1.52:3000       | 1     |
    |                 | 192.168.122.100:9100   | 1     |
    |                 | 192.168.122.101:9100   | 1     |
    +-----------------+------------------------+-------+
    Note

    Each entry in the value field should be "1" when there are active workloads scheduled on the cluster, except for the prometheus container. The prometheus container reports a value of "0" due to TLS, which is enabled by default.

  3. You can find the openstack-telemetry-operator dashboards by clicking Observe and then Dashboards in the RHOCP console. For more information about RHOCP dashboards, see Reviewing monitoring dashboards as a cluster administrator in the RHOCP Monitoring Guide.

1.3. Customizing Telemetry configuration files

You can customize Telemetry services by creating custom configuration files for your deployment. For instance, you can customize the Ceilometer service by modifying the polling.yaml file.

Prerequisites

  • The pod for the service that you want to customize exists in your deployment, and you are working in the openstack namespace. You can confirm this with the following command:

    $ oc get pods -A -o custom-columns="NAMESPACE:.metadata.namespace,POD:.metadata.name,CONTAINERS:.spec.containers[*].name" | grep <container>

Procedure

  1. Create the custom configuration file for the service you want to customize. If the service already has a configuration file, then you must give your custom file the same name so that your custom file replaces the existing configuration file. For example, to customize the configuration of the Ceilometer service, you can make a copy of polling.yaml for editing:

    $ oc rsh -n openstack -c ceilometer-central-agent ceilometer-0 cat /etc/ceilometer/polling.yaml > polling.yaml
  2. Add or update the configuration for the service as required. For example, to customize how often samples of volume and image size are polled, edit polling.yaml to contain the following configuration:

    sources:
      - name: pollsters
        interval: 300
        meters:
          - volume.size
          - image.size

    For more information on how to configure the Ceilometer service, see Polling properties.

  3. Create the Secret CR for the service with the custom configuration file:

    $ oc create secret generic <secret_name> \
     --from-file <custom_config.yaml> -n openstack
    Note

    You can specify the --from-file option as many times as required to pass more than one configuration file to the Secret CR for customizing the service.

  4. Verify that the Secret CR is created:

    $ oc describe secret <secret_name>
  5. Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
  6. Locate the service definition for telemetry and add the customConfigsSecretName field to the Telemetry service that you want to customize. The following example shows where to place the customConfigsSecretName field for each service that you can customize:

    spec:
      telemetry:
        template:
          autoscaling:
            aodh:
              customConfigsSecretName: <secret_name>
          ceilometer:
            customConfigsSecretName: <secret_name>
          cloudkitty:
            cloudKittyAPI:
              customConfigsSecretName: <secret_name>
            cloudKittyProc:
              customConfigsSecretName: <secret_name>
  7. Update the control plane:

    $ oc apply -f openstack_control_plane.yaml -n openstack

    Your custom file is copied from the Secret into the /etc/<service>/ folder. If a file with the same name already exists in the folder, it is replaced with the custom configuration file.

  8. Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

    $ oc get openstackcontrolplane -n openstack
  9. Verify that the custom configuration file is being used by the service:

    $ oc rsh -c <service_container> <service_pod> \
     cat /etc/<service>/<custom_config>.yaml

    Example:

    $ oc rsh -c ceilometer-central-agent ceilometer-0 cat /etc/ceilometer/polling.yaml

1.3.1. Polling properties

You can configure polling rules to poll for data that is not provided by service events and notifications, such as instance resource usage. Use the polling.yaml file to specify the polling plugins (pollsters) to enable, the interval at which they are polled, and the meters to poll for each source.

The following template describes how to define a source to poll:

sources:
  - name: <source_name>
    interval: <sample_generation_interval>
    meters:
      - <meter_filter>
      - <meter_filter>
      ...
      - <meter_filter>
  • interval: The interval in seconds between sample generation of the specified metrics.
  • meters: A list of meters to gather samples for. Each filter must match the meter name of the polling plugin.
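Upstream Ceilometer also accepts wildcard filters (for example, *.size) and exclusion filters prefixed with !, with the convention that included and excluded meters are not mixed in the same source. The following sketch illustrates the matching idea only; it is not Ceilometer's implementation:

```python
# Illustration of meter-filter matching against available pollster names.
# Not Ceilometer's code; it only demonstrates the selection idea.
from fnmatch import fnmatch

def select_meters(available, filters):
    includes = [f for f in filters if not f.startswith('!')]
    excludes = [f[1:] for f in filters if f.startswith('!')]
    if includes:
        chosen = {m for m in available if any(fnmatch(m, f) for f in includes)}
    else:
        # An exclusion-only source starts from every available meter
        chosen = set(available)
    return sorted(m for m in chosen
                  if not any(fnmatch(m, e) for e in excludes))

meters = ['volume.size', 'image.size', 'cpu', 'memory.usage']
print(select_meters(meters, ['*.size']))   # wildcard include
print(select_meters(meters, ['!cpu']))     # exclusion-only source
```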

1.4. Enabling Telemetry power monitoring on the data plane

Important

This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.

You can enable power monitoring on the data plane to collect power consumption metrics by adding the telemetry-power-monitoring service to each OpenStackDataPlaneNodeSet custom resource (CR) defined for the data plane.

Procedure

  1. Open the OpenStackDataPlaneNodeSet CR definition file for the node set you want to update, for example, openstack_data_plane.yaml.
  2. Add the services field and include all the required services, including the default services. Add telemetry-power-monitoring after telemetry:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneNodeSet
    metadata:
      name: openstack-data-plane
      namespace: openstack
    spec:
      tlsEnabled: true
      env:
        - name: ANSIBLE_FORCE_COLOR
          value: "True"
      services:
        - redhat
        - bootstrap
        - download-cache
        - configure-network
        - validate-network
        - install-os
        - configure-os
        - ssh-known-hosts
        - run-os
        - reboot-os
        - install-certs
        - ovn
        - neutron-metadata
        - libvirt
        - nova
        - telemetry
        - telemetry-power-monitoring

    For more information about deploying data plane services, see Deploying the data plane in the Deploying Red Hat OpenStack Services on OpenShift guide.

  3. Save the OpenStackDataPlaneNodeSet CR definition file.
  4. Apply the updated OpenStackDataPlaneNodeSet CR configuration:

    $ oc apply -f openstack_data_plane.yaml
  5. Verify that the data plane resource has been updated by confirming that the status is SetupReady:

    $ oc wait openstackdataplanenodeset openstack-data-plane --for condition=SetupReady --timeout=10m

    When the status is SetupReady, the command returns a condition met message; otherwise, it returns a timeout error.

    For information about the data plane conditions and states, see Data plane conditions and states in Deploying Red Hat OpenStack Services on OpenShift.

  6. Create a file on your workstation to define the OpenStackDataPlaneDeployment CR:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: <node_set_deployment_name>
    • Replace <node_set_deployment_name> with the name of the OpenStackDataPlaneDeployment CR. The name must be unique, must consist of lower case alphanumeric characters, - (hyphen) or . (period), and must start and end with an alphanumeric character.
    Tip

    Give the definition file and the OpenStackDataPlaneDeployment CR unique and descriptive names that indicate the purpose of the modified node set.
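The naming rule above matches the usual Kubernetes object-name convention. A simplified check is sketched below; the regular expression is an approximation of that rule, not taken from RHOSO:

```python
import re

# Simplified check for the naming rule described above: lowercase
# alphanumerics, '-' or '.', starting and ending with an alphanumeric.
# This regex approximates the rule; it does not enforce every DNS nuance.
NAME_RE = re.compile(r'^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$')

def is_valid_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(is_valid_name('data-plane-deploy-power'))  # valid
print(is_valid_name('Data_Plane_Deploy'))        # invalid: uppercase and '_'
```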

  7. Add the OpenStackDataPlaneNodeSet CR that you modified:

    spec:
      nodeSets:
        - <nodeSet_name>
  8. Save the OpenStackDataPlaneDeployment CR deployment file.
  9. Deploy the modified OpenStackDataPlaneNodeSet CR:

    $ oc create -f openstack_data_plane_deploy.yaml -n openstack

    You can view the Ansible logs while the deployment executes:

    $ oc get pod -l app=openstackansibleee -w
    $ oc logs -l app=openstackansibleee -f --max-log-requests 10

    If the oc logs command returns an error similar to the following, increase the --max-log-requests value:

    error: you are attempting to follow 19 log streams, but maximum allowed concurrency is 10, use --max-log-requests to increase the limit
  10. Verify that the modified OpenStackDataPlaneNodeSet CR is deployed:

    $ oc get openstackdataplanedeployment -n openstack
    NAME                   STATUS   MESSAGE
    openstack-data-plane   True     Setup Complete

    $ oc get openstackdataplanenodeset -n openstack
    NAME                   STATUS   MESSAGE
    openstack-data-plane   True     NodeSet Ready

    For information about the meaning of the returned status, see Data plane conditions and states in the Deploying Red Hat OpenStack Services on OpenShift guide.

    If the status indicates that the data plane has not been deployed, then troubleshoot the deployment. For information, see Troubleshooting the data plane creation and deployment in the Deploying Red Hat OpenStack Services on OpenShift guide.

  11. Verify that the telemetry-power-monitoring service is deployed by checking for the ceilometer_agent_ipmi and kepler containers on the data plane nodes:

    $ podman ps | grep -i -e ceilometer_agent_ipmi -e kepler

Chapter 2. Enabling RHOSO observability logging

Enable and configure Red Hat OpenStack Services on OpenShift (RHOSO) observability logging to collect, store, and access logs from your RHOSO environment. When observability logging is enabled on the control plane, you can enable the RHOSO observability logging service on the data plane.

2.1. Prerequisites

2.2. Enabling RHOSO observability logging on the control plane

To enable and configure observability logging on the control plane, you edit the Telemetry service in your OpenStackControlPlane custom resource (CR) file.

Procedure

  1. Open the OpenStackControlPlane CR definition file, openstack_control_plane.yaml, on your workstation.
  2. Update the telemetry section based on the needs of your environment:

     telemetry:
        enabled: true
        template:
          ...
          logging:
            enabled: true
            annotations:
              metallb.universe.tf/address-pool: internalapi
              metallb.universe.tf/allow-shared-ip: internalapi
              metallb.universe.tf/loadBalancerIPs: 172.17.0.80
    • logging.enabled: Set to true to enable observability logging.
    • logging.annotations.metallb.universe.tf/address-pool: Set to the RHOSO network that you want to use to transport the logs from the Compute nodes to the control plane.
    • logging.annotations.metallb.universe.tf/loadBalancerIPs: Set to the IP address that rsyslog sends messages to. Ensure that the IP address is reachable from the Compute node. The default IP address is the default VIP for internalapi, which is 172.17.0.80.
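Before you apply the CR, you can confirm that the load balancer IP is reachable from a Compute node. The following sketch tests a TCP connection; the port shown is a placeholder assumption, so substitute the port that your rsyslog configuration actually targets:

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 172.17.0.80 is the loadBalancerIPs value from the CR; the port is a
# placeholder assumption, not a value defined by this document.
print(reachable('172.17.0.80', 10514))
```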
  3. Update the control plane:

    $ oc apply -f openstack_control_plane.yaml -n openstack
  4. Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

    $ oc get openstackcontrolplane -n openstack
    NAME                      STATUS    MESSAGE
    openstack-control-plane   Unknown   Setup started

    The OpenStackControlPlane resources are created when the status is "Setup complete".

    Tip

    Append the -w option to the end of the get command to track deployment progress.

Verification

  1. Open the logging pane in the OpenShift Console.
  2. Select Observe → Logs.
  3. From the drop-down menu, choose Application.
  4. Verify that pod logs from the RHOSO control plane services are present.

2.3. Enabling RHOSO observability logging on the data plane

You can enable Red Hat OpenStack Services on OpenShift (RHOSO) observability logging on the data plane by adding the OpenStackDataPlaneService logging service to the services list of each OpenStackDataPlaneNodeSet custom resource (CR) defined for the data plane.

Prerequisites

  • RHOSO observability logging is enabled on the control plane.

Procedure

  1. Open the OpenStackDataPlaneNodeSet CR definition file for the node set you want to update, for example, openstack_data_plane.yaml.
  2. Add the services field and include all the required services, including the default services. Add logging after telemetry:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneNodeSet
    metadata:
      name: openstack-data-plane
      namespace: openstack
    spec:
      tlsEnabled: true
      env:
        - name: ANSIBLE_FORCE_COLOR
          value: "True"
      services:
        - redhat
        - bootstrap
        - download-cache
        - configure-network
        - validate-network
        - install-os
        - configure-os
        - ssh-known-hosts
        - run-os
        - reboot-os
        - install-certs
        - ovn
        - neutron-metadata
        - libvirt
        - nova
        - telemetry
        - logging
  3. Save the OpenStackDataPlaneNodeSet CR definition file.
  4. Apply the updated OpenStackDataPlaneNodeSet CR configuration:

    $ oc apply -f openstack_data_plane.yaml
  5. Verify that the data plane resource has been updated by confirming that the status is SetupReady:

    $ oc wait openstackdataplanenodeset openstack-data-plane --for condition=SetupReady --timeout=10m

    When the status is SetupReady, the command returns a condition met message; otherwise, it returns a timeout error.

    For information about the data plane conditions and states, see Data plane conditions and states in Deploying Red Hat OpenStack Services on OpenShift.

  6. Create a file on your workstation to define the OpenStackDataPlaneDeployment CR:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: <node_set_deployment_name>
    • Replace <node_set_deployment_name> with the name of the OpenStackDataPlaneDeployment CR. The name must be unique, must consist of lower case alphanumeric characters, - (hyphen) or . (period), and must start and end with an alphanumeric character.
    Tip

    Give the definition file and the OpenStackDataPlaneDeployment CR unique and descriptive names that indicate the purpose of the modified node set.

  7. Add the OpenStackDataPlaneNodeSet CR that you modified:

    spec:
      nodeSets:
        - <nodeSet_name>
  8. Save the OpenStackDataPlaneDeployment CR deployment file.
  9. Deploy the modified OpenStackDataPlaneNodeSet CR:

    $ oc create -f openstack_data_plane_deploy.yaml -n openstack

    You can view the Ansible logs while the deployment executes:

    $ oc get pod -l app=openstackansibleee -w
    $ oc logs -l app=openstackansibleee -f --max-log-requests 10

    If the oc logs command returns an error similar to the following, increase the --max-log-requests value:

    error: you are attempting to follow 19 log streams, but maximum allowed concurrency is 10, use --max-log-requests to increase the limit
  10. Verify that the modified OpenStackDataPlaneNodeSet CR is deployed:

    $ oc get openstackdataplanedeployment -n openstack
    NAME                   STATUS   MESSAGE
    openstack-data-plane   True     Setup Complete

    $ oc get openstackdataplanenodeset -n openstack
    NAME                   STATUS   MESSAGE
    openstack-data-plane   True     NodeSet Ready

    For information about the meaning of the returned status, see Data plane conditions and states in the Deploying Red Hat OpenStack Services on OpenShift guide.

    If the status indicates that the data plane has not been deployed, then troubleshoot the deployment. For information, see Troubleshooting the data plane creation and deployment in the Deploying Red Hat OpenStack Services on OpenShift guide.

  11. If you added a new node to the node set, then map the node to the Compute cell it is connected to:

    $ oc rsh nova-cell0-conductor-0 nova-manage cell_v2 discover_hosts --verbose

    If you did not create additional cells, this command maps the Compute nodes to cell1.

    Access the remote shell for the openstackclient pod and verify that the deployed Compute nodes are visible on the control plane:

    $ oc rsh -n openstack openstackclient
    $ openstack hypervisor list

Verification

  1. Open the logging pane in the OpenShift Console.
  2. Click Observe and then click Logs.
  3. From the drop-down menu, choose Infrastructure.
  4. Verify that Journald logs from the Compute nodes are present.

Chapter 3. Enabling cloud rating in a RHOSO environment

You can use the Rating service (cloudkitty) to monitor, collect, rate, and store metrics data about service and resource consumption in a RHOSO cloud.

Warning

The content for this feature is available in this release as a Documentation Preview, and therefore is not fully verified by Red Hat. Use it only for testing, and do not use in a production environment.

3.1. Prerequisites

The Telemetry Operator uses a Keystone fetcher to retrieve a list of projects to rate and a Prometheus collector to collect data from a Prometheus source for a given project and metric. The Telemetry Operator does not support using a Gnocchi collector or any other type of fetcher.

Note

The Telemetry Operator does not support the Rating service reprocessing API.

3.2. Enabling the Rating service on the control plane

You can enable the Rating service (cloudkitty) in the Telemetry Operator to provide chargeback and rating capabilities to RHOSO clouds.

Procedure

  1. Create a Secret CR definition file, for example, loki_s3_secret.yaml, on your workstation to connect Loki to the AWS S3 object storage you use for the Loki log store:

    apiVersion: v1
    kind: Secret
    metadata:
      name: cloudkitty-loki-s3
      namespace: openstack
    stringData:
      access_key_id: <your_key_id>
      access_key_secret: <your_key_secret>
      bucketnames: <your_bucket_name>
      endpoint: <your_s3_endpoint>
  2. Create the Secret CR in the cluster:

    $ oc create -f loki_s3_secret.yaml -n openstack
  3. Verify that the Secret CR is created:

    $ oc describe secret cloudkitty-loki-s3 -n openstack
  4. Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
  5. Locate the service definition for telemetry and check that Prometheus is enabled in your cluster by ensuring that metricStorage is enabled:

      telemetry:
        enabled: true
        template:
          metricStorage:
            enabled: true

    For more information about configuring metricStorage, see Configuring observability.

  6. Check how long metrics data is retained, and ensure that metrics are not stored for more than 7 days:

          template:
            metricStorage:
              enabled: true
              ...
              monitoringStack:
                ...
                storage:
                  strategy: persistent
                  retention: 24h
  7. Add the following cloudkitty configuration to enable the Rating service:

      telemetry:
        enabled: true
        template:
          ...
          cloudkitty:
            enabled: true
            s3StorageConfig:
              schemas:
              - effectiveDate: "2024-11-18"
                version: v13
              secret:
                name: cloudkitty-loki-s3
                type: s3
  8. Ensure that the Loki Operator log store is sized for your environment. For more information, see Loki deployment sizing.
  9. Update the control plane:

    $ oc apply -f openstack_control_plane.yaml -n openstack
  10. Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

    $ oc get openstackcontrolplane -n openstack
    NAME                      STATUS    MESSAGE
    openstack-control-plane   Unknown   Setup started

    The OpenStackControlPlane resources are created when the status is "Setup complete".

    Tip

    Append the -w option to the end of the get command to track deployment progress.

  11. Confirm that the control plane is deployed by reviewing the pods in the openstack namespace:

    $ oc get pods -n openstack

    The control plane is deployed when all the pods are either completed or running.

Verification

  1. Access the remote shell for the openstackclient pod and verify that the Rating service is enabled:

    $ oc rsh -n openstack openstackclient
    $ openstack rating module list
    +-----------+---------+----------+
    | Module    | Enabled | Priority |
    +-----------+---------+----------+
    | hashmap   | False   | 1        |
    | noop      | True    | 1        |
    | pyscripts | False   | 1        |
    +-----------+---------+----------+
  2. Verify that you can enable, disable, and change the priority of each module:

    • Enable the hashmap rating module:

      $ openstack rating module enable hashmap
      +---------+---------+----------+
      | Module  | Enabled | Priority |
      +---------+---------+----------+
      | hashmap | True    | 1        |
      +---------+---------+----------+
    • Disable the pyscripts rating module:

      $ openstack rating module disable pyscripts
      +-----------+---------+----------+
      | Module    | Enabled | Priority |
      +-----------+---------+----------+
      | pyscripts | False   | 1        |
      +-----------+---------+----------+
    • Set the hashmap rating module priority to 100:

      $ openstack rating module set priority hashmap 100
      +---------+---------+----------+
      | Module  | Enabled | Priority |
      +---------+---------+----------+
      | hashmap | True    | 100      |
      +---------+---------+----------+

3.3. Configuring metrics for rating rules

Rating rules assign a value to the consumption of computing resources. Use a custom metrics.yaml file to define the specific metrics that the Rating service (cloudkitty) processes for your deployment.

Procedure

  1. Copy the default metrics.yaml configuration file to define the rating metrics to collect for your deployment:

    $ oc rsh cloudkitty-proc-0 cat /etc/cloudkitty/metrics.yaml > metrics.yaml
    Note

    You must name your custom file metrics.yaml so that your custom file replaces the existing configuration file.

  2. Retrieve a list of the names of the available metrics that you can configure:

    $ oc rsh openstackclient openstack metric list
  3. Define how each metric is collected. The following example defines how to collect the ceilometer_image_size metric:

    metrics:
      ceilometer_image_size:
        unit: MiB
        groupby:
          - resource
          - project
        metadata:
          - container_format
          - disk_format
        extra_args:
          aggregation_method: max
    • ceilometer_image_size: Specifies the name of the metric to collect. The name must match the name of the metric in Prometheus.
    • unit: Specifies a string to indicate the unit in which the metric is stored after conversion. This string is for information only; it has no impact on the metric conversion or rating.
    • groupby: Specifies the name of an attribute by which the metrics are grouped on collection. You can specify as many groupby attributes as required. For example, you can group data by resource ID (resource), project ID (project), domain ID, and user ID (user).

    For more information on metric rating rules you can configure, see Defining the Rating service metrics to collect.
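To make the groupby and aggregation_method settings concrete, the following sketch groups raw samples the way the example metric definition requests, keeping the maximum value per group. It illustrates the idea only; it is not CloudKitty's implementation:

```python
# Group samples by (resource, project) and keep the max value per group,
# mirroring the groupby list and 'aggregation_method: max' shown above.
from collections import defaultdict

samples = [
    {'resource': 'img-1', 'project': 'p1', 'value': 512},
    {'resource': 'img-1', 'project': 'p1', 'value': 768},
    {'resource': 'img-2', 'project': 'p2', 'value': 1024},
]

grouped = defaultdict(list)
for sample in samples:
    key = (sample['resource'], sample['project'])
    grouped[key].append(sample['value'])

aggregated = {key: max(values) for key, values in grouped.items()}
print(aggregated)
```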

  4. Optional: If you need to convert the collected quantity from one unit to another, for example, from MiB to GiB, configure the factor and offset options. The factor and offset options are used to calculate the final collected quantity with the following formula:

    quantity = collected_quantity * factor + offset.

    For example, to convert the collected quantity from B to MiB:

    metrics:
      ceilometer_image_size:
        unit: MiB # Final unit
        ...
        factor: 1/1048576 # Dividing by 1024 * 1024
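The factor and offset formula can be checked directly. A minimal sketch of the conversion, using the factor from the example above:

```python
def convert(collected_quantity: float, factor: float = 1.0,
            offset: float = 0.0) -> float:
    """Apply the rating conversion: quantity = collected_quantity * factor + offset."""
    return collected_quantity * factor + offset

# Convert bytes to MiB with factor 1/1048576, as in the example configuration
print(convert(1_073_741_824, factor=1 / 1048576))  # 1 GiB in bytes -> 1024.0 MiB
```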
  5. To rate different aspects of the same resource type that are collected through the same metric, you can add additional rating types to a single metric. For example, to rate both the flavor that an instance uses and the CPU time that it consumes, define two units for the ceilometer_cpu metric:

      ceilometer_cpu:
        - unit: instance
          alt_name: flavor
          groupby:
            - resource
            - user
            - project
          metadata:
            - flavor_name
            - flavor_id
          mutate: NUMBOOL
          extra_args:
            aggregation_method: max
        - unit: ns
          alt_name: cpu_time
          groupby:
            - resource
            - user
            - project
          extra_args:
            aggregation_method: max
  6. Create the Secret CR for the Rating service with the custom configuration files:

    $ oc create secret generic cloudkitty-secret \
     --from-file metrics.yaml -n openstack
    Note

    You can specify the --from-file option as many times as required to pass more than one configuration file to the Secret CR for customizing the service.

  7. Verify that the Secret CR is created:

    $ oc describe secret cloudkitty-secret
  8. Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
  9. Locate the service definition for telemetry and add the customConfigsSecretName field to the cloudkitty service:

    spec:
      telemetry:
        template:
          ...
          cloudkitty:
            cloudKittyAPI:
              customConfigsSecretName: cloudkitty-secret
            cloudKittyProc:
              customConfigsSecretName: cloudkitty-secret
  10. Update the control plane:

    $ oc apply -f openstack_control_plane.yaml -n openstack

    Your custom metrics.yaml file is copied from the Secret into the /etc/cloudkitty/ directory, replacing the default configuration file.

  11. Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

    $ oc get openstackcontrolplane -n openstack
  12. Verify that the custom configuration file is being used by the service:

    $ oc rsh -c cloudkitty-proc cloudkitty-proc-0 \
     cat /etc/cloudkitty/metrics.yaml
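
The factor and offset options from step 4 can be combined in a single metric definition. The following fragment is an illustrative sketch only; the values are hypothetical, and the formula quantity = collected_quantity * factor + offset applies:

    metrics:
      ceilometer_image_size:
        unit: GiB # Final unit
        factor: 1/1073741824 # Dividing by 1024 * 1024 * 1024
        offset: 0 # Default; no baseline adjustment

With these values, an image collected as 2147483648 B is rated as 2147483648 * 1/1073741824 + 0 = 2 GiB.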

3.4. Defining the Rating service metrics to collect

The metrics that the Rating service (cloudkitty) collects are configured in the metrics.yaml file.

The following template describes how to define the metrics to collect:

metrics:
  <metric_name>:
    unit: <unit_indicator>
    alt_name: <alternative_name>
    description: <metric_description>
    groupby:
      - <groupby_attribute>
      - <groupby_attribute>
      ...
      - <groupby_attribute>
    metadata:
      - <metadata_attribute>
    factor: <unit_conversion_factor>
    offset: <unit_conversion_offset>
    mutate: <unit_mutation>
    extra_args:
      <optional_argument>: <argument_value>
  • Replace <metric_name> with the name of the metric to collect. This must match the name of the metric in Prometheus, for example, ceilometer_cpu. Use the following command to retrieve the names of the available metrics that you can configure:

    $ oc rsh openstackclient openstack metric list
  • unit: A string that indicates the unit in which the metric is stored after conversion. This string is for information only; it has no effect on the metric conversion or rating. For example, you can set the unit to GiB for the ceilometer_disk_root_size metric.
  • groupby: A list of attribute names by which the metrics are grouped on collection. For example, you can group data by ID, project ID (project), domain ID, and user ID (user).
  • metadata: A list of additional attribute names to collect for the given metric without grouping on them. For example, you can collect the type attribute for the ceilometer_disk_root_size metric. You can use the metadata attributes in rating rules, and they appear in summary reports.
  • factor: An optional factor to use when calculating the conversion rate for the unit quantity. The default factor is 1. You can specify the factor as a float, integer, or fraction.
  • offset: An optional offset to use when calculating the conversion rate for the unit quantity. The default offset is 0. You can specify the offset as a float, integer, or fraction.
  • mutate: An optional operation to apply to the unit quantity after the conversion rate is calculated. You can use one of the following valid values:

    • NONE: This is the default. The collected data is not modified.
    • CEIL: The quantity is rounded up to the closest integer.
    • FLOOR: The quantity is rounded down to the closest integer.
    • NUMBOOL: If the collected quantity equals 0, leave it at 0. Otherwise, set it to 1.
    • NOTNUMBOOL: If the collected quantity equals 0, set it to 1. Otherwise, set it to 0.
    • MAP: Map arbitrary values to new values as defined through the mutate_map option (dictionary). If the value is not found in mutate_map, set it to 0. If mutate_map is not defined or is empty, all values are set to 0. For example, the following Prometheus metric has a value of 0 when the instance is in ACTIVE state, but operators may want to rate other non-zero states:

      metrics:
        openstack_nova_server_status:
          unit: instance
          mutate: MAP
          mutate_map:
            0.0: 1.0  # ACTIVE
            11.0: 1.0 # SHUTOFF
            12.0: 1.0 # SUSPENDED
            16.0: 1.0 # PAUSED
          groupby:
            - id
          metadata:
            - flavor_id
  • alt_name: An optional string to use as the display name for the metric instead of the metric name in summary reports.
  • description: An optional field for providing details about the configured rating type. The value is of type string up to 64 kB. When configured, this option is persisted as rating metadata and it is available through the summary GET API.
  • extra_args: An optional list of arguments to apply to the rating. You can use the following valid arguments:

    • aggregation_method: Defaults to max. The aggregation method to use when retrieving measures from Prometheus. Must be one of avg, min, max, sum, count, stddev, stdvar.
    • query_function: The function to apply to an instant vector after the aggregation_method or range_function has altered the data. Must be one of abs, ceil, exp, floor, ln, log2, log10, round, sqrt.
    • query_prefix: An arbitrary prefix to add to the Prometheus query generated by the Rating service, separated by a space.
    • query_suffix: An arbitrary suffix to add to the Prometheus query generated by the Rating service, separated by a space.
    • range_function: The function to apply instead of the implicit {aggregation_method}_over_time. Must be one of changes, delta, deriv, idelta, increase, irate, rate.
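
The following example brings the options in this section together in a single metric definition. This is an illustrative sketch, not a tested configuration; adapt the metric, units, and values to your deployment:

metrics:
  ceilometer_disk_root_size:
    unit: GiB
    alt_name: root_disk
    description: Root disk size rated per instance.
    groupby:
      - resource
      - project
    metadata:
      - type
    factor: 1/1024
    mutate: CEIL
    extra_args:
      aggregation_method: max

In this sketch, the collected quantity is divided by 1024, rounded up to the closest integer by the CEIL mutation, and grouped by resource and project, while the type attribute is kept as metadata for use in rating rules and summary reports.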

Chapter 4. RHOSO network observability

In Red Hat OpenStack Services on OpenShift (RHOSO) environments, you can use Red Hat OpenShift Container Platform (RHOCP) observability features and the RHOSO Telemetry service to access metrics for many aspects of your RHOSO deployment.

The Telemetry service includes Ceilometer and Prometheus.

Prometheus collects metrics from various components. It scrapes exporter endpoints on data plane nodes and control plane pods. You can configure Prometheus to drive a variety of functions including data logging, metrics viewing and queries on the RHOCP Observability dashboard, alarms, and autoscaling.

The RHOSO OpenStack network exporter (openstack-network-exporter) is a Prometheus exporter that runs on data plane nodes and control plane pods and collects OVN, OVS, and OVS-DPDK networking metrics to be scraped by Prometheus.

The RHOSO OpenStack network exporter collects a variety of metrics by default. You can access them through the RHOCP console dashboards. You can also scrape the network exporter endpoints with tools such as curl.

In addition to the default metrics collection, you can enable and disable the collection of extended metrics sets.

4.1. Available network metrics

The network exporter exposes OVN Controller metrics, OVS metrics, and OVS-DPDK PMD metrics. For example:

OVN southbound database server metrics examples
  • ovn_raft_cluster_server_vote: A metric with a constant value of 1, labeled by database name, cluster uuid, server uuid, and vote.
  • ovn_raft_cluster_inbound_connections_total: Total number of inbound connections to the server, labeled by database name, cluster uuid, and server uuid.
  • ovn_raft_cluster_leader: Value 1.0 if the server is the cluster leader for the given database, or 0.0 if it is not, labeled by database name, cluster uuid, and server uuid.
OVN northd metrics examples
  • ovn_northd_pstream_open_total: Number of times that passive connections were opened for the remote peer to connect.
  • ovn_northd_txn_aborted_total: Number of times the OVSDB transaction has been aborted.
OVN Controller metrics examples
  • ovnc_lflow_run: The number of times ovn-controller translated the Logical_Flow table in the OVN SB database into OpenFlow flows in the interval.
  • ovn_txn_error: Number of times the OVSDB transaction has errored out.
OVN logical router and logical router port metrics examples
  • ovnc_router_port_traffic_pkts: Number of packets transmitted and received by a logical router port, labeled by the logical datapath number and the logical port number.
  • ovnc_router_port_traffic_bytes: Number of bytes transmitted and received by a logical router port, labeled by the logical datapath number and the logical port number.
OVS metrics examples
  • ovs_bridge_port_count: Number of ports in a bridge, labeled by bridge name.
  • ovs_bridge_flow_count: Number of OpenFlow rules configured on a bridge, labeled by bridge name.
OVS-DPDK PMD metrics examples
  • ovs_pmd_tx_packets: Rate of packets transmitted by the PMD during the interval.
  • ovs_pmd_rx_packets: Rate of packets received by the PMD during the interval.

You can view all of the available metrics and descriptions by running curl commands on a Compute node or control plane pod. For more information, see Accessing live network metrics from the command line with curl.

4.2. Accessing network metrics snapshots with the RHOCP console

In Red Hat OpenStack Services on OpenShift (RHOSO) environments, you can view changes to a subset of the available networking metrics in charts on the RHOCP Console.

Procedure

  1. On the RHOCP Console, go to Observe > Dashboards.
  2. Choose Dashboard > Openstack/Openstack Network Exporter.
  3. Optional: Select a Compute node from the Instance menu.
  4. Optional: Select a Time Range and Refresh Interval.

4.3. Accessing live network metrics from the command line with curl

The OpenStack network exporter exposes metrics at endpoints. You can log in to a Compute node and use the curl command to retrieve metrics from those endpoints.

Prerequisites

  • The administrator has created a project for you and has provided you with a clouds.yaml file for you to access the cloud.
  • The python-openstackclient package is installed on your workstation.

    $ dnf list installed python-openstackclient
  • You can log into Compute nodes.

Procedure

  1. Log in to a Compute node.
  2. Example: To view all available metrics, run the command:

    curl -k https://localhost:9105/metrics | more
  3. Example: To view link state metrics, run the command:

    curl -k -s  https://localhost:9105/metrics | grep ovs_interface_link_state
    Sample output
    # HELP ovs_interface_link_state The link state of the interface. Possible values are: up(1), down(0) or unknown(-1).
    # TYPE ovs_interface_link_state gauge
    ovs_interface_link_state{bridge="br-ex",interface="br-ex",port="br-ex",type="internal"} 1
    ovs_interface_link_state{bridge="br-ex",interface="eth1",port="eth1",type="system"} 1
    ovs_interface_link_state{bridge="br-ex",interface="patch-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3-to-br-int",port="patch-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3-to-br-int",type="patch"} 1
    ovs_interface_link_state{bridge="br-ex",interface="vlan20",port="vlan20",type="internal"} 1
    ovs_interface_link_state{bridge="br-ex",interface="vlan21",port="vlan21",type="internal"} 1
    ovs_interface_link_state{bridge="br-ex",interface="vlan22",port="vlan22",type="internal"} 1
    ovs_interface_link_state{bridge="br-ex",interface="vlan23",port="vlan23",type="internal"} 1
    ovs_interface_link_state{bridge="br-int",interface="br-int",port="br-int",type="internal"} 0
    ovs_interface_link_state{bridge="br-int",interface="ovn-3cc1e2-0",port="ovn-3cc1e2-0",type="geneve"} 1
    ovs_interface_link_state{bridge="br-int",interface="ovn-737fe9-0",port="ovn-737fe9-0",type="geneve"} 1
    ovs_interface_link_state{bridge="br-int",interface="ovn-7488f5-0",port="ovn-7488f5-0",type="geneve"} 1
    ovs_interface_link_state{bridge="br-int",interface="ovn-87e045-0",port="ovn-87e045-0",type="geneve"} 1
    ovs_interface_link_state{bridge="br-int",interface="ovn-acb76a-0",port="ovn-acb76a-0",type="geneve"} 1
    ovs_interface_link_state{bridge="br-int",interface="patch-br-int-to-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3",port="patch-br-int-to-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3",type="patch"} 1
    ovs_interface_link_state{bridge="br-int",interface="tap32b26818-e0",port="tap32b26818-e0",type="system"} 1
    ovs_interface_link_state{bridge="br-int",interface="tapba83f025-aa",port="tapba83f025-aa",type="system"} 1

4.4. Enabling and disabling extended metrics

Enable extended poll mode driver (PMD) metrics to debug your RHOSO networks, and then disable the metrics when you are finished.

Note

Extended PMD metrics collection is disabled by default because PMD metrics collection can increase CPU load and impact performance.

Prerequisites

  • You can log in to Compute nodes as root.

Procedure

  1. Log in to a Compute node.
  2. Reset and enable extended PMD metrics:

    # ovs-appctl dpif-netdev/pmd-stats-clear
    # ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true
  3. Disable extended PMD metrics:

    # ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=false

Chapter 5. Monitoring RHOSO networks with alerts

You can create alerts to monitor your RHOSO networks based on any of the OVS, OVN, and OVS-DPDK metrics available through the RHOSO OpenStack network exporter (openstack-network-exporter).

For example, you can create an alert that activates when an interface resets (flaps) more than three times in five minutes. This alert would monitor the ovs_interface_link_resets metric.

In your RHOSO environment, you use standard RHOSO, OpenShift, and Prometheus monitoring toolkit management tools to create, manage, and consume alerts.

When you deploy a RHOSO environment as instructed in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift, the deployment is already enabled with the ability to create alerts.

5.1. Consuming RHOSO network alerts

You can receive alerts about your RHOSO environment, including OVS and OVN networking alerts, in several ways.

5.2. RHOSO alert structure and management

RHOSO network alerts are managed by the open source monitoring and alerting toolkit Prometheus. Prometheus includes a time-series database, methods for fetching metrics from endpoints in your deployment, and the Alertmanager feature for managing alerts.

RHOSO uses the Cluster Observability Operator (COO) to deploy and manage these tools. The COO is deployed automatically in your environment by the OpenStack Operator.

You create alerts within custom resources (CRs) that define Prometheus rule objects. Within a Prometheus rule object, you create groups of rules. Each group contains one or more rules, and each rule contains one or more alerts. Each alert contains a PromQL expression, which uses the Prometheus query language to define the conditions that fire the alert.

Alert states include pending, firing, and inactive.

5.2.1. Alert rule structure example in a CR file

The following example shows the general structure of a PrometheusRule object definition, including two groups.

PrometheusRule object
    groups:
      - name: group1
        rules:
          - alert: AlertName1
            expr: ... <PromQL expression>
          - alert: AlertName2
            expr: ...
      - name: group2
        rules:
          - alert: ...

5.3. PromQL expressions use metrics to define alert conditions

The functional center of an alert rule is a Prometheus Query Language (PromQL) expression. You can use PromQL to select, aggregate, and analyze environmental data in real time in the Prometheus monitoring system.

You can create a PromQL expression using any available metrics, including the OVS, OVN, and OVS-DPDK network metrics exposed by the RHOSO OpenStack network exporter (openstack-network-exporter). In the following example, the PromQL expression in the expr field defines an alert that fires in response to excessive interface resets.

rules:
  - alert: OVSInterfaceLinkFlappingWarning
    expr: |
      (
        increase(ovs_interface_link_resets[5m]) > 3
      )
    for: 1m
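
You can build expressions on any exported metric in the same way. The following rule is a hypothetical sketch that uses the ovs_bridge_flow_count metric; the alert name, threshold, and duration are illustrative only and must be tuned to the normal flow counts in your environment:

rules:
  - alert: OVSBridgeFlowCountHigh
    expr: |
      ovs_bridge_flow_count > 10000
    for: 10m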

For information on the OVS and OVN network metrics exposed by the openstack-network-exporter, see RHOSO network observability.

5.4. Silencing RHOSO alerts

You can silence an alert for a specific time period. For example, you can silence an alert during a scheduled maintenance period, or to stop notifications when you are troubleshooting a known issue. The silenced alert fires, but Alertmanager does not send notifications.

For more information, see Managing alerts in Red Hat OpenShift Container Platform Monitoring.

5.5. Creating RHOSO network alerts

To set up alerts that notify you of important operational conditions, you create alert groups, alert rules, alerts, and alert expressions in a PrometheusRule custom resource (CR). The CR uses apiVersion: monitoring.rhobs/v1 and kind: PrometheusRule.

Prerequisites

  • The Red Hat OpenStack Services on OpenShift (RHOSO) environment is deployed on a Red Hat OpenShift Container Platform (RHOCP) cluster. For more information, see Deploying Red Hat OpenStack Services on OpenShift.
  • You are logged in to a workstation that has access to the RHOCP cluster as a user with cluster-admin privileges.
  • The Telemetry service is enabled and configured on the control plane. For more information, see the telemetry service configuration example under Add the following service configurations in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift.

Procedure

  1. Create a file on your workstation to define the PrometheusRule CR. For example, openstack-observability-services-alerts.yaml.

    apiVersion: monitoring.rhobs/v1
    kind: PrometheusRule
    metadata:
      labels:
        service: metricStorage
      name: openstack-observability-services-alerts
      namespace: openstack
    spec:
      groups:
        - name: openstack-observability.ovs.interface
          rules:
          - alert: OVSInterfaceLinkFlappingWarning
            expr: |
              (
                increase(ovs_interface_link_resets[5m]) > 3
              )
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "OVS interface link flapping (warning)"
              description: |
                Interface {{ $labels.interface }} on {{ $labels.fqdn }} has more than 3 link resets
                in the last 5 minutes. Bridge: {{ $labels.bridge }}
  2. Create the PrometheusRule object:

    $ oc create -f openstack-observability-services-alerts.yaml

    The Cluster Observability Operator (COO) loads the rule into Prometheus.

  3. Verify that the COO loaded the rules into Prometheus:

    $ oc get prometheusrules.monitoring.rhobs -n openstack
    Note

    You must pass the entire CRD name, prometheusrules.monitoring.rhobs, because a different PrometheusRule CRD, prometheusrules.monitoring.coreos.com, provides the rules for the RHOCP Monitoring API.

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.