Configuring observability
Configuring the Telemetry service (ceilometer, prometheus) to manage observability metrics for a Red Hat OpenStack Services on OpenShift deployment
Abstract
Providing feedback on Red Hat documentation
We appreciate your feedback. Tell us how we can improve the documentation.
To provide documentation feedback for Red Hat OpenStack Services on OpenShift (RHOSO), create a Jira issue in the OSPRH Jira project.
Procedure
- Log in to the Red Hat Atlassian Jira.
- Click the Create issue link to open a Create Issue page.
- Complete the Summary and Description fields. In the Description field, include the documentation URL, chapter or section number, and a detailed description of the issue.
- Click Create.
- Review the details of the bug you created.
Chapter 1. Configuring observability
Use observability to gain insights into the metrics, logs, and alerts from your Red Hat OpenStack Services on OpenShift (RHOSO) deployment. You can configure observability by editing the default Telemetry service (ceilometer, prometheus) in your OpenStackControlPlane custom resource (CR) file.
1.1. RHOSO observability architecture
The observability architecture in Red Hat OpenStack Services on OpenShift (RHOSO) is composed of services within Red Hat OpenShift Container Platform (RHOCP), as well as services on your Compute nodes that provide metrics, logs, and alerts. You can use Red Hat OpenShift Observability for insight into your RHOSO environment and for collecting, storing, and searching through logs.
The observability platform available with RHOSO does not guarantee the delivery of metrics. Metrics are exposed for scraping but they are not cached. If data is dropped there is no ability to retrospectively fill in gaps in the data, which might result in incomplete metrics.
1.2. Configuring observability on the control plane
The Telemetry service (ceilometer, prometheus) is enabled by default in a Red Hat OpenStack Services on OpenShift (RHOSO) deployment. You can configure observability by editing the Telemetry service in your OpenStackControlPlane custom resource (CR) file.
Prerequisites
-
The control plane includes initial configuration of the Telemetry service. For more information, see the telemetry configuration in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift.
Procedure
-
Open your OpenStackControlPlane CR file, openstack_control_plane.yaml, on your workstation. Configure the Telemetry service, telemetry, as required for your environment:
telemetry:
  enabled: true
  template:
    metricStorage:
      enabled: true
      dashboardsEnabled: true
      dataplaneNetwork: ctlplane
      networkAttachments:
        - ctlplane
      monitoringStack:
        alertingEnabled: true
        scrapeInterval: 30s
        storage:
          strategy: persistent
          retention: 24h
          persistent:
            pvcStorageRequest: 20G
    autoscaling:
      enabled: false
      aodh:
        notificationsBus:
          cluster: rabbitmq-notification
        databaseAccount: aodh
        databaseInstance: openstack
        secret: osp-secret
      heatInstance: heat
    ceilometer:
      enabled: true
      notificationsBus:
        cluster: rabbitmq-notification
      secret: osp-secret
    logging:
      enabled: false
      annotations:
        metallb.universe.tf/address-pool: internalapi
        metallb.universe.tf/allow-shared-ip: internalapi
        metallb.universe.tf/loadBalancerIPs: 172.17.0.80
    cloudkitty:
      enabled: false
      messagingBus:
        cluster: rabbitmq
      s3StorageConfig:
        schemas:
          - effectiveDate: "2024-11-18"
            version: v13
        secret:
          name: cloudkitty-loki-s3
          type: s3
- metricStorage.monitoringStack.scrapeInterval: Specifies the interval at which new metrics are gathered. Changing this interval can affect performance.
- metricStorage.monitoringStack.storage.retention: Specifies the length of time that telemetry metrics are stored. The duration affects the amount of storage required.
- storage.persistent.pvcStorageRequest: Specifies the amount of storage to allocate to the Prometheus time series database.
- autoscaling.enabled: Set to true to enable autoscaling. The autoscaling field must be present even when autoscaling is disabled. For more information about autoscaling, see Autoscaling for Instances.
- ceilometer.enabled: Set to false to disable the ceilometer service. If you do not disable ceilometer, a Prometheus metrics exporter is created and exposed from inside the cluster at the following URL: http://ceilometer-internal.openstack.svc:3000/metrics
- logging.enabled: Set to true to enable observability logging. For more information about configuring observability logging, see Enabling RHOSO observability logging.
- cloudkitty.enabled: Set to true to enable the Rating service (cloudkitty). For more information about configuring chargeback and rating capabilities, see Enabling the Rating service on the control plane.
- aodh.notificationsBus.cluster and ceilometer.notificationsBus.cluster: Set to a dedicated RabbitMQ cluster for notifications, as in this example, or to a combined RabbitMQ cluster for both RPC and notifications, as in the default RHOSO deployment that uses the combined rabbitmq cluster.
-
Update the control plane:
$ oc apply -f openstack_control_plane.yaml -n openstack
Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:
$ oc get openstackcontrolplane -n openstack
NAME                      STATUS    MESSAGE
openstack-control-plane   Unknown   Setup started
The OpenStackControlPlane resources are created when the status is "Setup complete".
Tip: Append the -w option to the end of the get command to track deployment progress.
Optional: Confirm that the control plane is deployed by reviewing the pods in the openstack namespace for each of your cells:
$ oc get pods -n openstack
The control plane is deployed when all the pods are either completed or running.
Verification
Access the remote shell for the OpenStackClient pod from your workstation:
$ oc rsh -n openstack openstackclient
Confirm that you can query prometheus and that the scrape endpoints are active:
$ openstack metric query up --disable-rbac -c container -c instance -c value
Example output:
+-----------------+------------------------+-------+
| container       | instance               | value |
+-----------------+------------------------+-------+
| alertmanager    | 10.217.1.112:9093      | 1     |
| prometheus      | 10.217.1.63:9090       | 0     |
| proxy-httpd     | 10.217.1.52:3000       | 1     |
|                 | 192.168.122.100:9100   | 1     |
|                 | 192.168.122.101:9100   | 1     |
+-----------------+------------------------+-------+
Note: Each entry in the value field should be "1" when there are active workloads scheduled on the cluster, except for the prometheus container. The prometheus container reports a value of "0" due to TLS, which is enabled by default.
-
You can find the openstack-telemetry-operator dashboards by clicking Observe and then Dashboards in the RHOCP console. For more information about RHOCP dashboards, see Reviewing monitoring dashboards as a cluster administrator in the RHOCP Monitoring Guide.
1.3. Customizing Telemetry configuration files
You can customize Telemetry services by creating custom configuration files for your deployment. For instance, you can customize the Ceilometer service by modifying the polling.yaml file.
Prerequisites
The pod for the service that you want to customize exists in your deployment and you are in the openstack namespace:
$ oc get pods -A -o custom-columns="NAMESPACE:.metadata.namespace,POD:.metadata.name,CONTAINERS:.spec.containers[*].name" | grep <container>
Procedure
Create the custom configuration file for the service you want to customize. If the service already has a configuration file, then you must give your custom file the same name so that your custom file replaces the existing configuration file. For example, to customize the configuration of the Ceilometer service, you can make a copy of polling.yaml for editing:
$ oc rsh -n openstack -c ceilometer-central-agent ceilometer-0 cat /etc/ceilometer/polling.yaml > polling.yaml
Add or update the configuration for the service as required. For example, to customize how often samples of volume and image size should be polled, create a file named polling.yaml and add the following configuration:
sources:
  - name: pollsters
    interval: 300
    meters:
      - volume.size
      - image.size
For more information on how to configure the Ceilometer service, see Polling properties.
Create the Secret CR for the service with the custom configuration file:
$ oc create secret generic <secret_name> \
  --from-file <custom_config.yaml> -n openstack
Note: You can specify the --from-file option as many times as required to pass more than one configuration file to the Secret CR for customizing the service.
Verify that the Secret CR is created:
$ oc describe secret <secret_name>
-
Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
Locate the service definition for telemetry and add the customConfigsSecretName field to the Telemetry service that you want to customize. The following example shows where to place the customConfigsSecretName field for each service that you can customize:
spec:
  telemetry:
    template:
      autoscaling:
        aodh:
          customConfigsSecretName: <secret_name>
      ceilometer:
        customConfigsSecretName: <secret_name>
      cloudkitty:
        cloudKittyAPI:
          customConfigsSecretName: <secret_name>
        cloudKittyProc:
          customConfigsSecretName: <secret_name>
Update the control plane:
$ oc apply -f openstack_control_plane.yaml -n openstack
Your custom file is copied from the Secret into the /etc/<service>/ folder. If a file with the same name already exists in the folder, it is replaced with the custom configuration file.
Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:
$ oc get openstackcontrolplane -n openstack
Verify that the custom configuration file is being used by the service:
$ oc rsh -c <service_container> <service_pod> \
  cat /etc/<service>/<custom_config>.yaml
Example:
$ oc rsh -c ceilometer-central-agent ceilometer-0 cat /etc/ceilometer/polling.yaml
1.3.1. Polling properties
You can configure polling rules to poll for data not provided by service events and notifications, such as instance resource usage. Use the polling.yaml file to specify the polling plugins (pollsters) to enable, the interval at which they should be polled, and the meters to poll for each source.
The following template describes how to define a source to poll:
sources:
  - name: <source_name>
    interval: <sample_generation_interval>
    meters:
      - <meter_filter>
      - <meter_filter>
      ...
      - <meter_filter>
- interval: The interval in seconds between sample generation of the specified meters.
- meters: A list of meters to gather samples for. Each filter must match the meter name of the polling plugin.
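A meter filter does not have to be an exact meter name. Ceilometer also accepts wildcard filters in the meters list. The following sketch is illustrative only; confirm the wildcard behavior and the available meter names for your Ceilometer version before using it:

```yaml
# Illustrative polling source (assumed meter names; verify with your deployment).
# Polls every disk-related meter every 600 seconds from a single source.
sources:
  - name: disk_pollsters
    interval: 600
    meters:
      - disk.*
```

As with the earlier polling.yaml example, pass the file to the service through a Secret CR referenced by the customConfigsSecretName field.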
1.4. Enabling Telemetry power monitoring on the data plane
This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.
You can enable power monitoring on the data plane to collect power consumption metrics by adding the telemetry-power-monitoring service to each OpenStackDataPlaneNodeSet custom resource (CR) defined for the data plane.
Procedure
-
Open the OpenStackDataPlaneNodeSet CR definition file for the node set you want to update, for example, openstack_data_plane.yaml.
Add the services field and include all the required services, including the default services, then add telemetry-power-monitoring after telemetry:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: openstack-data-plane
  namespace: openstack
spec:
  tlsEnabled: true
  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
  services:
    - redhat
    - bootstrap
    - download-cache
    - configure-network
    - validate-network
    - install-os
    - configure-os
    - ssh-known-hosts
    - run-os
    - reboot-os
    - install-certs
    - ovn
    - neutron-metadata
    - libvirt
    - nova
    - telemetry
    - telemetry-power-monitoring
For more information about deploying data plane services, see Deploying the data plane in the Deploying Red Hat OpenStack Services on OpenShift guide.
-
Save the OpenStackDataPlaneNodeSet CR definition file.
Apply the updated OpenStackDataPlaneNodeSet CR configuration:
$ oc apply -f openstack_data_plane.yaml
Verify that the data plane resource has been updated by confirming that the status is SetupReady:
$ oc wait openstackdataplanenodeset openstack-data-plane --for condition=SetupReady --timeout=10m
When the status is SetupReady, the command returns a condition met message; otherwise, it returns a timeout error.
For information about the data plane conditions and states, see Data plane conditions and states in Deploying Red Hat OpenStack Services on OpenShift.
Create a file on your workstation to define the OpenStackDataPlaneDeployment CR:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: <node_set_deployment_name>
-
Replace <node_set_deployment_name> with the name of the OpenStackDataPlaneDeployment CR. The name must be unique, must consist of lowercase alphanumeric characters, - (hyphen) or . (period), and must start and end with an alphanumeric character.
Tip: Give the definition file and the OpenStackDataPlaneDeployment CR unique and descriptive names that indicate the purpose of the modified node set.
-
Add the OpenStackDataPlaneNodeSet CR that you modified:
spec:
  nodeSets:
    - <nodeSet_name>
-
Save the OpenStackDataPlaneDeployment CR deployment file.
Deploy the modified OpenStackDataPlaneNodeSet CR:
$ oc create -f openstack_data_plane_deploy.yaml -n openstack
You can view the Ansible logs while the deployment executes:
$ oc get pod -l app=openstackansibleee -w
$ oc logs -l app=openstackansibleee -f --max-log-requests 10
If the oc logs command returns an error similar to the following, increase the --max-log-requests value:
error: you are attempting to follow 19 log streams, but maximum allowed concurrency is 10, use --max-log-requests to increase the limit
Verify that the modified OpenStackDataPlaneNodeSet CR is deployed:
$ oc get openstackdataplanedeployment -n openstack
NAME                   STATUS   MESSAGE
openstack-data-plane   True     Setup Complete

$ oc get openstackdataplanenodeset -n openstack
NAME                   STATUS   MESSAGE
openstack-data-plane   True     NodeSet Ready
For information about the meaning of the returned status, see Data plane conditions and states in the Deploying Red Hat OpenStack Services on OpenShift guide.
If the status indicates that the data plane has not been deployed, then troubleshoot the deployment. For information, see Troubleshooting the data plane creation and deployment in the Deploying Red Hat OpenStack Services on OpenShift guide.
Verify that the telemetry-power-monitoring service is deployed by checking for ceilometer_agent_ipmi and kepler containers on the data plane nodes:
$ podman ps | grep -i -e ceilometer_agent_ipmi -e kepler
Chapter 2. Enabling RHOSO observability logging
Enable and configure Red Hat OpenStack Services on OpenShift (RHOSO) observability logging to collect, store, and access logs from your RHOSO environment. When observability logging is enabled on the control plane, you can enable the RHOSO observability logging service on the data plane.
2.1. Prerequisites
-
The Loki Operator is installed and started by creating a LokiStack instance. For more information, see Installing the Loki Operator by using the CLI in the Red Hat OpenShift Logging Installing logging guide.
- The Red Hat OpenShift Logging Operator is installed. For more information about installing the Red Hat OpenShift Logging Operator, see Installing Red Hat OpenShift Logging Operator by using the CLI in the Red Hat OpenShift Logging Installing Logging guide.
-
An instance of the Red Hat OpenShift Logging Operator is started for the control plane by creating a ClusterLogForwarder that is configured for application logs retrieval. For an example of how to configure the ClusterLogForwarder instance for application logs retrieval, see the Application logs retrieval configuration example.
-
An instance of the Red Hat OpenShift Logging Operator is started for the data plane by creating a ClusterLogForwarder that is configured with a Syslog receiver. For an example of how to configure the ClusterLogForwarder instance with a Syslog receiver, see the Syslog receiver configuration example.
- The logging plugin is installed to enable the logging tab in the observability dashboard. For more information, see Installing the logging UI plugin by using the CLI.
2.2. Enabling RHOSO observability logging on the control plane
To enable and configure observability logging on the control plane, you edit the Telemetry service in your OpenStackControlPlane custom resource (CR) file.
Procedure
-
Open the OpenStackControlPlane CR definition file, openstack_control_plane.yaml, on your workstation.
Update the telemetry section based on the needs of your environment:
telemetry:
  enabled: true
  template:
    ...
    logging:
      enabled: true
      annotations:
        metallb.universe.tf/address-pool: internalapi
        metallb.universe.tf/allow-shared-ip: internalapi
        metallb.universe.tf/loadBalancerIPs: 172.17.0.80
- logging.enabled: Set to true to enable observability logging.
- logging.annotations.metallb.universe.tf/address-pool: Set to the RHOSO network that you want to use to transport the logs from the Compute nodes to the control plane.
- logging.annotations.metallb.universe.tf/loadBalancerIPs: Set to the IP address that rsyslog sends messages to. Ensure that the IP address is reachable from the Compute node. The default IP address is the default VIP for internalapi, which is 172.17.0.80.
-
Update the control plane:
$ oc apply -f openstack_control_plane.yaml -n openstack
Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:
$ oc get openstackcontrolplane -n openstack
NAME                      STATUS    MESSAGE
openstack-control-plane   Unknown   Setup started
The OpenStackControlPlane resources are created when the status is "Setup complete".
Tip: Append the -w option to the end of the get command to track deployment progress.
Verification
- Open the logging pane in the OpenShift Console.
- Select Observe → Logs.
- From the drop-down menu, choose Application.
- Verify that pod logs from the RHOSO control plane services are present.
2.3. Enabling RHOSO observability logging on the data plane
You can enable Red Hat OpenStack Services on OpenShift (RHOSO) observability logging on the data plane by adding the OpenStackDataPlaneService logging service to the services list of each OpenStackDataPlaneNodeSet custom resource (CR) defined for the data plane.
Prerequisites
- RHOSO observability logging is enabled on the control plane.
Procedure
-
Open the OpenStackDataPlaneNodeSet CR definition file for the node set you want to update, for example, openstack_data_plane.yaml.
Add the services field and include all the required services, including the default services, then add logging after telemetry:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: openstack-data-plane
  namespace: openstack
spec:
  tlsEnabled: true
  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
  services:
    - redhat
    - bootstrap
    - download-cache
    - configure-network
    - validate-network
    - install-os
    - configure-os
    - ssh-known-hosts
    - run-os
    - reboot-os
    - install-certs
    - ovn
    - neutron-metadata
    - libvirt
    - nova
    - telemetry
    - logging
-
Save the OpenStackDataPlaneNodeSet CR definition file.
Apply the updated OpenStackDataPlaneNodeSet CR configuration:
$ oc apply -f openstack_data_plane.yaml
Verify that the data plane resource has been updated by confirming that the status is SetupReady:
$ oc wait openstackdataplanenodeset openstack-data-plane --for condition=SetupReady --timeout=10m
When the status is SetupReady, the command returns a condition met message; otherwise, it returns a timeout error.
For information about the data plane conditions and states, see Data plane conditions and states in Deploying Red Hat OpenStack Services on OpenShift.
Create a file on your workstation to define the OpenStackDataPlaneDeployment CR:
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: <node_set_deployment_name>
-
Replace <node_set_deployment_name> with the name of the OpenStackDataPlaneDeployment CR. The name must be unique, must consist of lowercase alphanumeric characters, - (hyphen) or . (period), and must start and end with an alphanumeric character.
Tip: Give the definition file and the OpenStackDataPlaneDeployment CR unique and descriptive names that indicate the purpose of the modified node set.
-
Add the OpenStackDataPlaneNodeSet CR that you modified:
spec:
  nodeSets:
    - <nodeSet_name>
-
Save the OpenStackDataPlaneDeployment CR deployment file.
Deploy the modified OpenStackDataPlaneNodeSet CR:
$ oc create -f openstack_data_plane_deploy.yaml -n openstack
You can view the Ansible logs while the deployment executes:
$ oc get pod -l app=openstackansibleee -w
$ oc logs -l app=openstackansibleee -f --max-log-requests 10
If the oc logs command returns an error similar to the following, increase the --max-log-requests value:
error: you are attempting to follow 19 log streams, but maximum allowed concurrency is 10, use --max-log-requests to increase the limit
Verify that the modified OpenStackDataPlaneNodeSet CR is deployed:
$ oc get openstackdataplanedeployment -n openstack
NAME                   STATUS   MESSAGE
openstack-data-plane   True     Setup Complete

$ oc get openstackdataplanenodeset -n openstack
NAME                   STATUS   MESSAGE
openstack-data-plane   True     NodeSet Ready
For information about the meaning of the returned status, see Data plane conditions and states in the Deploying Red Hat OpenStack Services on OpenShift guide.
If the status indicates that the data plane has not been deployed, then troubleshoot the deployment. For information, see Troubleshooting the data plane creation and deployment in the Deploying Red Hat OpenStack Services on OpenShift guide.
If you added a new node to the node set, then map the node to the Compute cell it is connected to:
$ oc rsh nova-cell0-conductor-0 nova-manage cell_v2 discover_hosts --verbose
If you did not create additional cells, this command maps the Compute nodes to cell1.
Access the remote shell for the openstackclient pod and verify that the deployed Compute nodes are visible on the control plane:
$ oc rsh -n openstack openstackclient
$ openstack hypervisor list
Verification
- Open the logging pane in the OpenShift Console.
-
Click Observe and then click Logs.
-
From the drop-down menu, choose Infrastructure.
-
Verify that Journald logs from the Compute nodes are present.
Chapter 3. Enabling cloud rating in a RHOSO environment
You can use the Rating service (cloudkitty) to monitor, collect, rate, and store metrics data about service and resource consumption in a RHOSO cloud.
The content for this feature is available in this release as a Documentation Preview, and therefore is not fully verified by Red Hat. Use it only for testing, and do not use in a production environment.
3.1. Prerequisites
- The Loki Operator is installed on the Red Hat OpenShift Container Platform (RHOCP) cluster. For information, see Installing the Loki Operator by using the CLI in the Red Hat OpenShift Logging Installing logging guide.
The Telemetry Operator uses a Keystone fetcher to retrieve a list of projects to rate and a Prometheus collector to collect data from a Prometheus source for a given project and metric. The Telemetry Operator does not support using a Gnocchi collector or any other type of fetcher.
The Telemetry Operator does not support the Rating service reprocessing API.
3.2. Enabling the Rating service on the control plane
You can enable the Rating service (cloudkitty) in the Telemetry Operator to provide chargeback and rating capabilities to RHOSO clouds.
Procedure
Create a Secret CR on your workstation to connect Loki to the AWS S3 object storage you use for the Loki log store:
apiVersion: v1
kind: Secret
metadata:
  name: cloudkitty-loki-s3
  namespace: openstack
stringData:
  access_key_id: <your_key_id>
  access_key_secret: <your_key_secret>
  bucketnames: <your_bucket_name>
  endpoint: <your_s3_endpoint>
Create the Secret CR in the cluster:
$ oc create -f loki_s3_secret.yaml -n openstack
Verify that the Secret CR is created:
$ oc describe secret cloudkitty-loki-s3 -n openstack
-
Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
Locate the service definition for telemetry and check that Prometheus is enabled in your cluster by ensuring that metricStorage is enabled:
telemetry:
  enabled: true
  template:
    metricStorage:
      enabled: true
For more information about configuring metricStorage, see Configuring observability.
Check how long metrics data is retained, and ensure that it is not stored for more than 7 days:
template:
  metricStorage:
    enabled: true
    ...
    monitoringStack:
      ...
      storage:
        strategy: persistent
        retention: 24h
Add the following cloudkitty configuration to enable the Rating service:
telemetry:
  enabled: true
  template:
    ...
    cloudkitty:
      enabled: true
      s3StorageConfig:
        schemas:
          - effectiveDate: "2024-11-18"
            version: v13
        secret:
          name: cloudkitty-loki-s3
          type: s3
- Ensure that the Loki Operator log store is sized for your environment. For more information, see Loki deployment sizing.
Update the control plane:
$ oc apply -f openstack_control_plane.yaml -n openstack
Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:
$ oc get openstackcontrolplane -n openstack
NAME                      STATUS    MESSAGE
openstack-control-plane   Unknown   Setup started
The OpenStackControlPlane resources are created when the status is "Setup complete".
Tip: Append the -w option to the end of the get command to track deployment progress.
Confirm that the control plane is deployed by reviewing the pods in the openstack namespace:
$ oc get pods -n openstack
The control plane is deployed when all the pods are either completed or running.
Verification
Access the remote shell for the openstackclient pod and verify that the Rating service is enabled:
$ oc rsh -n openstack openstackclient
$ openstack rating module list
+-----------+---------+----------+
| Module    | Enabled | Priority |
+-----------+---------+----------+
| hashmap   | False   | 1        |
| noop      | True    | 1        |
| pyscripts | False   | 1        |
+-----------+---------+----------+
Verify that you can enable, disable and change the priority of each module:
Enable the hashmap rating module:
$ openstack rating module enable hashmap
+---------+---------+----------+
| Module  | Enabled | Priority |
+---------+---------+----------+
| hashmap | True    | 1        |
+---------+---------+----------+
Disable the pyscripts rating module:
$ openstack rating module disable pyscripts
+-----------+---------+----------+
| Module    | Enabled | Priority |
+-----------+---------+----------+
| pyscripts | False   | 1        |
+-----------+---------+----------+
Set the hashmap rating module priority to 100:
$ openstack rating module set priority hashmap 100
+---------+---------+----------+
| Module  | Enabled | Priority |
+---------+---------+----------+
| hashmap | True    | 100      |
+---------+---------+----------+
3.3. Configuring metrics for rating rules
Rating rules assign a value to the consumption of computing resources. Use a custom metrics.yaml file to define the specific metrics that the Rating service (cloudkitty) processes for your deployment.
Procedure
Copy the default metrics.yaml configuration file to define the rating metrics to collect for your deployment:
$ oc rsh cloudkitty-proc-0 cat /etc/cloudkitty/metrics.yaml > metrics.yaml
Note: You must name your custom file metrics.yaml so that your custom file replaces the existing configuration file.
Retrieve a list of the names of the available metrics that you can configure:
$ oc rsh openstackclient openstack metric list
Define how each metric is collected. The following example defines how to collect the ceilometer_image_size metric:
metrics:
  ceilometer_image_size:
    unit: MiB
    groupby:
      - resource
      - project
    metadata:
      - container_format
      - disk_format
    extra_args:
      aggregation_method: max
- ceilometer_image_size: Specifies the name of the metric to collect. The name must match the name of the metric in Prometheus.
- unit: Specifies a string to indicate the unit in which the metric is stored after conversion. This string is for information only; it has no impact on the metric conversion or rating.
- groupby: Specifies the name of an attribute by which the metrics should be grouped on collection. You can specify as many groupby attributes as required. For example, you can group data by ID, project ID (project), domain ID, and user ID (user).
For more information on metric rating rules you can configure, see Defining the Rating service metrics to collect.
-
Optional: If you need to convert the collected quantity from one unit to another, for example, from MiB to GiB, configure the factor and offset options. The factor and offset options are used to calculate the final collected quantity with the following formula: quantity = collected_quantity * factor + offset.
For example, to convert the collected quantity from B to MiB:
metrics:
  ceilometer_image_size:
    unit: MiB  # Final unit
    ...
    factor: 1/1048576  # Dividing by 1024 * 1024
To rate different aspects of the same resource type that is collected through the same metric, you can add additional rating types to a single metric. For example, to rate the use of both the flavor and the used CPU time by an instance, define two units for the ceilometer_cpu metric:
ceilometer_cpu:
  - unit: instance
    alt_name: flavor
    groupby:
      - resource
      - user
      - project
    metadata:
      - flavor_name
      - flavor_id
    mutate: NUMBOOL
    extra_args:
      aggregation_method: max
  - unit: ns
    alt_name: cpu_time
    groupby:
      - resource
      - user
      - project
    mutate: NUMBOOL
    extra_args:
      aggregation_method: max
Create the Secret CR for the Rating service with the custom configuration files:
$ oc create secret generic cloudkitty-secret \
  --from-file metrics.yaml -n openstack
Note: You can specify the --from-file option as many times as required to pass more than one configuration file to the Secret CR for customizing the service.
Verify that the Secret CR is created:
$ oc describe secret cloudkitty-secret
-
Open the OpenStackControlPlane CR file on your workstation, for example, openstack_control_plane.yaml.
Locate the service definition for telemetry and add the customConfigsSecretName field to the cloudkitty service:
spec:
  telemetry:
    template:
      ...
      cloudkitty:
        cloudKittyAPI:
          customConfigsSecretName: cloudkitty-secret
        cloudKittyProc:
          customConfigsSecretName: cloudkitty-secret
Update the control plane:
$ oc apply -f openstack_control_plane.yaml -n openstack
Your custom metrics.yaml file is copied from the Secret into the /etc/cloudkitty/ folder, replacing the default configuration file.
Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:
$ oc get openstackcontrolplane -n openstack
Verify that the custom configuration file is being used by the service:
$ oc rsh -c cloudkitty-proc cloudkitty-proc-0 \
  cat /etc/cloudkitty/metrics.yaml
3.4. Defining the Rating service metrics to collect
The metrics that the Rating service (cloudkitty) collects are configured in the metrics.yaml file.
The following template describes how to define the metrics to collect:
metrics:
  <metric_name>:
    unit: <unit_indicator>
    alt_name: <alternative_name>
    description: <metric_description>
    groupby:
      - <groupby_attribute>
      - <groupby_attribute>
      ...
      - <groupby_attribute>
    metadata:
      - <metadata_attribute>
    factor: <unit_conversion_factor>
    offset: <unit_conversion_offset>
    mutate: <unit_mutation>
    extra_args:
      - <list_of_optional_arguments>
Replace <metric_name> with the name of the metric to collect. This must match the name of the metric in Prometheus, for example, ceilometer_cpu. Use the following command to retrieve the names of the available metrics that you can configure:
$ oc rsh openstackclient openstack metric list
- unit: A string that indicates the unit in which the metric is stored after conversion. This string is informational only; it has no impact on the metric conversion or rating. For example, the unit could be set to GiB for the metric ceilometer_disk_root_size.
- groupby: A list of attribute names by which the metrics are grouped on collection. For example, you can group data by ID, project ID (project), domain ID, and user ID (user).
- metadata: A list of additional attribute names to collect for the given metric without grouping the metric by them. For example, you can collect the type attribute for the ceilometer_disk_root_size metric. The metadata attributes can be used for rating rules and they appear in summary reports.
- factor: An optional factor to use in calculating the conversion rate for the unit quantity. The default factor is 1. You can specify the factor as a float, integer, or fraction.
- offset: An optional offset to use in calculating the conversion rate for the unit quantity. The default offset is 0. You can specify the offset as a float, integer, or fraction.
- mutate: An optional operation to apply to the unit quantity after the conversion rate is calculated. You can use one of the following valid values:
  - NONE: This is the default. The collected data is not modified.
  - CEIL: The quantity is rounded up to the closest integer.
  - FLOOR: The quantity is rounded down to the closest integer.
  - NUMBOOL: If the collected quantity equals 0, leave it at 0. Otherwise, set it to 1.
  - NOTNUMBOOL: If the collected quantity equals 0, set it to 1. Otherwise, set it to 0.
  - MAP: Map arbitrary values to new values as defined through the mutate_map option (dictionary). If the value is not found in mutate_map, it is set to 0. If mutate_map is not defined or is empty, all values are set to 0. For example, the following Prometheus metric has a value of 0 when the instance is in the ACTIVE state, but operators might want to rate other non-zero states:

        metrics:
          openstack_nova_server_status:
            unit: instance
            mutate: MAP
            mutate_map:
              0.0: 1.0   # ACTIVE
              11.0: 1.0  # SHUTOFF
              12.0: 1.0  # SUSPENDED
              16.0: 1.0  # PAUSED
            groupby:
              - id
            metadata:
              - flavor_id
- alt_name: An optional string to use as the display name for the metric instead of the metric name in summary reports.
- description: An optional field for providing details about the configured rating type. The value is a string of up to 64 kB. When configured, this option is persisted as rating metadata and is available through the summary GET API.
- extra_args: An optional list of arguments to apply to the rating. You can use the following valid arguments:
  - aggregation_method: The aggregation method to use when retrieving measures from Prometheus. Defaults to max. Must be one of avg, min, max, sum, count, stddev, stdvar.
  - query_function: The function to apply to an instant vector after the aggregation_method or range_function has altered the data. Must be one of abs, ceil, exp, floor, ln, log2, log10, round, sqrt.
  - query_prefix: An arbitrary prefix to add to the Prometheus query generated by the Rating service, separated by a space.
  - query_suffix: An arbitrary suffix to add to the Prometheus query generated by the Rating service, separated by a space.
  - range_function: The function to apply instead of the implicit {aggregation_method}_over_time. Must be one of changes, delta, deriv, idelta, irange, irate, rate.
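Putting these options together, the following is a minimal sketch of a metrics.yaml entry that converts a root disk size reported in bytes to GiB and rounds up. The metric name, grouping attributes, and conversion factor are illustrative; verify them against the metrics available in your deployment:

```yaml
metrics:
  ceilometer_disk_root_size:
    unit: GiB
    factor: 1/1073741824   # illustrative: bytes to GiB, expressed as a fraction
    mutate: CEIL           # round up to the nearest whole GiB
    groupby:
      - id
      - project
    metadata:
      - type
```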
Chapter 4. RHOSO network observability
In Red Hat OpenStack Services on OpenShift (RHOSO) environments, you can use Red Hat OpenShift Container Platform (RHOCP) observability features and the RHOSO Telemetry service to access metrics for many aspects of your RHOSO deployment.
The Telemetry service includes Ceilometer and Prometheus.
Prometheus collects metrics from various components. It scrapes exporter endpoints on data plane nodes and control plane pods. You can configure Prometheus to drive a variety of functions including data logging, metrics viewing and queries on the RHOCP Observability dashboard, alarms, and autoscaling.
The RHOSO OpenStack network exporter (openstack-network-exporter) is a Prometheus exporter that runs on data plane nodes and control plane pods and collects OVN, OVS, and OVS-DPDK networking metrics to be scraped by Prometheus.
The RHOSO OpenStack network exporter collects a variety of metrics by default. You can access them through the RHOCP console dashboards. You can also scrape the network exporter endpoints with tools such as curl.
In addition to the default metrics collection, you can enable and disable the collection of extended metrics sets.
4.1. Available network metrics
The network exporter exposes OVN Controller metrics, OVS metrics, and OVS-DPDK PMD metrics. For example:
- OVN southbound database server metrics examples
- ovn_raft_cluster_server_vote: A metric with a constant 1 value labeled by database name, cluster uuid, server uuid and vote.
- ovn_raft_cluster_inbound_connections_total: Total number of inbound connections to the server, labeled by database name, cluster uuid, and server uuid.
- ovn_raft_cluster_leader: Value 1.0 if the server is the cluster leader for the given database or 0.0 if it is not, labeled by database name, cluster uuid, and server uuid.
- OVN northd metrics examples
- ovn_northd_pstream_open_total: Number of times that passive connections were opened for the remote peer to connect.
- ovn_northd_txn_aborted_total: Number of times the OVSDB transaction has been aborted.
- OVN Controller metrics examples
- ovnc_lflow_run: The number of times ovn-controller translated the Logical_Flow table in the OVN SB database into OpenFlow flows in the interval.
- ovn_txn_error: Number of times the OVSDB transaction has errored out.
- OVN logical router and logical router port metrics examples
- ovnc_router_port_traffic_pkts: Number of packets transmitted and received by a logical router port labeled by the logical datapath number and the logical port number.
- ovnc_router_port_traffic_bytes: Number of bytes transmitted and received by a logical router port labeled by the logical datapath number and the logical port number.
- OVS metrics examples
- ovs_bridge_port_count: Number of ports in a bridge, labeled by bridge name.
- ovs_bridge_flow_count: Number of openflow rules configured on a bridge, labeled by bridge name.
- OVS-DPDK PMD metrics examples
- ovs_pmd_tx_packets: Rate of packets transmitted by the PMD during the interval.
- ovs_pmd_rx_packets: Rate of packets received by the PMD during the interval.
You can view all of the available metrics and descriptions by running curl commands on a Compute node or control plane pod. For more information, see Accessing live network metrics from the command line with curl.
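These metrics can also be combined in PromQL queries, for example on the RHOCP Observe > Metrics page. The following queries are a sketch that assumes the metric names listed above are exposed in your deployment:

```promql
# Interfaces that currently report link state down
ovs_interface_link_state == 0

# The five bridges with the most ports
topk(5, ovs_bridge_port_count)
```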
4.2. Accessing network metrics snapshots with the RHOCP console
In Red Hat OpenStack Services on OpenShift (RHOSO) environments, you can view changes to a subset of the available networking metrics in charts on the RHOCP Console.
Prerequisites
- You have access to the RHOCP console as a cluster-admin user. For more information about RHOCP dashboards, see Reviewing monitoring dashboards as a cluster administrator in the RHOCP Monitoring Guide.
Procedure
- On the RHOCP Console, go to Observe > Dashboards.
- Choose Dashboard > Openstack/Openstack Network Exporter.
- (Optional) Select a Compute node from the Instance menu.
- (Optional) Select a Time Range and Refresh Interval.
4.3. Accessing live network metrics from the command line with curl
The OpenStack network exporter exposes metrics at HTTP endpoints. You can log in to a Compute node and use the curl command to retrieve metrics from those endpoints.
Prerequisites
- The administrator has created a project for you and has provided you with a clouds.yaml file for you to access the cloud.
- The python-openstackclient package resides on your workstation:

    $ dnf list installed python-openstackclient
- You can log in to Compute nodes.
Procedure
- Log in to a Compute node.
Example: To view all available metrics, run the command:
curl -k https://localhost:9105/metrics | more
Example: To view link state metrics, run the command:
curl -k -s https://localhost:9105/metrics | grep ovs_interface_link_state
- Sample output
# HELP ovs_interface_link_state The link state of the interface. Possible values are: up(1), down(0) or unknown(-1).
# TYPE ovs_interface_link_state gauge
ovs_interface_link_state{bridge="br-ex",interface="br-ex",port="br-ex",type="internal"} 1
ovs_interface_link_state{bridge="br-ex",interface="eth1",port="eth1",type="system"} 1
ovs_interface_link_state{bridge="br-ex",interface="patch-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3-to-br-int",port="patch-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3-to-br-int",type="patch"} 1
ovs_interface_link_state{bridge="br-ex",interface="vlan20",port="vlan20",type="internal"} 1
ovs_interface_link_state{bridge="br-ex",interface="vlan21",port="vlan21",type="internal"} 1
ovs_interface_link_state{bridge="br-ex",interface="vlan22",port="vlan22",type="internal"} 1
ovs_interface_link_state{bridge="br-ex",interface="vlan23",port="vlan23",type="internal"} 1
ovs_interface_link_state{bridge="br-int",interface="br-int",port="br-int",type="internal"} 0
ovs_interface_link_state{bridge="br-int",interface="ovn-3cc1e2-0",port="ovn-3cc1e2-0",type="geneve"} 1
ovs_interface_link_state{bridge="br-int",interface="ovn-737fe9-0",port="ovn-737fe9-0",type="geneve"} 1
ovs_interface_link_state{bridge="br-int",interface="ovn-7488f5-0",port="ovn-7488f5-0",type="geneve"} 1
ovs_interface_link_state{bridge="br-int",interface="ovn-87e045-0",port="ovn-87e045-0",type="geneve"} 1
ovs_interface_link_state{bridge="br-int",interface="ovn-acb76a-0",port="ovn-acb76a-0",type="geneve"} 1
ovs_interface_link_state{bridge="br-int",interface="patch-br-int-to-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3",port="patch-br-int-to-provnet-e0353205-c937-4ebb-af35-a7db0d85c9d3",type="patch"} 1
ovs_interface_link_state{bridge="br-int",interface="tap32b26818-e0",port="tap32b26818-e0",type="system"} 1
ovs_interface_link_state{bridge="br-int",interface="tapba83f025-aa",port="tapba83f025-aa",type="system"} 1
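Once you have scraped output like this, you can post-process it with standard text tools. The following sketch counts interfaces whose link state is down; the here-doc stands in for real scraped output, and the sample lines are illustrative:

```shell
# Count ovs_interface_link_state samples whose value is 0 (link down).
# The here-doc replaces `curl -ks https://localhost:9105/metrics` so the
# snippet is self-contained; in practice, pipe the curl output instead.
cat <<'EOF' | awk '/^ovs_interface_link_state/ && $2 == 0 {n++} END {print n+0}'
ovs_interface_link_state{bridge="br-ex",interface="eth1",port="eth1",type="system"} 1
ovs_interface_link_state{bridge="br-int",interface="br-int",port="br-int",type="internal"} 0
EOF
```

For the sample lines above, this prints 1: one interface reports link state down.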
4.4. Enabling and disabling extended metrics
Enable extended poll mode driver (PMD) metrics to debug your RHOSO networks, and then disable the metrics when you are finished.
Extended PMD metrics collection is disabled by default because PMD metrics collection can increase CPU load and impact performance.
Prerequisites
- You can log in to Compute nodes as root.
Procedure
- Log in to a Compute node.
Reset and enable extended PMD metrics:
    # ovs-appctl dpif-netdev/pmd-stats-clear
    # ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true
Disable extended PMD metrics:
# ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=false
Chapter 5. Monitoring RHOSO networks with alerts
You can create alerts to monitor your RHOSO networks based on any of the OVS, OVN, and OVS-DPDK metrics available through the RHOSO OpenStack network exporter (openstack-network-exporter).
For example, you can create an alert that activates when an interface resets (flaps) more than three times in five minutes. This alert would monitor the ovs_interface_link_resets metric.
In your RHOSO environment, you use standard RHOSO, OpenShift, and Prometheus monitoring toolkit management tools to create, manage, and consume alerts.
When you deploy a RHOSO environment as instructed in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift, the deployment is already enabled with the ability to create alerts.
5.1. Consuming RHOSO network alerts
You can receive alerts about your RHOSO environment, including OVS and OVN networking alerts, in several ways:
- Configure web pages to display them. For more information, see Notification integrations in the Prometheus documentation set. Also see Creating an alert route in Alertmanager in Alerts in the Telemetry Operator.
- Run curl commands from the command line on a control plane or data plane node.
- Configure alerts by creating routes to external channels such as web sites, chat and email platforms, and text and pager systems.
5.2. RHOSO alert structure and management
RHOSO network alerts are managed by the open source monitoring and alerting toolkit Prometheus. Prometheus includes a time-series database, methods for fetching metrics from endpoints in your deployment, and the Alertmanager feature for managing alerts.
RHOSO uses the Cluster Observability Operator (COO) to deploy and manage these tools. The COO is deployed automatically in your environment by the OpenStack Operator.
You create alerts within custom resources (CR) that define Prometheus rule objects. Within a Prometheus rule object you create groups of rules. Each group contains one or more rules. Each rule contains one or more alerts. Each alert contains a PromQL expression, which uses the Prometheus query language to define the conditions that fire the alert.
Alert states include pending, firing, and inactive.
5.2.1. Alert rule structure example in a CR file
The following example shows the general structure of a PrometheusRule object definition, including two groups.
PrometheusRule object
groups:
- name: group1
rules:
- alert: AlertName1
expr: ... <PromQL expression>
- alert: AlertName2
expr: ...
- name: group2
rules:
  - alert: ...

5.3. PromQL expressions use metrics to define alert conditions
The functional center of an alert rule is a Prometheus Query Language (PromQL) expression. You can use PromQL to select, aggregate, and analyze environmental data in real time in the Prometheus monitoring system.
You can create a PromQL expression using any available metrics, including the OVS, OVN, and OVS-DPDK network metrics exposed by the RHOSO OpenStack network exporter (openstack-network-exporter). In the following example, the PromQL expression in the expr field defines an alert that fires in response to excessive interface resets.
rules:
- alert: OVSInterfaceLinkFlappingWarning
expr: |
(
increase(ovs_interface_link_resets[5m]) > 3
)
for: 1m
For information on the OVS and OVN network metrics exposed by the openstack-network-exporter, see RHOSO network observability.
5.4. Silencing RHOSO alerts
You can silence an alert for a specific time period. For example, you can silence an alert during a scheduled maintenance period, or to stop notifications when you are troubleshooting a known issue. The silenced alert fires, but Alertmanager does not send notifications.
For more information, see Managing alerts in Red Hat OpenShift Container Platform Monitoring.
5.5. Creating RHOSO network alerts
To set up alerts that notify you of important operational conditions, you create alert groups, alert rules, alerts, and alert expressions in a PrometheusRule custom resource (CR). The CR uses apiVersion: monitoring.rhobs/v1 and kind: PrometheusRule.
Prerequisites
- The Red Hat OpenStack Services on OpenShift (RHOSO) environment is deployed on a Red Hat OpenShift Container Platform (RHOCP) cluster. For more information, see Deploying Red Hat OpenStack Services on OpenShift.
- You are logged in to a workstation that has access to the RHOCP cluster as a user with cluster-admin privileges.
- The Telemetry service is enabled and configured on the control plane. For more information, see the telemetry service configuration example under Add the following service configurations in Creating the control plane in Deploying Red Hat OpenStack Services on OpenShift.
Procedure
Create a file on your workstation to define the PrometheusRule CR, for example, openstack-observability-services-alerts.yaml:

    apiVersion: monitoring.rhobs/v1
    kind: PrometheusRule
    metadata:
      labels:
        service: metricStorage
      name: openstack-observability-services-alerts
      namespace: openstack
    spec:
      groups:
        - name: openstack-observability.ovs.interface
          rules:
            - alert: OVSInterfaceLinkFlappingWarning
              expr: |
                (
                  increase(ovs_interface_link_resets[5m]) > 3
                )
              for: 1m
              labels:
                severity: warning
              annotations:
                summary: "OVS interface link flapping (warning)"
                description: |
                  Interface {{ $labels.interface }} on {{ $labels.fqdn }} has more than 3 link resets in the last 5 minutes.
                  Bridge: {{ $labels.bridge }}

Create the PrometheusRule object:
$ oc create -f openstack-observability-services-alerts.yaml
The Cluster Observability Operator (COO) loads the rule into Prometheus.
Verify that the COO loaded the rules into Prometheus:
$ oc get prometheusrules.monitoring.rhobs -n openstack
Note: You must pass the entire CRD name, prometheusrules.monitoring.rhobs, because there is a different PrometheusRule CRD that provides the rules for the RHOCP Monitoring API, prometheusrules.monitoring.coreos.com.