Observability
Abstract
Observability features include administrator and developer metrics, cluster logging, and tracing.
Chapter 1. Administrator metrics
1.1. OpenShift Serverless administrator metrics
Metrics enable cluster administrators to monitor how OpenShift Serverless cluster components and workloads are performing.
For details about viewing metrics for OpenShift Serverless, see the OpenShift Container Platform monitoring documentation.
1.1.1. Prerequisites for OpenShift Serverless administrator metrics
- You have enabled metrics for your cluster.
- You have access to an account with cluster administrator access (or dedicated administrator access for OpenShift Dedicated or Red Hat OpenShift Service on AWS).
If Service Mesh is enabled with mTLS, metrics for Knative Serving are disabled by default because Service Mesh prevents Prometheus from scraping metrics.
To resolve this issue, see the Enabling Knative Serving metrics when using Service Mesh with mTLS section in the Service Mesh integration documentation.
Scraping the metrics does not affect autoscaling of a Knative service, because scraping requests do not go through the activator. Consequently, no scraping takes place if no pods are running.
1.2. Serverless controller metrics
Any component that implements controller logic emits the following metrics. These metrics provide details about reconcile operations and about the work queue on which reconcile requests are enqueued.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `work_queue_depth` | The depth of the work queue. | Gauge | `reconciler` | Integer (no units) |
| `reconcile_count` | The number of reconcile operations. | Counter | `reconciler`, `success` | Integer (no units) |
| `reconcile_latency` | The latency of reconcile operations. | Histogram | `reconciler`, `success` | Milliseconds |
| `workqueue_adds_total` | The total number of add actions handled by the work queue. | Counter | `name` | Integer (no units) |
| `workqueue_queue_latency_seconds` | The length of time an item stays in the work queue before being requested. | Histogram | `name` | Seconds |
| `workqueue_retries_total` | The total number of retries that have been handled by the work queue. | Counter | `name` | Integer (no units) |
| `workqueue_work_duration_seconds` | The length of time it takes to process an item from the work queue. | Histogram | `name` | Seconds |
| `workqueue_unfinished_work_seconds` | The length of time that outstanding work queue items have been in progress. | Histogram | `name` | Seconds |
| `workqueue_longest_running_processor_seconds` | The length of time that the longest outstanding work queue item has been in progress. | Histogram | `name` | Seconds |
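For example, assuming the metric names above and the conventional `_bucket` suffix that Prometheus adds to histogram metrics, an illustrative query for the 95th percentile reconcile latency per reconciler looks like the following:

histogram_quantile(0.95, sum(rate(reconcile_latency_bucket[5m])) by (reconciler, le))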
1.3. Webhook metrics
Webhook metrics report useful information about operations. For example, if a large number of operations fail, this might indicate an issue with a user-created resource.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `request_count` | The number of requests that are routed to the webhook. | Counter | `admission_allowed`, `kind_group`, `kind_kind`, `kind_version`, `request_operation`, `resource_group`, `resource_namespace`, `resource_resource`, `resource_version` | Integer (no units) |
| `request_latencies` | The response time for a webhook request. | Histogram | `admission_allowed`, `kind_group`, `kind_kind`, `kind_version`, `request_operation`, `resource_group`, `resource_namespace`, `resource_resource`, `resource_version` | Milliseconds |
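For example, to watch for rejected admission requests, you can rate the webhook request count and group by resource kind and operation. The following query is illustrative and assumes the metric and tag names listed above:

sum(rate(request_count{admission_allowed="false"}[5m])) by (kind_kind, request_operation)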
1.4. Knative Eventing metrics
Cluster administrators can view the following metrics for Knative Eventing components. By aggregating the metrics by HTTP code, you can separate the events into two categories: successful events (2xx) and failed events (5xx).
1.4.1. Broker ingress metrics
You can use the following metrics to debug the broker ingress, evaluate its performance, and identify the events that the ingress component dispatches.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events received by a broker. | Counter | `broker_name`, `event_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Integer (no units) |
| `event_dispatch_latencies` | The time taken to dispatch an event to a channel. | Histogram | `broker_name`, `event_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Milliseconds |
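For example, the following illustrative queries separate ingress events into the successful and failed categories described above, assuming the `response_code_class` tag listed in the table:

sum(rate(event_count{response_code_class="2xx"}[5m]))
sum(rate(event_count{response_code_class="5xx"}[5m]))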
1.4.2. Broker filter metrics
You can use the following metrics to debug broker filters, evaluate their performance, and confirm that the filters dispatch events. You can also measure the latency of event filtering.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events received by a broker. | Counter | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `trigger_name`, `unique_name` | Integer (no units) |
| `event_dispatch_latencies` | The time taken to dispatch an event to a channel. | Histogram | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `trigger_name`, `unique_name` | Milliseconds |
| `event_processing_latencies` | The time required to process an event before dispatching it to a trigger subscriber. | Histogram | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `trigger_name`, `unique_name` | Milliseconds |
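For example, an illustrative query for the 99th percentile of event filtering latency per trigger, assuming the metric names above and the conventional `_bucket` suffix for Prometheus histograms:

histogram_quantile(0.99, sum(rate(event_processing_latencies_bucket[5m])) by (trigger_name, le))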
1.4.3. InMemoryChannel dispatcher metrics
You can use the following metrics to debug InMemoryChannel channels, evaluate their performance, and identify the events that the channels dispatch.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events dispatched by `InMemoryChannel` channels. | Counter | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Integer (no units) |
| `event_dispatch_latencies` | The time taken to dispatch an event from an `InMemoryChannel` channel. | Histogram | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Milliseconds |
1.4.4. Event source metrics
You can use the following metrics to verify that the event source delivered events to the connected event sink.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events sent by the event source. | Counter | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Integer (no units) |
| `retry_event_count` | Number of events that the event source retried after the initial delivery attempt failed. | Counter | `broker_name`, `container_name`, `filter_type`, `namespace_name`, `response_code`, `response_code_class`, `unique_name` | Integer (no units) |
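For example, you can estimate the retry ratio for a source by dividing the retry rate by the overall send rate. The following query is illustrative and assumes the metric names listed above:

sum(rate(retry_event_count[5m])) / sum(rate(event_count[5m]))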
1.4.5. Knative Kafka broker metrics
You can use the following metrics to debug and visualize the performance of the Kafka broker.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events received by a broker. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a Kafka cluster. | Histogram | | Milliseconds |
| `consumer_group_expected_replicas` | Number of expected replicas for a given Kafka consumer group resource. | Gauge | | Dimensionless |
| `consumer_group_ready_replicas` | Number of ready replicas for a given Kafka consumer group resource. | Gauge | | Dimensionless |
Note: For the consumer group replica metrics, resources refer to user-facing entities such as Kafka sources, triggers, and subscriptions. Avoid using internal or generated names when using these resources.
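For example, an illustrative alerting-style query that reports consumer groups whose ready replicas lag behind the expected replicas, assuming the gauge names listed above:

consumer_group_expected_replicas - consumer_group_ready_replicas > 0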
1.4.6. Knative Kafka trigger metrics
You can use the following metrics to debug and visualize the performance of Kafka triggers.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events dispatched by a trigger to a subscriber. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a subscriber. | Histogram | | Milliseconds |
| `event_processing_latencies` | The time spent processing and filtering an event. | Histogram | | Milliseconds |
1.4.7. Knative Kafka channel metrics
You can use the following metrics to debug and visualize the performance of the Kafka channel.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events received by a Kafka channel. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a Kafka cluster. | Histogram | | Milliseconds |
1.4.8. Knative Kafka subscription metrics
You can use the following metrics to debug and visualize the performance of subscriptions associated with the Kafka channel.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events dispatched by a subscription to a subscriber. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a subscriber. | Histogram | | Milliseconds |
| `event_processing_latencies` | The time spent processing an event. | Histogram | | Milliseconds |
1.4.9. Knative Kafka source metrics
You can use the following metrics to debug and visualize the performance of Kafka sources.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events dispatched by a Kafka source. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a sink. | Histogram | | Milliseconds |
| `event_processing_latencies` | The time spent processing an event. | Histogram | | Milliseconds |
| `consumer_group_expected_replicas` | Number of expected replicas for a given Kafka consumer group resource. | Gauge | | Dimensionless |
| `consumer_group_ready_replicas` | Number of ready replicas for a given Kafka consumer group resource. | Gauge | | Dimensionless |
Note: For the consumer group replica metrics, resources refer to user-facing entities such as Kafka sources, triggers, and subscriptions. Avoid using internal or generated names when using these resources.
1.4.10. Knative Kafka sink metrics
You can use the following metrics to debug and visualize the performance of Kafka sinks.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `event_count` | Number of events received by a Kafka sink. | Counter | | Dimensionless |
| `event_dispatch_latencies` | The time spent dispatching an event to a Kafka cluster. | Histogram | | Milliseconds |
1.5. Knative Serving metrics
Cluster administrators can view the following metrics for Knative Serving components.
1.5.1. Activator metrics
You can use the following metrics to understand how applications respond when traffic passes through the activator.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `request_concurrency` | The number of concurrent requests that are routed to the activator, or average concurrency over a reporting period. | Gauge | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `revision_name`, `service_name` | Integer (no units) |
| `request_count` | The number of requests that are routed to the activator. The activator handler processes these requests. | Counter | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Integer (no units) |
| `request_latencies` | The response time in milliseconds for a fulfilled, routed request. | Histogram | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Milliseconds |
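For example, an illustrative query for the average request concurrency reported by the activator per revision, assuming the metric and tag names listed above:

avg(request_concurrency) by (revision_name, namespace_name)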
1.5.2. Autoscaler metrics
The autoscaler component exposes several metrics related to autoscaler behavior for each revision. For example, you can monitor the number of pods that the autoscaler targets for a service. You can also monitor the average requests per second during the stable window and whether the autoscaler enters panic mode when using the Knative pod autoscaler (KPA).
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `desired_pods` | The number of pods the autoscaler tries to assign for a service. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `excess_burst_capacity` | The excess burst capacity served over the stable window. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `stable_request_concurrency` | The average number of requests for each observed pod over the stable window. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `panic_request_concurrency` | The average number of requests for each observed pod over the panic window. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `target_concurrency_per_pod` | The number of concurrent requests that the autoscaler tries to send to each pod. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `stable_requests_per_second` | The average number of requests-per-second for each observed pod over the stable window. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `panic_requests_per_second` | The average number of requests-per-second for each observed pod over the panic window. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `target_requests_per_second` | The number of requests-per-second that the autoscaler targets for each pod. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `panic_mode` | This value is 1 if the autoscaler is in panic mode, or 0 if it is not in panic mode. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `requested_pods` | The number of pods that the autoscaler has requested from the Kubernetes cluster. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `actual_pods` | The number of pods the system allocates that are currently ready. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `not_ready_pods` | The number of pods that have a not ready state. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `pending_pods` | The number of pods that are currently pending. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
| `terminating_pods` | The number of pods that are currently terminating. | Gauge | `configuration_name`, `namespace_name`, `revision_name`, `service_name` | Integer (no units) |
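For example, an illustrative query that surfaces revisions currently in panic mode, assuming the `panic_mode` gauge listed above:

panic_mode == 1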
1.5.3. Go runtime metrics
Each Knative Serving control plane process emits many Go runtime memory statistics, as defined in MemStats.
The `name` tag for each metric is an empty tag.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `go_alloc` | The number of bytes of allocated heap objects. This metric is the same as `go_heap_alloc`. | Gauge | `name` | Integer (no units) |
| `go_total_alloc` | The cumulative bytes allocated for heap objects. | Gauge | `name` | Integer (no units) |
| `go_sys` | The total bytes of memory obtained from the operating system. | Gauge | `name` | Integer (no units) |
| `go_lookups` | The number of pointer lookups performed by the runtime. | Gauge | `name` | Integer (no units) |
| `go_mallocs` | The cumulative count of heap objects allocated. | Gauge | `name` | Integer (no units) |
| `go_frees` | The cumulative count of heap objects that have been freed. | Gauge | `name` | Integer (no units) |
| `go_heap_alloc` | The number of bytes of allocated heap objects. | Gauge | `name` | Integer (no units) |
| `go_heap_sys` | The number of bytes of heap memory obtained from the operating system. | Gauge | `name` | Integer (no units) |
| `go_heap_idle` | The number of bytes in idle, unused spans. | Gauge | `name` | Integer (no units) |
| `go_heap_in_use` | The number of bytes in spans that are currently in use. | Gauge | `name` | Integer (no units) |
| `go_heap_released` | The number of bytes of physical memory returned to the operating system. | Gauge | `name` | Integer (no units) |
| `go_heap_objects` | The number of allocated heap objects. | Gauge | `name` | Integer (no units) |
| `go_stack_in_use` | The number of bytes in stack spans that are currently in use. | Gauge | `name` | Integer (no units) |
| `go_stack_sys` | The number of bytes of stack memory obtained from the operating system. | Gauge | `name` | Integer (no units) |
| `go_mspan_in_use` | The number of bytes of allocated `mspan` structures. | Gauge | `name` | Integer (no units) |
| `go_mspan_sys` | The number of bytes of memory obtained from the operating system for `mspan` structures. | Gauge | `name` | Integer (no units) |
| `go_mcache_in_use` | The number of bytes of allocated `mcache` structures. | Gauge | `name` | Integer (no units) |
| `go_mcache_sys` | The number of bytes of memory obtained from the operating system for `mcache` structures. | Gauge | `name` | Integer (no units) |
| `go_bucket_hash_sys` | The number of bytes of memory in profiling bucket hash tables. | Gauge | `name` | Integer (no units) |
| `go_gc_sys` | The number of bytes of memory in garbage collection metadata. | Gauge | `name` | Integer (no units) |
| `go_other_sys` | The number of bytes of memory in miscellaneous, off-heap runtime allocations. | Gauge | `name` | Integer (no units) |
| `go_next_gc` | The target heap size of the next garbage collection cycle. | Gauge | `name` | Integer (no units) |
| `go_last_gc` | The time that the last garbage collection was completed. | Gauge | `name` | Nanoseconds |
| `go_total_gc_pause_ns` | The cumulative time in garbage collection stop-the-world pauses since the program started. | Gauge | `name` | Nanoseconds |
| `go_num_gc` | The number of completed garbage collection cycles. | Gauge | `name` | Integer (no units) |
| `go_num_forced_gc` | The number of garbage collection cycles that were forced due to an application calling the garbage collection function. | Gauge | `name` | Integer (no units) |
| `go_gc_cpu_fraction` | The fraction of the available CPU time of the program that has been used by the garbage collector since the program started. | Gauge | `name` | Integer (no units) |
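For example, an illustrative query that tracks how quickly each control plane process accumulates garbage collection stop-the-world pause time, assuming the metric names listed above:

rate(go_total_gc_pause_ns[5m])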
Chapter 2. Developer metrics
2.1. OpenShift Serverless developer metrics overview
Metrics enable developers to monitor how Knative services are performing. You can use the OpenShift Container Platform monitoring stack to record and view health checks and metrics for your Knative services.
For details about viewing metrics for OpenShift Serverless, see the OpenShift Container Platform monitoring documentation.
If Service Mesh is enabled with mTLS, metrics for Knative Serving are disabled by default because Service Mesh prevents Prometheus from scraping metrics.
To resolve this issue, see the Enabling Knative Serving metrics when using Service Mesh with mTLS section in the Service Mesh integration documentation.
Scraping the metrics does not affect autoscaling of a Knative service, because scraping requests do not go through the activator. Consequently, no scraping takes place if no pods are running.
2.1.1. Additional resources
- Monitoring overview
- Enabling monitoring for user-defined projects
- Enabling Knative Serving metrics when using Service Mesh with mTLS
2.2. Knative service metrics exposed by default
Knative services expose a set of default metrics that give insights into request traffic, performance, and system behavior.
2.2.1. Default metrics for Knative services
The following table describes the metrics that Knative services expose by default on port 9091, including their units, types, descriptions, and metric tags.
Table 2.1. Metrics exposed by default for each Knative service on port 9091
| Metric name, unit, and type | Description | Metric tags |
|---|---|---|
| `revision_request_count` Metric unit: dimensionless Metric type: counter | The number of requests that are routed to the `queue-proxy` container. | configuration_name="event-display", container_name="queue-proxy", namespace_name="apiserversource1", pod_name="event-display-00001-deployment-658fd4f9cf-qcnr5", response_code="200", response_code_class="2xx", revision_name="event-display-00001", service_name="event-display" |
| `revision_request_latencies` Metric unit: milliseconds Metric type: histogram | The response time in milliseconds. | configuration_name="event-display", container_name="queue-proxy", namespace_name="apiserversource1", pod_name="event-display-00001-deployment-658fd4f9cf-qcnr5", response_code="200", response_code_class="2xx", revision_name="event-display-00001", service_name="event-display" |
| `revision_app_request_count` Metric unit: dimensionless Metric type: counter | The number of requests that are routed to the `user-container` container. | configuration_name="event-display", container_name="queue-proxy", namespace_name="apiserversource1", pod_name="event-display-00001-deployment-658fd4f9cf-qcnr5", response_code="200", response_code_class="2xx", revision_name="event-display-00001", service_name="event-display" |
| `revision_app_request_latencies` Metric unit: milliseconds Metric type: histogram | The response time in milliseconds. | configuration_name="event-display", container_name="queue-proxy", namespace_name="apiserversource1", pod_name="event-display-00001-deployment-658fd4f9cf-qcnr5", response_code="200", response_code_class="2xx", revision_name="event-display-00001", service_name="event-display" |
| `revision_queue_depth` Metric unit: dimensionless Metric type: gauge | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency. | configuration_name="event-display", container_name="queue-proxy", namespace_name="apiserversource1", pod_name="event-display-00001-deployment-658fd4f9cf-qcnr5", response_code="200", response_code_class="2xx", revision_name="event-display-00001", service_name="event-display" |
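For example, an illustrative query that computes the fraction of application requests that fail, assuming the default metric and tag names listed above:

sum(rate(revision_app_request_count{response_code_class!="2xx"}[5m])) / sum(rate(revision_app_request_count[5m]))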
2.3. Knative service with custom application metrics
You can extend the set of metrics exported by a Knative service. The exact implementation depends on your application and the language used.
2.3.1. Go application example for exporting custom metrics
The following example shows a Go application that exposes a custom Prometheus metric to track the total number of processed events:
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// opsProcessed counts the total number of processed events.
	opsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "myapp_processed_ops_total",
		Help: "The total number of processed events",
	})
)

func handler(w http.ResponseWriter, r *http.Request) {
	log.Print("helloworld: received a request")
	target := os.Getenv("TARGET")
	if target == "" {
		target = "World"
	}
	fmt.Fprintf(w, "Hello %s!\n", target)
	opsProcessed.Inc()
}

func main() {
	log.Print("helloworld: starting server...")
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	http.HandleFunc("/", handler)
	// Serve metrics requests on a separate server and port.
	go func() {
		mux := http.NewServeMux()
		server := &http.Server{
			Addr:    ":9095",
			Handler: mux,
		}
		mux.Handle("/metrics", promhttp.Handler())
		log.Printf("prometheus: listening on port %d", 9095)
		log.Fatal(server.ListenAndServe())
	}()
	// Alternatively, serve metrics on the same port as normal requests:
	//http.Handle("/metrics", promhttp.Handler())
	log.Printf("helloworld: listening on port %s", port)
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}

- `github.com/prometheus/client_golang/prometheus`: Imports the Prometheus client packages.
- `opsProcessed = promauto.NewCounter`: Defines the `opsProcessed` metric.
- `opsProcessed.Inc()`: Increments the `opsProcessed` metric.
- `go func()`: Configures a separate server, on port 9095, for metrics requests.
- `http.Handle`: Commented-out alternative that serves metrics on the same port as normal requests, under the `/metrics` subpath.
2.4. Configuration for scraping custom metrics
Custom metrics scraping is performed by a Prometheus instance dedicated to user workload monitoring. After you enable user workload monitoring and create the application, you need a configuration that defines how the monitoring stack scrapes the metrics.
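If user workload monitoring is not yet enabled, the following is a minimal sketch of the cluster monitoring ConfigMap that turns it on; see "Enabling monitoring for user-defined projects" for the authoritative procedure:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enables the Prometheus instance that scrapes user-defined projects
    enableUserWorkload: true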
2.4.1. Knative service metrics scraping configuration example
The following example defines a Knative service and a ServiceMonitor resource to enable metrics scraping for the application. The exact configuration depends on the application and how it exports metrics.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
    spec:
      containers:
        - image: docker.io/skonto/helloworld-go:metrics
          resources:
            requests:
              cpu: "200m"
          env:
            - name: TARGET
              value: "Go Sample v1"
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: helloworld-go-sm
  name: helloworld-go-sm
spec:
  endpoints:
    - port: queue-proxy-metrics
      scheme: http
    - port: app-metrics
      scheme: http
  namespaceSelector: {}
  selector:
    matchLabels:
      name: helloworld-go-sm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: helloworld-go-sm
  name: helloworld-go-sm
spec:
  ports:
    - name: queue-proxy-metrics
      port: 9091
      protocol: TCP
      targetPort: 9091
    - name: app-metrics
      port: 9095
      protocol: TCP
      targetPort: 9095
  selector:
    serving.knative.dev/service: helloworld-go
  type: ClusterIP

- `apiVersion: serving.knative.dev/v1`: The application specification.
- `apiVersion: monitoring.coreos.com/v1`: The configuration that defines which application metrics are scraped.
- `apiVersion: v1`: The configuration that defines how the metrics are scraped.
2.5. Examining metrics of a service
After you have configured the application to export the metrics and the monitoring stack to scrape them, you can examine the metrics in the web console.
2.5.1. Viewing Knative service metrics in the web console
The following procedure describes how to view and query Knative service and application metrics by using the OpenShift Container Platform web console.
Prerequisites
- You have logged in to the OpenShift Container Platform web console.
- You have installed the OpenShift Serverless Operator and Knative Serving.
Procedure
- Optional: Run requests against your application that you will be able to see in the metrics:

  $ hello_route=$(oc get ksvc helloworld-go -n ns1 -o jsonpath='{.status.url}') && \
    curl $hello_route

  You get an output similar to the following example:

  Hello Go Sample v1!

- In the web console, navigate to the Observe → Metrics interface.
- In the input field, enter the query for the metric you want to observe, for example:

  revision_app_request_count{namespace="ns1", job="helloworld-go-sm"}

  Another example:

  myapp_processed_ops_total{namespace="ns1", job="helloworld-go-sm"}

- Observe the visualized metrics.
2.5.2. Queue proxy metrics
Each Knative service has a proxy container that proxies the connections to the application container. Several metrics are reported for the queue proxy performance.
You can use the following metrics to determine whether requests are queued at the proxy side, and to measure the actual delay in serving requests at the application side.
| Metric name | Description | Type | Tags | Unit |
|---|---|---|---|---|
| `revision_request_count` | The number of requests that are routed to `queue-proxy`. | Counter | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Integer (no units) |
| `revision_request_latencies` | The response time of revision requests. | Histogram | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Milliseconds |
| `revision_app_request_count` | The number of requests that are routed to the `user-container`. | Counter | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Integer (no units) |
| `revision_app_request_latencies` | The response time of revision app requests. | Histogram | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Milliseconds |
| `revision_queue_depth` | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency. | Gauge | `configuration_name`, `container_name`, `namespace_name`, `pod_name`, `response_code`, `response_code_class`, `revision_name`, `service_name` | Integer (no units) |
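For example, an illustrative query that shows whether requests are queuing at the proxy side, assuming the `revision_queue_depth` gauge listed above:

max(revision_queue_depth) by (revision_name, namespace_name)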
2.6. Dashboard for service metrics
You can examine the metrics by using a dedicated dashboard that aggregates queue proxy metrics by namespace.
2.6.1. Examining metrics of a service in the dashboard
You can monitor the performance and behavior of Knative services by using the metrics dashboard in the OpenShift Container Platform web console. The dashboard displays queue proxy metrics that help you understand request patterns, latency, and throughput for your serverless applications.
Prerequisites
- You have logged in to the OpenShift Container Platform web console.
- You have installed the OpenShift Serverless Operator and Knative Serving.
Procedure
- In the web console, navigate to the Observe → Dashboards interface.
- Select the Knative User Services (Queue Proxy metrics) dashboard.
- Select the Namespace, Configuration, and Revision that correspond to your application.
- Observe the visualized metrics.
Chapter 3. Cluster logging
3.1. Configuring log settings for Serving and Eventing
You can configure logging for OpenShift Serverless Serving and OpenShift Serverless Eventing by using the KnativeServing and KnativeEventing custom resources (CRs). The specified loglevel value determines the logging level.
3.1.1. Supported log levels
The following loglevel values are supported:
Table 3.1. Supported log levels
| Log level | Description |
|---|---|
| `debug` | Fine-grained debugging |
| `info` | Normal logging |
| `warn` | Unexpected but non-critical errors |
| `error` | Critical errors; unexpected during normal operation |
| `dpanic` | In debug mode, trigger a panic (crash) |
Using the debug level for production might negatively affect performance.
3.1.2. Configuring log settings
You can configure logging for Serving and Eventing in the KnativeServing custom resource (CR) and KnativeEventing CR.
Procedure
Configure the log settings for Serving and Eventing by setting or modifying the `loglevel` value in the `KnativeServing` and `KnativeEventing` CRs. The following two example configurations set all possible logging options to level `info`:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    logging:
      loglevel.controller: "info"
      loglevel.autoscaler: "info"
      loglevel.queueproxy: "info"
      loglevel.webhook: "info"
      loglevel.activator: "info"
      loglevel.hpaautoscaler: "info"
      loglevel.net-certmanager-controller: "info"
      loglevel.net-istio-controller: "info"
      loglevel.net-kourier-controller: "info"

apiVersion: operator.knative.dev/v1beta1
kind: KnativeEventing
metadata:
  name: knative-eventing
  namespace: knative-eventing
spec:
  config:
    logging:
      loglevel.controller: "info"
      loglevel.eventing-webhook: "info"
      loglevel.inmemorychannel-dispatcher: "info"
      loglevel.inmemorychannel-webhook: "info"
      loglevel.mt-broker-controller: "info"
      loglevel.mt_broker_filter: "info"
      loglevel.mt_broker_ingress: "info"
      loglevel.pingsource-mt-adapter: "info"
3.1.3. Configuring request log settings
You can configure request logging for your service in the observability field of your KnativeServing custom resource (CR).
For information about the available parameters for configuring request logging, see "Parameters of request logging".
Procedure
Configure request logging for your service by modifying the `observability` field in your `KnativeServing` CR, as shown in the following example:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
# ...
spec:
  config:
    observability:
      logging.enable-request-log: "true"
      logging.enable-probe-request-log: "true"
      logging.request-log-template: '{"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}'
# ...
3.1.4. Parameters of request logging
The following table describes parameters used to configure request logging.
Table 3.2. Request logging configuration parameters
| Parameter | Type | Description |
|---|---|---|
| `logging.enable-request-log` | Boolean (`"true"` or `"false"`) | Set to `"true"` to enable request logging. |
| `logging.enable-probe-request-log` | Boolean (`"true"` or `"false"`) | Set to `"true"` to enable probe request logging. |
| `logging.request-log-template` | Go template | Defines the shape of the request logs. Use a single line to prevent logs from being split into many records. |
The `logging.request-log-template` parameter includes the following functions:

- `Request` is an `http.Request` representing an HTTP request received by the server.
- `Response` represents the HTTP response and includes the following fields:
  - `Code` is the HTTP status code.
  - `Size` is the response size in bytes.
  - `Latency` is the response latency in seconds.
- `Revision` has revision details and includes the following fields:
  - `Name` is the name of the revision.
  - `Namespace` is the namespace of the revision.
  - `Service` is the name of the service.
  - `Configuration` is the name of the configuration.
  - `PodName` is the name of the pod hosting the revision.
  - `PodIP` is the IP address of the hosting pod.
Chapter 4. Tracing
4.1. Tracing requests
Distributed tracing records the path of a request through the various services that make up an application. It ties together information about different units of work, so that you can understand the whole chain of events in a distributed transaction. The units of work might be executed in different processes or on different hosts.
4.1.1. Distributed tracing overview
As a service owner, you can use Red Hat OpenShift distributed tracing to instrument your services to gather insights into your service architecture. You can use distributed tracing for monitoring, network profiling, and troubleshooting the interaction between components in modern, cloud-native, microservices-based applications.
With distributed tracing you can perform the following functions:
- Monitor distributed transactions
- Optimize performance and latency
- Perform root cause analysis
Red Hat OpenShift distributed tracing consists of two main components:
- Red Hat OpenShift distributed tracing platform - This component uses the open source Jaeger project.
- Red Hat OpenShift distributed tracing data collection - This component uses the open source OpenTelemetry project.
Both components use vendor-neutral OpenTracing APIs and instrumentation.
4.1.2. Additional resources
- Red Hat OpenShift distributed tracing architecture
- Installing Red Hat OpenShift distributed tracing
- Jaeger project
- OpenTelemetry project
- OpenTracing
4.2. Using Red Hat OpenShift distributed tracing
You can use Red Hat OpenShift distributed tracing with OpenShift Serverless to monitor and troubleshoot serverless applications.
4.2.1. Using Red Hat OpenShift distributed tracing to enable distributed tracing
Red Hat OpenShift distributed tracing is made up of several components that work together to collect, store, and display tracing data.
Prerequisites
- You have access to an OpenShift Container Platform account with cluster administrator access.
- You have installed Red Hat OpenShift distributed tracing by following the OpenShift Container Platform "Installing Red Hat OpenShift distributed tracing" documentation.
- You have installed the OpenShift CLI (`oc`).
- You have created a project or have access to a project with the appropriate roles and permissions to create applications and other workloads in OpenShift Container Platform.
Procedure
Create an `OpenTelemetryCollector` custom resource (CR):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: cluster-collector
  namespace: <namespace>
spec:
  mode: deployment
  config: |
    receivers:
      zipkin:
    processors:
    exporters:
      jaeger:
        endpoint: jaeger-all-in-one-inmemory-collector-headless.tracing-system.svc:14250
        tls:
          ca_file: "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"
      logging:
    service:
      pipelines:
        traces:
          receivers: [zipkin]
          processors: []
          exporters: [jaeger, logging]

Verify that you have two pods running in the namespace where Red Hat OpenShift distributed tracing is installed:
$ oc get pods -n <namespace>
You get an output similar to the following example:
NAME READY STATUS RESTARTS AGE cluster-collector-collector-85c766b5c-b5g99 1/1 Running 0 5m56s jaeger-all-in-one-inmemory-ccbc9df4b-ndkl5 2/2 Running 0 15m
Verify that the following headless services have been created:
$ oc get svc -n <namespace> | grep headless
You get an output similar to the following example:
cluster-collector-collector-headless ClusterIP None <none> 9411/TCP 7m28s jaeger-all-in-one-inmemory-collector-headless ClusterIP None <none> 9411/TCP,14250/TCP,14267/TCP,14268/TCP 16m
These services are used to configure Jaeger, Knative Serving, and Knative Eventing. The name of the Jaeger service may vary.
- Install the OpenShift Serverless Operator by following the "Installing the OpenShift Serverless Operator" documentation.
Install Knative Serving by creating the following `KnativeServing` CR:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    tracing:
      backend: "zipkin"
      zipkin-endpoint: "http://cluster-collector-collector-headless.tracing-system.svc:9411/api/v2/spans"
      debug: "false"
      sample-rate: "0.1"

- `sample-rate`: Defines the sampling probability. `sample-rate: "0.1"` means that 1 in 10 traces are sampled.
Install Knative Eventing by creating the following `KnativeEventing` CR:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeEventing
metadata:
  name: knative-eventing
  namespace: knative-eventing
spec:
  config:
    tracing:
      backend: "zipkin"
      zipkin-endpoint: "http://cluster-collector-collector-headless.tracing-system.svc:9411/api/v2/spans"
      debug: "false"
      sample-rate: "0.1"

- `sample-rate`: Defines the sampling probability. `sample-rate: "0.1"` means that 1 in 10 traces are sampled.
Create a Knative service, for example:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/target: "1"
    spec:
      containers:
        - image: quay.io/openshift-knative/helloworld:v1.2
          imagePullPolicy: Always
          resources:
            requests:
              cpu: "200m"
          env:
            - name: TARGET
              value: "Go Sample v1"

Make some requests to the service:
$ curl https://helloworld-go.example.com
Get the URL for the Jaeger web console:
$ oc get route jaeger-all-in-one-inmemory -o jsonpath='{.spec.host}' -n <namespace>

You can now examine traces by using the Jaeger console.
4.3. Using Jaeger distributed tracing
If you do not want to install all of the components of Red Hat OpenShift distributed tracing, you can still use distributed tracing on OpenShift Container Platform with OpenShift Serverless.
4.3.1. Configuring Jaeger to enable distributed tracing
To enable distributed tracing using Jaeger, you must install and configure Jaeger as a standalone integration.
Prerequisites
- You have cluster administrator permissions on OpenShift Container Platform, or you have cluster or dedicated administrator permissions on Red Hat OpenShift Service on AWS or OpenShift Dedicated.
- You have installed the OpenShift Serverless Operator, Knative Serving, and Knative Eventing.
- You have installed the Red Hat OpenShift distributed tracing platform Operator.
- You have installed the OpenShift CLI (`oc`).
- You have created a project or have access to a project with the appropriate roles and permissions to create applications and other workloads.
Procedure
Create and apply a `Jaeger` custom resource (CR):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: default
Enable tracing for Knative Serving by editing the `KnativeServing` CR and adding a YAML configuration for tracing:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    tracing:
      sample-rate: "0.1"
      backend: zipkin
      zipkin-endpoint: "http://jaeger-collector.default.svc.cluster.local:9411/api/v2/spans"
      debug: "false"

- `sample-rate`: Defines the sampling probability. `sample-rate: "0.1"` means that 1 in 10 traces are sampled.
- `backend: zipkin`: You must set `backend` to `zipkin`.
- `zipkin-endpoint`: Must point to your `jaeger-collector` service endpoint. To get this endpoint, substitute the namespace where the Jaeger CR is applied.
- `debug`: Must be set to `"false"`. Enabling debug mode by setting `debug: "true"` allows all spans to be sent to the server, bypassing sampling.
Enable tracing for Knative Eventing by editing the `KnativeEventing` CR:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeEventing
metadata:
  name: knative-eventing
  namespace: knative-eventing
spec:
  config:
    tracing:
      sample-rate: "0.1"
      backend: zipkin
      zipkin-endpoint: "http://jaeger-collector.default.svc.cluster.local:9411/api/v2/spans"
      debug: "false"

- `sample-rate`: Defines the sampling probability. `sample-rate: "0.1"` means that 1 in 10 traces are sampled.
- `backend: zipkin`: You must set `backend` to `zipkin`.
- `zipkin-endpoint`: Must point to your `jaeger-collector` service endpoint. To get this endpoint, substitute the namespace where the Jaeger CR is applied.
- `debug`: Must be set to `"false"`. Enabling debug mode by setting `debug: "true"` allows all spans to be sent to the server, bypassing sampling.
Verification
You can access the Jaeger web console to see tracing data by using the `jaeger` route.
Get the `jaeger` route's hostname by entering the following command:

$ oc get route jaeger -n default
You get an output similar to the following example:
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD jaeger jaeger-default.apps.example.com jaeger-query <all> reencrypt None
- Open the endpoint address in your browser to view the console.