How to collect cluster Prometheus metrics in Red Hat OpenShift Container Platform

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3.11
    • 4
  • Prometheus
  • Metrics

Issue

  • How to collect cluster Prometheus metrics in Red Hat OpenShift Container Platform.
  • How to provide specific time range from a Prometheus DB.

Resolution

Complete collection

  • One option for capturing cluster Prometheus metrics is the following script. Once the data capture is complete, please share the output with Red Hat Technical Support via a support case.

      cat <<'EOF' > prometheus-metrics.sh
      #!/usr/bin/env bash
    
      function queue() {
        local TARGET="${1}"
        shift
        local LIVE
        LIVE="$(jobs | wc -l)"
        while [[ "${LIVE}" -ge 45 ]]; do
          sleep 1
          LIVE="$(jobs | wc -l)"
        done
        echo "${@}"
        if [[ -n "${FILTER:-}" ]]; then
          "${@}" | "${FILTER}" >"${TARGET}" &
        else
          "${@}" >"${TARGET}" &
        fi
      }
    
      ARTIFACT_DIR=$PWD
      mkdir -p $ARTIFACT_DIR/metrics
      echo "Snapshotting prometheus (may take 15s) ..."
      queue ${ARTIFACT_DIR}/metrics/prometheus.tar.gz oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus .
      FILTER=gzip queue ${ARTIFACT_DIR}/metrics/prometheus-target-metadata.json.gz oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c "curl -G http://localhost:9090/api/v1/targets/metadata --data-urlencode 'match_target={instance!=\"\"}'"
      wait
      EOF
      bash prometheus-metrics.sh
    

The script above may not produce a usable archive: the generated tar file can fail to extract because files are copied while Prometheus is actively writing to them. In that case, the collection method below is a better option.
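Before attaching the archive to a case, it is worth confirming that it lists cleanly, since a snapshot taken while Prometheus is compacting can produce an unreadable tarball. A minimal sketch (the `check_archive` name is an assumption, not part of the script above):

```shell
# check_archive FILE - verify that a gzipped tarball lists cleanly
# before uploading it to a support case.
check_archive() {
  if tar -tzf "$1" > /dev/null 2>&1; then
    echo "archive OK: $1"
  else
    echo "archive is corrupt - re-run the capture: $1" >&2
    return 1
  fi
}

# e.g. check_archive "${ARTIFACT_DIR}/metrics/prometheus.tar.gz"
```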

Partial collection

  • To collect only a specific time range from your Prometheus database, use the following commands to list the block ULIDs, copy the directories, and then compress the folders.
  1. Get the list of time block chunks you wish to review:
  $ oc exec -n openshift-monitoring prometheus-k8s-0 -- promtool tsdb list -r /prometheus

  BLOCK ULID                  MIN TIME                       MAX TIME                       DURATION      NUM SAMPLES  NUM CHUNKS   NUM SERIES   SIZE
  01GGQV2KWQ7DX0RAHWZZFPCNTM  2022-10-31 17:17:39 +0000 UTC  2022-10-31 18:00:00 +0000 UTC  42m20.389s    15306022     215664       215457       46MiB747KiB476B
  01GGRRZ7B4VMK4PGENCKC642FK  2022-10-31 18:00:00 +0000 UTC  2022-11-01 00:00:00 +0000 UTC  5h59m59.811s  133853179    1122268      195484       161MiB1016KiB281B
  01GGSDJDEDKAVW84H85123N5DW  2022-11-01 00:00:00 +0000 UTC  2022-11-01 06:00:00 +0000 UTC  5h59m59.811s  135702215    1140143      199963       166MiB899KiB888B
  01GGT25KJ6HQ0SK622J57PNZ4M  2022-11-01 06:00:00 +0000 UTC  2022-11-01 12:00:00 +0000 UTC  5h59m59.811s  135950916    1147756      230258       190MiB936KiB527B
  01GGTPRRYWQXTNE8C4D3KD40WF  2022-11-01 12:00:00 +0000 UTC  2022-11-01 18:00:00 +0000 UTC  5h59m59.811s  135618858    1169963      224173       178MiB160KiB467B
  01GGVBBYM6R3MV21Q84KCAAMMH  2022-11-01 18:00:00 +0000 UTC  2022-11-02 00:00:00 +0000 UTC  5h59m59.811s  139185809    1165518      201794       172MiB206KiB895B
  01GGVZZ4CDD2FN548AZ5QWHYSF  2022-11-02 00:00:00 +0000 UTC  2022-11-02 06:00:00 +0000 UTC  5h59m59.811s  139566486    1171364      204422       169MiB81KiB250B
  01GGWMJBTHJEH4VQBC2361RTEH  2022-11-02 06:00:00 +0000 UTC  2022-11-02 12:00:00 +0000 UTC  5h59m59.811s  140284758    1175662      203168       172MiB559KiB886B
  2. From the previous output, update the script below (or execute its lines sequentially) to pull the block and chunk data for analysis. You must first set the "blocks" variable to match your output.
  #!/bin/bash
  ## IMPORTANT: set the block ULIDs to be copied first in a space-delimited list (depends upon your output above)

  blocks="01GGQV2KWQ7DX0RAHWZZFPCNTM 01GGRRZ7B4VMK4PGENCKC642FK"

  CAPTUREDIR=./data
  mkdir -p $CAPTUREDIR

  for i in $(echo $blocks); do mkdir -p $CAPTUREDIR/$i/chunks; done

  for i in $(echo $blocks); do for file in index meta.json tombstones; do oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- cat /prometheus/$i/$file >  $CAPTUREDIR/$i/$file; done; done

  for i in $(echo $blocks); do oc cp -n  openshift-monitoring prometheus-k8s-0:/prometheus/$i -c prometheus  $CAPTUREDIR/$i; done

  oc cp -n openshift-monitoring -c prometheus prometheus-k8s-0:chunks_head $CAPTUREDIR/chunks_head

  # Note: capturing the WAL segments may report "file changed as we read it" - this error can be ignored
  oc cp -n openshift-monitoring -c prometheus prometheus-k8s-0:wal $CAPTUREDIR/wal

  oc cp -n openshift-monitoring -c prometheus prometheus-k8s-0:queries.active $CAPTUREDIR/queries.active

  tar -zcvf prometheus-db.tar.gz ${CAPTUREDIR}
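Rather than copying ULIDs by hand from the `promtool tsdb list` output, a small helper can select them by date. A sketch (the `ulids_since` name and the column positions are assumptions based on the sample output above):

```shell
# ulids_since DATE - print the BLOCK ULIDs whose MIN TIME falls on or
# after DATE (YYYY-MM-DD), reading `promtool tsdb list -r` output on stdin.
# ISO dates compare correctly as plain strings, so awk needs no date math;
# the regex on $2 skips the header line and any blank lines.
ulids_since() {
  awk -v since="$1" '$2 ~ /^20[0-9][0-9]-/ && $2 >= since { print $1 }'
}

# e.g. oc exec -n openshift-monitoring prometheus-k8s-0 -- \
#   promtool tsdb list -r /prometheus | ulids_since 2022-11-01
```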

The resulting data directory should contain the following files:

~/Downloads/data $ tree
.
├── 01K94F3NNVR4YG19FB8E1YDZ7K # chunk directory for specific time selection
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01K94NZZ6M1194YNR4KBFCFF27
│   ├── chunks
│   │   ├── 000001
│   │   └── 000002
│   ├── index
│   ├── meta.json
│   └── tombstones
├── chunks_head
├── queries.active
└── wal
    └── 00000000

Ensure that the index, meta.json, and tombstones files are present for each block directory; without them the block cannot be parsed, and you may need to go back and pull those files explicitly out of the pod. The chunks_head, queries.active, and wal entries are also required to parse the chunk blocks.
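That completeness check can be scripted against the local data directory. A sketch (the `check_blocks` name is an assumption):

```shell
# check_blocks DIR - confirm every block directory under DIR carries the
# index, meta.json, and tombstones files Prometheus needs to parse it.
check_blocks() {
  local dir="$1" rc=0 block f
  for block in "$dir"/*/; do
    [ -d "${block}chunks" ] || continue   # skip wal/, chunks_head/, etc.
    for f in index meta.json tombstones; do
      [ -f "${block}${f}" ] || { echo "missing: ${block}${f}" >&2; rc=1; }
    done
  done
  return "$rc"
}

# e.g. check_blocks ./data
```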

Cluster Observability Operator

To collect a Prometheus dump from the Cluster Observability Operator, use the following commands. Note that the Cluster Observability Operator Prometheus instance pods do not contain the tar binary, so the files are copied individually with cat:

$ datadir=./data
$ mkdir -p $datadir

## set the block ULIDs to be copied:
$ blocks="01J0T747YMVTQHT5AK122TD9K8 01J0T08GPMBY81J1SHHRYSS0GF 01J0SSCSEKP4RRR5MK9DQ9S103"

$ for i in $(echo $blocks); do mkdir -p $datadir/$i/chunks; done

$ for i in $(echo $blocks); do for file in index meta.json tombstones; do oc exec -n $NAMESPACE prometheus-coo-monitoring-stack-0 -c prometheus -- cat /prometheus/$i/$file >  $datadir/$i/$file; done; done

$ for i in $(echo $blocks); do oc exec -n $NAMESPACE prometheus-coo-monitoring-stack-0 -c prometheus -- cat /prometheus/$i/chunks/000001 >  $datadir/$i/chunks/000001; done

$ tar zcvf prometheus-db.tar.gz ${datadir}
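The loop above copies only `chunks/000001`; larger blocks can hold further segments (`000002`, ...). A sketch of a helper that copies every segment, assuming the same pod name and `$NAMESPACE` as in the steps above (`copy_all_chunks` is a hypothetical name):

```shell
# copy_all_chunks NAMESPACE POD BLOCK DESTDIR - copy every chunk segment
# of one block out of the Prometheus pod with `oc exec ... cat`, since
# the pod lacks the tar binary needed by `oc cp`.
copy_all_chunks() {
  local ns="$1" pod="$2" block="$3" dest="$4" seg
  mkdir -p "$dest/$block/chunks"
  for seg in $(oc exec -n "$ns" "$pod" -c prometheus -- ls "/prometheus/$block/chunks"); do
    oc exec -n "$ns" "$pod" -c prometheus -- cat "/prometheus/$block/chunks/$seg" \
      > "$dest/$block/chunks/$seg"
  done
}

# e.g. for i in $blocks; do
#   copy_all_chunks "$NAMESPACE" prometheus-coo-monitoring-stack-0 "$i" "$datadir"
# done
```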

Opening Prometheus data for offline review with podman

This can be accomplished using podman and a simple launch script:

$ cat prometheus_viewer.sh

#!/bin/bash
#prometheus launcher 
#provided as-is with no warranty for use in troubleshooting or analysis of prometheus data bundles.
#
#script assumes you have podman installed, firefox is available, and that the data bundle is expanded at ~/Downloads/data.
#script also assumes you have a pull-secret available to reference to acquire the correct runtime image.
#
#Running this script will prompt you first to confirm you have completed the requisite step of un-compressing 
#the tarball of the promql output into your downloads directory, and then will prompt for the exact clusterversion
#after which it will kick-start a container and open a web browser session to the locally running prometheus instance.

##-----Script start-----##
echo "this script assumes you have already expanded a tarball of the prometheus-db.tar.gz file to the directory: ~/Downloads/data"
echo "press return to continue or ctrl + c to abort and do that first"
read emptyvar1

#ensure we match the same version of prometheus from openshift that was used to generate the data
echo "insert clusterversion for the bundle you are reviewing"
read OCP_VERSION
IMG="$(oc adm release info --image-for=prometheus ${OCP_VERSION})"

#load pull-secret (can be obtained from https://console.redhat.com/openshift/downloads)
PULL_SECRET=<path-to-pull-secret.txt-here>

### Retrieve the image URL and run the container (here using `-ti` option to run it in foreground mode)
### The script below will assume that our data directory is in ~/Downloads and load $PWD/data after going there. 
### You may need to adjust where the data directory is, and where the script takes you to execute below if different.
### The script will also try to launch firefox; change this to chrome, edge, or your preferred browser executable,
### or just open your web browser at that page.

#navigate to ~/Downloads so we can reference $PWD/data
cd ~/Downloads/

#open firefox so the page is available and load the container as a session in this shell in the foreground.
echo "launching firefox in a new window and starting the container below - press ctrl+c to stop the process and remove the container when finished"

firefox http://localhost:9090/ &
podman run --rm --authfile=${PULL_SECRET} -it -u $(id -u):$(id -g)  -p 9090:9090 -v $PWD/data:/data:U,Z $IMG --storage.tsdb.path=/data --storage.tsdb.retention.time=999d --config.file=/dev/null

NOTE: The script requires a pull secret - edit the script above to include the local path to this file on your machine before execution.

Usage:

  1. Copy this script to a local machine where podman and a browser such as firefox are available, then un-compress the Prometheus tarball under ~/Downloads (this should create the folder ~/Downloads/data).
  2. Modify the script to include a path to the local pull-secret file where the script can reference it.
  3. Modify the script to select a different browser if not using firefox, or to point to the location of the expanded data folder if different.
  4. Execute the script, pressing return to acknowledge the first message, then enter the exact clusterversion (4.18.24 for example) to pull that image version.
  5. Open your browser to http://localhost:9090 if it does not launch automatically, and wait for the container to finish setup (you may need to refresh the page once or twice before it populates).
  6. Query a wide time range (7d or so) with a very generic query in graph view, for example: sum(kube_node_status_condition{condition="Ready", status="true"}==1) #number of ready nodes in the cluster.
  7. Highlight the region of the graph where data appears to narrow the view to the exported time segments, then run the queries you want with the correctly scoped view.
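Instead of refreshing the page blindly while the container starts (step 5), the standard Prometheus /-/ready endpoint can be polled first. A sketch (the `wait_ready` helper and retry count are assumptions):

```shell
# wait_ready URL [TRIES] - poll a Prometheus readiness endpoint until it
# answers, retrying once per second up to TRIES times (default 30).
wait_ready() {
  local url="$1" tries="${2:-30}" i=0
  until curl -fsS "$url" > /dev/null 2>&1; do
    i=$((i+1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
  echo "ready: $url"
}

# e.g. wait_ready http://localhost:9090/-/ready
```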

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.