Prometheus pods in CrashLoopBackOff with error 'opening storage failed invalid block sequence'

Solution Verified - Updated

Environment

  • OpenShift Container Platform (OCP)
    • v3.11

Issue

  • The prometheus container in the prometheus-k8s pods is in an error state, and the pods are in CrashLoopBackOff:
prometheus-k8s-0                              3/4       CrashLoopBackOff   977        4d
prometheus-k8s-1                              3/4       CrashLoopBackOff   984        4d

Resolution

  • The error shown in the pod logs occurs when NFS is used as the backend storage for the Prometheus pods.
  • NFS storage is not supported by the upstream Prometheus project, according to the prometheus.io documentation, so corruption like this can occur.
  • For this particular issue, try deleting one of the overlapping blocks. This removes at most a 2h range of data, which is minor given the 15-day retention. In the worst case, the whole Prometheus storage has to be wiped.
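
To decide which block to delete, it helps to list the blocks whose time ranges overlap. The following is a minimal sketch, not a supported tool. It assumes the Prometheus data directory (for example, the NFS export) is readable from a host with Python 3, that each block directory contains a meta.json file with "ulid", "minTime" and "maxTime" fields, and that a block's range is half-open (maxTime is exclusive). The function names load_blocks and find_overlaps are hypothetical.

```python
import json
from pathlib import Path


def load_blocks(data_dir):
    """Read (ulid, minTime, maxTime) from every <data_dir>/<block>/meta.json."""
    blocks = []
    for meta_path in sorted(Path(data_dir).glob("*/meta.json")):
        with open(meta_path) as fh:
            meta = json.load(fh)
        blocks.append((meta["ulid"], meta["minTime"], meta["maxTime"]))
    return blocks


def find_overlaps(blocks):
    """Return pairs of ULIDs whose [minTime, maxTime) ranges intersect.

    Assumption: maxTime is exclusive, so two blocks that only share a
    boundary timestamp are not reported as overlapping.
    """
    ordered = sorted(blocks, key=lambda b: (b[1], b[2]))
    overlaps = []
    for i, (ulid_a, _, maxt_a) in enumerate(ordered):
        for ulid_b, mint_b, _ in ordered[i + 1:]:
            if mint_b >= maxt_a:
                # Sorted by minTime: no later block can overlap ulid_a.
                break
            overlaps.append((ulid_a, ulid_b))
    return overlaps
```

Calling find_overlaps(load_blocks(path)) with the path of the data directory returns the ULID pairs to review. The sketch only reads meta.json files and does not delete anything; removing a block directory remains a manual step, ideally after taking a backup.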

Diagnostic Steps

  • The logs from the prometheus container show the following messages and errors:
  # oc logs prometheus-k8s-0 -c prometheus -n openshift-monitoring

    [..]
  caller=repair.go:39 component=tsdb msg="found healthy block" mint=XXXXX maxt=XXXXX ulid=XXXXX  
  caller=main.go:596 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: XXX, maxt: XXX, range: XhXmXs, blocks: XXX]: <ulid: XXX, mint: XXX, maxt: XXX, range: XhXmXs>...
    [..]
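
The error line itself names the overlapping blocks. As a rough sketch, the ULIDs can be pulled out of a saved log line with a regular expression that follows the `<ulid: ..., mint: ..., maxt: ..., range: ...>` layout shown above. The function name overlapping_ulids is hypothetical, and the pattern assumes the message layout matches the excerpt in this article.

```python
import re

# Matches one "<ulid: X, mint: N, maxt: N, range: ...>" entry of the error message.
BLOCK_RE = re.compile(
    r"<ulid: (?P<ulid>[^,>]+), mint: (?P<mint>-?\d+), "
    r"maxt: (?P<maxt>-?\d+), range: [^>]*>"
)


def overlapping_ulids(log_line):
    """Return the block ULIDs listed in an 'invalid block sequence' error line."""
    return [m.group("ulid") for m in BLOCK_RE.finditer(log_line)]
```

Each returned ULID corresponds to a directory of the same name in the Prometheus data directory.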
