OpenShift Data Foundation (ODF) Ceph Storage Full: Usage Spike due to Failed OADP Snapshots


Environment

  • Red Hat OpenShift Container Platform (OCP) 4.x
  • Red Hat OpenShift Container Storage (OCS) 4.x
  • Red Hat OpenShift Data Foundation (ODF) 4.x

Issue

  • ODF Ceph storage capacity is reported as full, or usage is spiking with no apparent cause
  • fstrim command fails to reclaim space
  • Unable to analyze Ceph storage usage effectively
  • VolumeSnapshotContents Failing to Delete
  • VolumeSnapshots in an "Error" state

Resolution

  1. Edit the csiSnapshotTimeout spec in the OADP Backup CR to allow enough time for larger volumes to succeed in both backup and restore operations.
spec:
  csiSnapshotTimeout: 10m0s <----- Change this to a much larger value (~60m or higher) to accommodate larger volumes.
  2. Follow the steps outlined in the Finding and removing orphaned RBD images in ODF and OCS 4.x or ODF-4.15 | Listing and cleaning stale cephfs subvolumes solution to recover space from the orphaned volumes.
  3. Finally, for RBD volumes only, run fstrim on all storage nodes to discard any unneeded blocks by following the steps outlined in the ODF is showing "nearfull osd(s)" warning solution.
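As an illustration of the steps above, a sketch is shown below. The Schedule name `website` and the storage node label are assumptions for this example; Backup CRs are immutable once created, so the larger timeout applies to the Schedule template (or to newly created Backups):

```shell
# Raise the CSI snapshot timeout on the backup Schedule so future backups of
# larger volumes have time to complete (name is hypothetical).
oc -n openshift-adp patch schedule website --type merge \
  -p '{"spec":{"template":{"csiSnapshotTimeout":"60m0s"}}}'

# After orphaned volumes are cleaned up, discard unused blocks on each ODF
# storage node (RBD-backed volumes only).
for node in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -o name); do
  oc debug "$node" -- chroot /host fstrim -av
done
```

The fstrim loop only reclaims space for RBD (block) volumes; CephFS subvolume cleanup is covered by the linked solution instead.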

Root Cause

When larger volumes are being backed up or snapshotted, the operation may not complete before OADP reaches the timeout threshold set by the csiSnapshotTimeout spec. The backup then fails, which can leave orphaned volumes behind that continue to consume Ceph capacity.

Diagnostic Steps

$ oc describe backup -n <namespace> <backup-name>

Name:         website
Namespace:    openshift-adp
...output omitted...
Spec:
  Csi Snapshot Timeout: 10m0s <----------------- 
  Default Volumes To Fs Backup:  false
  Included Namespaces:
    website
  Included Resources:
    imagestreams
    buildconfigs
    deployments
    services
    routes
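To find snapshots left behind by timed-out backups, a sweep like the following can help (a sketch; the `readyToUse` field is part of the standard VolumeSnapshot status, but output shape may vary by snapshot controller version):

```shell
# VolumeSnapshots that never became ready, often stuck after a CSI timeout
oc get volumesnapshot -A \
  -o jsonpath='{range .items[?(@.status.readyToUse==false)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

# VolumeSnapshotContents, to check for entries stuck in deletion or error
oc get volumesnapshotcontent -o wide
```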

$ ceph -s

    health: HEALTH_ERR
            2 backfillfull osd(s)
            1 full osd(s)
            2 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 27 pgs backfill_toofull
            12 pool(s) full
 
  services:
    mon: 3 daemons, quorum b,c,e (age 3d)
    mgr: a(active, since 8w), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 8 osds: 8 up (since 3d), 8 in (since 8w); 27 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 378 pgs
    objects: 956.81k objects, 3.5 TiB
    usage:   11 TiB used, 3.2 TiB / 14 TiB avail
    pgs:     160181/2870427 objects misplaced (5.580%)
             351 active+clean
             27  active+remapped+backfill_toofull
 
  io:
    client:   1.5 KiB/s rd, 1 op/s rd, 0 op/s wr
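To break the usage down further, the Ceph commands can be run from the rook-ceph toolbox pod, if deployed. The pool name below is the ODF default and is an assumption for this sketch:

```shell
# Locate the toolbox pod (assumes the rook-ceph-tools deployment is enabled)
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -1)

# Per-pool usage breakdown
oc -n openshift-storage exec "$TOOLS" -- ceph df

# RBD images in the default block pool, to compare against PV volume handles
oc -n openshift-storage exec "$TOOLS" -- rbd ls -p ocs-storagecluster-cephblockpool

# Volume handles currently known to Kubernetes
oc get pv -o jsonpath='{range .items[*]}{.spec.csi.volumeHandle}{"\n"}{end}'
```

RBD images that do not correspond to any PV volume handle are candidates for the orphaned-image cleanup described in the linked solution.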

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.