OpenShift Data Foundation (ODF) Ceph Storage Full: Usage Spike due to Failed OADP Snapshots
Environment
- Red Hat OpenShift Container Platform (OCP) 4.x
- Red Hat OpenShift Container Storage (OCS) 4.x
- Red Hat OpenShift Data Foundation (ODF) 4.x
Issue
- ODF Ceph storage capacity is reported as full, or it is unclear why usage is spiking
- fstrim command fails to reclaim space
- Unable to analyze Ceph storage usage effectively
- VolumeSnapshotContents fail to delete
- VolumeSnapshots are stuck in an "Error" state
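As a quick check, snapshots left behind by a failed backup typically report READYTOUSE as false. A minimal sketch, using sample oc get volumesnapshot output in place of a live cluster (namespace and snapshot names are hypothetical):

```shell
# Sketch: flag snapshots that never became ready, using sample
# 'oc get volumesnapshot -A' output (names are hypothetical stand-ins).
cat > vs.txt <<'EOF'
NAMESPACE   NAME            READYTOUSE   SOURCEPVC   AGE
website     velero-data-1   true         data-1      2d
website     velero-data-2   false        data-2      2d
EOF
# Print namespace/name for every snapshot that is still not ready to use
awk 'NR > 1 && $3 == "false" {print $1 "/" $2}' vs.txt
```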
Resolution
- Edit the csiSnapshotTimeout spec in the OADP Backup CR to allow enough time for larger volumes to complete both backup and restore operations.
spec:
csiSnapshotTimeout: 10m0s <----- Change this to a much larger value (~60m or higher) to accommodate larger volumes.
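For example, a Backup CR with a raised timeout might look like the following sketch (the name, namespace, and 60m value are illustrative starting points, not fixed requirements):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: website            # placeholder name
  namespace: openshift-adp
spec:
  csiSnapshotTimeout: 60m0s   # raised from the 10m0s value shown above
  includedNamespaces:
    - website
```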
- Follow the steps outlined in the Finding and removing orphaned RBD images in ODF and OCS 4.x or ODF-4.15 | Listing and cleaning stale cephfs subvolumes solutions to identify and remove the orphaned volumes.
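The linked solutions are authoritative; the underlying idea is to diff the images present in the Ceph pool against the images still referenced by a PV. A minimal sketch with sample data standing in for cluster output (the pool name and jsonpath expression are assumptions):

```shell
# Sketch: find RBD images with no backing PV (orphan candidates).
# On a live cluster the two lists would come from commands like:
#   rbd ls -p ocs-storagecluster-cephblockpool | sort > pool_images.txt
#   oc get pv -o jsonpath='{range .items[*]}{.spec.csi.volumeAttributes.imageName}{"\n"}{end}' | sort > pv_images.txt
# Sample data stands in for that output here:
printf 'csi-vol-aaa\ncsi-vol-bbb\ncsi-vol-ccc\n' | sort > pool_images.txt
printf 'csi-vol-aaa\ncsi-vol-ccc\n' | sort > pv_images.txt
# Lines only in pool_images.txt exist in the pool but are referenced by no PV
comm -23 pool_images.txt pv_images.txt
```

Verify each candidate manually before deleting anything; the solutions above cover the safe removal steps.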
- Finally, for RBD volumes only, execute fstrim on all storage nodes to discard any unneeded blocks by following the steps outlined in the ODF is showing "nearfull osd(s)" warning solution.
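One way to script the fstrim pass across the storage nodes, sketched with a dry-run guard so the loop can be inspected before it touches a cluster (the node names and label selector are assumptions):

```shell
# Sketch: run fstrim on every storage node via 'oc debug' (RBD volumes only).
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }
# On a live cluster: nodes=$(oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -o name)
nodes="node/worker-0 node/worker-1"   # sample stand-ins
for n in $nodes; do
  run oc debug "$n" -- chroot /host fstrim -av
done
```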
Root Cause
When larger volumes are being backed up, the snapshot may not complete before OADP reaches the timeout threshold set by the csiSnapshotTimeout spec. When that happens, the backup fails and the partially created snapshots can be left behind as orphaned volumes, which consume Ceph capacity.
Diagnostic Steps
$ oc describe backup -n <namespace> <backup-name>
Name: website
Namespace: openshift-adp
...output omitted...
Spec:
Csi Snapshot Timeout: 10m0s <-----------------
Default Volumes To Fs Backup: false
Included Namespaces:
website
Included Resources:
imagestreams
buildconfigs
deployments
services
routes
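To pull just the timeout value instead of scanning the full output, the field can be extracted directly. A sketch, with sample YAML standing in for a live Backup object:

```shell
# Sketch: extract csiSnapshotTimeout without reading the whole object.
# On a live cluster:
#   oc get backup -n openshift-adp website -o jsonpath='{.spec.csiSnapshotTimeout}'
# Sample YAML stands in for that output here:
cat > backup.yaml <<'EOF'
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
EOF
awk '$1 == "csiSnapshotTimeout:" {print $2}' backup.yaml
```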
$ ceph -s
health: HEALTH_ERR
2 backfillfull osd(s)
1 full osd(s)
2 nearfull osd(s)
Low space hindering backfill (add storage if this doesn't resolve itself): 27 pgs backfill_toofull
12 pool(s) full
services:
mon: 3 daemons, quorum b,c,e (age 3d)
mgr: a(active, since 8w), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 8 osds: 8 up (since 3d), 8 in (since 8w); 27 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 378 pgs
objects: 956.81k objects, 3.5 TiB
usage: 11 TiB used, 3.2 TiB / 14 TiB avail
pgs: 160181/2870427 objects misplaced (5.580%)
351 active+clean
27 active+remapped+backfill_toofull
io:
client: 1.5 KiB/s rd, 1 op/s rd, 0 op/s wr
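The usage line above can be read as cluster-wide utilization; note that individual OSDs can cross the full ratio (0.95 by default) well before the cluster average does, which is why HEALTH_ERR appears here at roughly 79% overall. A quick parse of that line as a sketch:

```shell
# Sketch: compute cluster-wide percent used from the 'ceph -s' usage line above.
usage_line="usage: 11 TiB used, 3.2 TiB / 14 TiB avail"
used=$(echo "$usage_line" | awk '{print $2}')    # 11 (TiB used)
total=$(echo "$usage_line" | awk '{print $8}')   # 14 (TiB total)
echo "$(( used * 100 / total ))% used"
```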