Ceph/ODF: CephFS VolumeSnapshot creation (and restore) takes a very long time to complete.
Environment
Red Hat OpenShift Container Platform (OCP) 4.14+
Red Hat OpenShift Data Foundation (ODF) 4.14+
Red Hat Ceph Storage (RHCS) 6+
Issue
- CephFS VolumeSnapshot creation (and restore) takes a very long time to complete.
- CephFS clone (from PVC) or CephFS snapshot restore (from VolumeSnapshot) operations are left in Pending status.
Resolution
This solution focuses only on implementing changes which prevent this issue from occurring.
To identify and remove orphaned CephFS subvolumes, follow this solution: Removing Orphaned Cephfs Subvolumes and Snapshots in ODF.
- Ensure that backup software (OADP, Velero, Kasten, Commvault, etc.) has its timeout for Clone/Snapshot creation set to a high value.
- Current Subvolume clones are full copies of a snapshot and take an extremely long time to create (hours).
- Fast Clones will be Copy-on-Write based and should be supported in Ceph 9.2.
- This is not a promise - stay in touch with your local Red Hatters, who can query Development on your behalf.
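As a hedged illustration of raising such a timeout, recent Velero releases expose a per-Backup CSI snapshot timeout field (spec.csiSnapshotTimeout; field name per upstream Velero, so verify it against the version your backup product ships). The namespace values below are examples only:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app-backup            # hypothetical Backup name
  namespace: openshift-adp
spec:
  includedNamespaces:
    - my-app                  # hypothetical application namespace
  # Raise the CSI snapshot timeout well above the Velero default (10m),
  # since CephFS clones are full copies and can take hours:
  csiSnapshotTimeout: 4h
```

Other backup products expose an equivalent job or snapshot timeout in their own configuration; consult that product's documentation.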
- Run the command below [2] to see how many CephFS Clones are Pending.
- For a Clone operation which is stuck, see ODF Clone operations stuck in 'pending'.
- That KCS Solution covers Clone operations which are stuck indefinitely.
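For scripting around the output of [2], the JSON returned by `ceph fs clone status` can be parsed with jq. The sketch below runs against a canned sample rather than a live cluster; the `state` field and its `pending`/`in-progress`/`complete` values match recent Ceph releases, but verify on your version:

```shell
# Canned sample standing in for:
#   ceph fs clone status ocs-storagecluster-cephfilesystem <clone-name> csi --format json
status_json='{"status": {"state": "pending"}}'

# Extract the clone state: "pending" means the clone is queued behind the
# max_concurrent_clones limit, "in-progress" means it is actively copying.
state=$(echo "$status_json" | jq -r '.status.state')
echo "clone state: $state"
```

On a live cluster, substitute the real command shown in the comment for the canned sample.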
- Ensure that Concurrent Clones are set to a value which makes sense for the workload.
- The default value for Concurrent Clones is 4.
- If 6 Clones Creation tasks are running, 4 will execute and 2 will queue.
- This will only cause even more delays for certain backup jobs that are stuck in queue.
- Increase this parameter in increments of 4 and observe how backups perform; do not jump straight to a large number.
- See steps below to change this parameter. [1]
- Ensure the Ceph OSDs are tuned properly.
- See How to tune Ceph OSDs using mClock
- Also ensure the CephFS Data Pool has enough PGs.
- Engage with Red Hat Tech Support as needed with either of these tuning suggestions.
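To inspect the current PG count, `ceph osd pool get <pool> pg_num` can be queried in JSON format and parsed. This sketch runs against a canned sample; the pool name is the ODF default and the pg_num value is illustrative, so adjust both for your cluster:

```shell
# Canned sample standing in for:
#   ceph osd pool get ocs-storagecluster-cephfilesystem-data0 pg_num --format json
pg_json='{"pool": "ocs-storagecluster-cephfilesystem-data0", "pg_num": 32}'

# Extract the current PG count; compare it against the autoscaler's
# recommendation (`ceph osd pool autoscale-status`) before changing it.
pg_num=$(echo "$pg_json" | jq -r '.pg_num')
echo "pg_num: $pg_num"
```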
[1]
$ ceph config dump | egrep "^WHO|clone"
WHO MASK LEVEL OPTION VALUE
{no output means this parameter is at the default value of 4}
$ ceph config set mgr mgr/volumes/max_concurrent_clones 8 ## Again, go up by 4 so 8,12,16,20
Trust, but verify:
$ ceph config dump | egrep "^WHO|clone"
WHO MASK LEVEL OPTION VALUE
mgr advanced mgr/volumes/max_concurrent_clones 8
[2]
sh-4.4$ for i in `ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi --format json | jq '.[] | .name' | cut -f 2 -d '"'`; do echo "Subvolume : $i"; ceph fs clone status ocs-storagecluster-cephfilesystem $i csi; done
Root Cause
There can be several contributing factors; the most common are the following:
- Current Subvolume clones are full copies of a snapshot and take an extremely long time to create (hours).
- The number of concurrent clones is too small for the backup workload.
- The OSD performance is low.
- The number of PGs for the CephFS Pools is too low.
Diagnostic Steps
- In the logs of the csi-cephfsplugin-provisioner pod we can find errors like:
% oc logs csi-cephfsplugin-provisioner-597ffb4f96-w7b59
E0109 13:36:29.045691 1 utils.go:200] ID: 6685603 Req-ID: 0001-0011-openshift-storage-0000000000000001-f2989966-8549-11ed-bca5-0a580a81040c GRPC error: rpc error: code = FailedPrecondition desc = snapshot 0001-0011-openshift-storage-0000000000000001-f2989966-8549-11ed-bca5-0a580a81040c has pending clones

or:

failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is pending
- We may also find OCP events like:
Normal   PVCReconciled         4m                     VolumeSnapshotBackup-Controller                                            performed created on PVC snapcontent-xxxx-xxxx-pvc
Warning  ProvisioningFailed    2m52s (x8 over 3m59s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-xxxxx  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is pending
Normal   Provisioning          108s (x9 over 4m)      openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-xxxxx  External provisioner is provisioning volume for claim "openshift-adp/snapcontent-xxxx-xxxx-pvc"
Warning  ProvisioningFailed    108s                   openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-xxxxx  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is already in progress
Normal   ExternalProvisioning  14s (x17 over 4m)      persistentvolume-controller                                                waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
Unfortunately, the events listed above look like errors at first glance. They are in fact perfectly normal and do not indicate that a clone has failed to provision, but rather that the clone is still in the process of being created. This is seen only for CephFS clones, because CephFS clones are full copies and take time.
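One way to confirm this state from the Ceph side is the snapshot's has_pending_clones flag reported by `ceph fs subvolume snapshot info` (present in recent Ceph releases; verify on your version). The sketch below parses a canned sample of that output:

```shell
# Canned sample standing in for:
#   ceph fs subvolume snapshot info ocs-storagecluster-cephfilesystem <subvolume> <snapshot> csi
snap_info='{"created_at": "2023-01-09 13:36:29", "has_pending_clones": "yes"}'

# "yes" here means clones are still being materialized from this snapshot,
# which is exactly the in-progress situation the events above describe.
pending=$(echo "$snap_info" | jq -r '.has_pending_clones')
echo "has_pending_clones: $pending"
```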
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.