ODF Clone operations stuck in 'pending'

Solution Verified - Updated

Environment

  • OCS/ODF 4.x

Issue

  • ODF clone operations (create, delete, etc.) remain stuck indefinitely in a 'pending' state

Resolution

  1. Get the mon IPs:
    When using monitor port 6789:

    # oc get cm/rook-ceph-csi-config  -o=go-template='{{index .data "csi-cluster-config-json"}}' | jq -r '.[0].monitors | join(",")'
    172.30.185.38:6789,172.30.101.171:6789,172.30.227.154:6789
    

When using the secure monitor port 3300:

```
# oc get cm/rook-ceph-csi-config  -o=go-template='{{index .data "csi-cluster-config-json"}}' | jq -r '.[0].monitors | join(",")'
172.30.185.38:3300,172.30.101.171:3300,172.30.227.154:3300
```
  2. Get the admin key:

    # oc get secret/rook-ceph-mon -o=go-template='{{index .data "ceph-secret" | base64decode }}'
    AQCIL/FjGrzbAhAAyNu5VnRhSB1PJ426jf7lEQ==
    
  3. Access a worker node terminal, run chroot /host, and create a temporary mount point:

    # mkdir /mnt/ceph-cleanup
    
  4. Then use the values obtained in steps 1 and 2 to mount the /volumes file system:
    When using monitor port 6789:

    # mount -t ceph <mon-ips>:/ /mnt/ceph-cleanup -o name=admin,secret='<secret>'
    

When using the secure monitor port 3300:

```
# mount -t ceph <mon-ips>:/ /mnt/ceph-cleanup -o name=admin,secret='<secret>',ms_mode=secure
```
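The mount command can also be assembled in the shell from the values gathered in steps 1 and 2; a minimal sketch (the MONS and ADMIN_KEY values below are placeholders copied from the example output, not live cluster data, and the command is printed rather than executed so it can be reviewed first):

```shell
#!/bin/sh
# Assemble the cephfs mount command from the mon IPs (step 1) and admin key (step 2).
# Placeholder values -- substitute the output of the oc commands above.
MONS="172.30.185.38:6789,172.30.101.171:6789,172.30.227.154:6789"
ADMIN_KEY="AQCIL/FjGrzbAhAAyNu5VnRhSB1PJ426jf7lEQ=="

OPTS="name=admin,secret=${ADMIN_KEY}"
# Append ms_mode=secure when the monitors listen on the secure port 3300:
case $MONS in
  *:3300*) OPTS="${OPTS},ms_mode=secure" ;;
esac

# Print the command instead of running it, so it can be inspected first.
echo mount -t ceph "${MONS}:/" /mnt/ceph-cleanup -o "$OPTS"
```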
  5. List the contents of /mnt/ceph-cleanup/volumes/_index/clone and remove the symlinks whose targets no longer exist (please note that the volume identifiers will differ in your environment):

    # ls -l /mnt/ceph-cleanup/volumes/_index/clone
    8809690a-7b48-4833-a21a-9cee46e47390
    8809690a-7b48-4833-a21a-9cee46e47391
    8809690a-7b48-4833-a21a-9cee46e47393
    21936f6a-3ab5-47c8-a57f-aaf1ef02d77b
    

    For all of the above volumes, proceed with the following steps:

    • Check whether the symlink points to an existing clone in the volumes directory. The output of ls -l shows the path the symlink points to (the path after ->). Because the cephfs filesystem is mounted at /mnt/ceph-cleanup, check for the existence of the target directory with /mnt/ceph-cleanup prefixed:

           # ls -l /mnt/ceph-cleanup/volumes/_index/clone/8809690a-7b48-4833-a21a-9cee46e47390
           lrwxrwxrwx 1 root root 31 Jul 16 15:16 /mnt/ceph-cleanup/volumes/_index/clone/8809690a-7b48-4833-a21a-9cee46e47390 -> /volumes/_nogroup/clone_sub_0_0
      
           # stat /mnt/ceph-cleanup/volumes/_nogroup/clone_sub_0_0
           stat: cannot statx '/mnt/ceph-cleanup/volumes/_nogroup/clone_sub_0_0': No such file or directory
      

      Only if you see the No such file or directory error shown above should you proceed to delete the symlink:

        # rm -vf /mnt/ceph-cleanup/volumes/_index/clone/8809690a-7b48-4833-a21a-9cee46e47390
        removed '/mnt/ceph-cleanup/volumes/_index/clone/8809690a-7b48-4833-a21a-9cee46e47390'
      
    • If the target of the symlink does exist, stat will print output similar to the following; in that case, do not delete the symlink:

           # stat /mnt/ceph-cleanup/volumes/_nogroup/clone_sub_0_0
             File: /mnt/ceph-cleanup/volumes/_nogroup/clone_sub_0_0
             Size: 2         	Blocks: 0          IO Block: 65536  directory
           Device: 99h/153d	Inode: 1099511629313  Links: 3
           Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
           Context: system_u:object_r:cephfs_t:s0
           Access: 2023-09-14 12:07:13.643604709 -0400
           Modify: 2023-09-14 12:07:18.264679949 -0400
           Change: 2023-09-14 12:07:18.264679949 -0400
            Birth: 2023-09-14 12:07:13.643604709 -0400
      

      When the clone directory still exists, use the ceph fs clone status and ceph fs clone cancel commands to cancel the clone operation. If the cancellation succeeds, remove the subvolume with the ceph fs subvolume rm command and the --force parameter.
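The per-symlink check above can be scripted so that only the dangling entries are listed before anything is removed; a minimal sketch (the find_dangling helper and the throwaway demo directory are illustrative, not part of ODF):

```shell
#!/bin/sh
# Print the names of dangling symlinks (targets missing) in a directory.
# Usage: find_dangling <clone-index-dir> <filesystem-mount-prefix>
find_dangling() {
  dir=$1; prefix=$2
  for link in "$dir"/*; do
    [ -L "$link" ] || continue
    target=$(readlink "$link")          # e.g. /volumes/_nogroup/clone_sub_0_0
    if [ ! -e "$prefix$target" ]; then  # dangling: the clone directory is gone
      basename "$link"
    fi
  done
}

# Demo with a throwaway directory standing in for /mnt/ceph-cleanup.
demo=$(mktemp -d)
mkdir -p "$demo/volumes/_index/clone" "$demo/volumes/_nogroup/clone_sub_0_0"
ln -s /volumes/_nogroup/clone_sub_0_0 "$demo/volumes/_index/clone/alive"
ln -s /volumes/_nogroup/clone_sub_0_1 "$demo/volumes/_index/clone/dead"
find_dangling "$demo/volumes/_index/clone" "$demo"   # prints: dead
```

Against the real mount this would be run as find_dangling /mnt/ceph-cleanup/volumes/_index/clone /mnt/ceph-cleanup; review the output before removing any symlink.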

Root Cause

The clone fails because the base path of the clone no longer exists. This happens when the in-progress clone is deleted forcibly.

The clone operation tracks ongoing clones with a symlink to the destination clone path in the /volumes/_index/clone directory; once a clone completes, the symlink is removed. In this case, the clone subvolume was removed forcibly while the clone was still in progress, so the removal deleted the clone but left the symlink behind.
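The resulting state can be reproduced with plain filesystem operations in a throwaway directory (the paths only mimic the CephFS layout; no Ceph is involved):

```shell
#!/bin/sh
# Recreate the failure state: the _index/clone symlink survives while the
# clone subvolume it points to has been force-removed.
demo=$(mktemp -d)
mkdir -p "$demo/volumes/_index/clone" "$demo/volumes/_nogroup/clone_sub_0_0"

# The clone operation records the in-progress clone as a symlink...
ln -s "$demo/volumes/_nogroup/clone_sub_0_0" \
      "$demo/volumes/_index/clone/8809690a-7b48-4833-a21a-9cee46e47390"

# ...and the forced removal deletes the clone directory but not the symlink.
rmdir "$demo/volumes/_nogroup/clone_sub_0_0"

# Result: a dangling symlink, which keeps the clone operation 'pending'.
ls -l "$demo/volumes/_index/clone"
```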

Diagnostic Steps

  • Describe the volumesnapshot's associated volumesnapshotcontent Custom Resource:
$ oc describe volumesnapshotcontent/<volumesnapshot-content> | grep -i warning
  Warning  SnapshotDeleteError  2m45s (x58 over 3h11m)  csi-snapshotter openshift-storage.cephfs.csi.ceph.com  Failed to delete snapshot
  Warning  SnapshotDeleteError  80s (x614 over 37h)  csi-snapshotter openshift-storage.cephfs.csi.ceph.com  Failed to delete snapshot
  Warning  SnapshotDeleteError  2m40s (x998 over 2d13h)  csi-snapshotter openshift-storage.cephfs.csi.ceph.com  Failed to delete snapshot
  Warning  SnapshotDeleteError  2m39s (x1190 over 3d1h)  csi-snapshotter openshift-storage.cephfs.csi.ceph.com  Failed to delete snapshot
  Warning  SnapshotDeleteError  2m39s (x1576 over 4d2h)  csi-snapshotter openshift-storage.cephfs.csi.ceph.com  Failed to delete snapshot
  • Get the csi-vol-< suffix-id > subvolume name of the PVC that is being cloned:
# oc get pv -o 'custom-columns=NAME:.spec.claimRef.name,PVNAME:.metadata.name,STORAGECLASS:.spec.storageClassName,IMAGENAME:.spec.csi.volumeAttributes.subvolumeName' | grep <pvc-name>
  • Get info on the csi-vol-< suffix-id > snapshots (to be run from the rook-ceph-tools pod):
sh-4.4$ ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem csi-vol-613a3c90-b22c-11ea-8f58-0a580ae0101a csi
[
    {
        "name": "csi-snap-09f2c07f-aeca-11ed-b704-0a580ae0160c"
    },
    {
        "name": "csi-snap-2991ff43-af30-11ed-b704-0a580ae0160c"
    }
]
  • Inspect the snapshot info in the rook-ceph-tools pod:
$ ceph fs subvolume snapshot info  ocs-storagecluster-cephfilesystem csi-vol-613a3c90-b22c-11ea-8f58-0a580ae0101a csi-snap-a34f2252-b18b-11ed-b704-0a580ae0160c csi
{
    "created_at": "2023-02-21 02:01:21.696892",
    "data_pool": "ocs-storagecluster-cephfilesystem-data0",
    "has_pending_clones": "yes",         <<<---**
    "size": 1636134356447

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.