Ceph - rbd image cannot be deleted with "rbd: error: image still has watchers"


Environment

  • Red Hat Ceph Storage (RHCS)
    • 4
    • 5

Issue

Deleting an rbd image fails with the following error:

# rbd rm rbdtest/testimage1
2021-02-25 03:30:32.543 7f9be77fe700 -1 librbd::image::PreRemoveRequest: 0x56251d7efa10 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

Resolution

"rbd: error: image still has watchers" means there is still a process on a client node that is using this image.

  • If it is possible to identify the process that is using the image, use the following steps:

    1. Get the watcher details:

      # POOL=<pool>
      # IMAGE_NAME=<image_name>
      # rbd status ${POOL}/${IMAGE_NAME}
      Watchers:
          watcher=10.0.0.10:0/2445590874 client.4788 cookie=18446462598732840961
      
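The watcher address from step 1 is needed again in later commands. A minimal sketch for pulling it out of the 'rbd status' output automatically - assuming the 'Watchers:' line format shown above; the function name 'extract_watcher' is illustrative, not part of the rbd CLI:

```shell
# Hypothetical helper: print the "IP:nonce" address of the first watcher
# found in 'rbd status' output read from stdin.
extract_watcher() {
    awk '/watcher=/ { sub(/^watcher=/, "", $1); print $1; exit }'
}

# Example with the sample output from step 1:
sample='Watchers:
    watcher=10.0.0.10:0/2445590874 client.4788 cookie=18446462598732840961'
printf '%s\n' "$sample" | extract_watcher   # prints 10.0.0.10:0/2445590874
```

On a live cluster this could be used as `rbd status ${POOL}/${IMAGE_NAME} | extract_watcher`.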
    2. On the client '10.0.0.10', check whether the image is kernel-mapped, i.e. mapped with the rbd map command:

      # lsblk | grep -e rbd -e NAME
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
      rbd0   251:0    0    1G  0 disk 
      
    3. If needed, unmount the filesystem and then unmap the image, either by device or by pool/image specification:

      # RBD=rbd<ID>
      # rbd unmap /dev/${RBD}
      # lsblk | grep -e rbd -e NAME
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT

      # rbd unmap ${POOL}/${IMAGE_NAME}
      # lsblk | grep -e rbd -e NAME
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
      
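When several images are mapped on the client, the device belonging to the image can be looked up from 'rbd device list' (alias 'rbd showmapped') output before unmapping. A hedged sketch - the column layout varies between releases, so this matches on the image name and prints the last column (the device); 'devices_for_image' is an illustrative name:

```shell
# Hypothetical helper: print the /dev/rbdX device(s) for a given image,
# reading 'rbd device list' output from stdin (header line is skipped).
devices_for_image() {
    awk -v img="$1" 'NR > 1 && $0 ~ img { print $NF }'
}

# Example with sample 'rbd device list' output:
sample='id pool    image      snap device
0  rbdtest testimage1 -    /dev/rbd0'
printf '%s\n' "$sample" | devices_for_image testimage1   # prints /dev/rbd0
```

On the client this could be combined with step 3, e.g. `rbd device list | devices_for_image ${IMAGE_NAME}` and then `rbd unmap` on each printed device.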
    4. Now there should be no watchers, and the rbd image can be removed:

      # rbd status ${POOL}/${IMAGE_NAME}
      Watchers: none
      # rbd rm ${POOL}/${IMAGE_NAME}
      Removing image: 100% complete...done.
      

      Note:
      The client process can often be identified by listing the qemu processes - an example from an OpenStack environment:

              # ps -ef | grep qemu
      
              # sudo rbd status volumes/volume-2c799d1b-512f-45cc-bda2-2d054ad9975f
              Watchers:
                  watcher=192.0.0.11:0/1065190154 client.110477 cookie=93911546382976
      
              - on node '192.0.0.11' - in this case the compute-0 node:
              # ps -ef | grep qemu | grep volume-2c799d1b
              qemu      604262   74428  9 09:30 ?        00:02:44 /usr/libexec/qemu-kvm -name guest=instance-0000001f,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-10-instance-0000001f/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,pku=on,stibp=on,ssbd=on,ibpb=on -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 6eefa7e9-684b-4bbe-8d6e-fe1b6a2b384d -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=17.0.13-30.el7ost,serial=aebfb72f-3ce2-4758-aac1-5cb0f98db701,uuid=6eefa7e9-684b-4bbe-8d6e-fe1b6a2b384d,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=42,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=zrvDmuxIk5twh6GYBsF7wH5ECygP6IaTwcg/DhC4tqs=,keyid=masterKey0,iv=0qVp/k/L5t5BSbHgMDtmmw==,format=base64 -drive file=rbd:volumes/volume-2c799d1b-512f-45cc-bda2-2d054ad9975f:id=openstack:auth_supported=cephx\;none:mon_host=192.168.0.21\:6789\;192.168.0.65\:6789\;192.168.0.98\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,serial=2c799d1b-512f-45cc-bda2-2d054ad9975f,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,rx_queue_size=512,host_mtu=1450,netdev=hostnet0,id=net0,mac=fa:16:3e:0a:e8:43,bus=pci.0,addr=0x3 -add-fd set=3,fd=48 -chardev pty,id=charserial0,logfile=/dev/fdset/3,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 192.0.0.64:4 -k en-us -device 
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
      
  • If it is not possible to identify the process, the client node can be blacklisted, BUT this will block client I/O from that node for all rbd images and any other Ceph client operations on that node.
    So use the following steps with caution:

    1. Blacklist the watcher from 'rbd status' output:

      # ceph osd blacklist add 10.0.0.10:0/2445590874
      
    2. Check for the watchers, if none, remove the image:

      # POOL=<pool>
      # IMAGE_NAME=<image_name>
      # rbd status ${POOL}/${IMAGE_NAME}
      Watchers: none
      # rbd rm ${POOL}/${IMAGE_NAME}
      Removing image: 100% complete...done.
      
    3. Remove the blacklist record, so the client can join the Ceph cluster again:

      # ceph osd blacklist rm 10.0.0.10:0/2445590874
      
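The three blacklist steps above can be sketched as one reviewable sequence. This is a minimal sketch, not part of the rbd/ceph tooling: the 'run' wrapper only echoes each command while DRY_RUN=1 (the default here), so the sequence can be inspected before it touches the cluster; the pool, image, and watcher values are the examples from this article:

```shell
# Echo commands in dry-run mode; execute them only when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

POOL=rbdtest
IMAGE_NAME=testimage1
WATCHER=10.0.0.10:0/2445590874    # taken from 'rbd status' output

run ceph osd blacklist add "$WATCHER"
run rbd status "$POOL/$IMAGE_NAME"    # expect: Watchers: none
run rbd rm "$POOL/$IMAGE_NAME"
run ceph osd blacklist rm "$WATCHER"
```

Running it with DRY_RUN=1 prints each command prefixed with '+'; setting DRY_RUN=0 executes them for real.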

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.