Ceph/ODF: MDS crashing (CLBO); the crash backtrace shows "_unlink" or "_unlink_local"
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Ceph File System (CephFS)
Issue
MDS crashing (CLBO); the crash backtrace shows "_unlink" or "_unlink_local".
The Ceph Metadata Server (MDS) daemon crashes frequently, and "unlink" appears in the backtrace of the crash:
/builddir/build/BUILD/ceph-14.2.11/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7f7848acc700 time 2022-02-23 08:48:31.877094
/builddir/build/BUILD/ceph-14.2.11/src/mds/Server.cc: 7023: FAILED ceph_assert(in->first <= straydn->first)
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f78578a9308]
2: (()+0x275522) [0x7f78578a9522]
3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0xfbc) [0x558e8d56654c]
4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xd4c) [0x558e8d56b73c]
5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xaab) [0x558e8d58122b]
6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x402) [0x558e8d5819a2]
7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12a) [0x558e8d58e44a]
8: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0xa94) [0x558e8d4f7344]
9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x80f) [0x558e8d4f975f]
10: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x16) [0x558e8d4f9d66]
11: (MDSContext::complete(int)+0x7f) [0x558e8d79b5df]
12: (MDSRank::_advance_queues()+0xac) [0x558e8d4f86ec]
13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1ed) [0x558e8d4f913d]
14: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x16) [0x558e8d4f9d66]
15: (MDSContext::complete(int)+0x7f) [0x558e8d79b5df]
16: (MDSRank::_advance_queues()+0xac) [0x558e8d4f86ec]
17: (MDSRank::ProgressThread::entry()+0x45) [0x558e8d4f8e25]
18: (()+0x817a) [0x7f785568917a]
19: (clone()+0x43) [0x7f78541a0dc3]
debug 2022-02-23 08:48:31.877 7f7848acc700 -1 /builddir/build/BUILD/ceph-14.2.11/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7f7848acc700 time 2022-02-23 08:48:31.877094
/builddir/build/BUILD/ceph-14.2.11/src/mds/Server.cc: 7023: FAILED ceph_assert(in->first <= straydn->first)
To view current crashes, use the command ceph crash ls. To view more information about a specific crash, use ceph crash info <crashid>. For additional information, see the Ceph Crash Module documentation at docs.ceph.com.
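The crash-inspection step above can be sketched as a small filter over the ceph crash ls output that keeps only MDS crashes. Since no live cluster is available here, sample output is inlined (the crash IDs and the two-column layout are illustrative assumptions, not real cluster data):

```shell
#!/bin/sh
# Sample 'ceph crash ls' output (ID and entity columns; the IDs shown
# here are made up for illustration).
crashes='2022-02-23T08:48:31.877094Z_9c4f1a2b  mds.ocs-storagecluster-cephfilesystem-a
2022-02-22T10:11:12.000000Z_1d2e3f4a  osd.3'

# Keep only MDS crash IDs; on a live cluster each ID could then be
# passed to 'ceph crash info <crashid>' to inspect the backtrace.
mds_ids=$(echo "$crashes" | awk '$2 ~ /^mds\./ {print $1}')
echo "$mds_ids"
```

On a live cluster the same filter can be applied directly to the real command output, for example `ceph crash ls | awk '$2 ~ /^mds\./ {print $1}'`.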
Resolution
- To avoid further MDS corruption, rename the _deleting folder to a temporary name. See the Diagnostic Steps section for details.
- After the MDS has stabilized, Red Hat Engineering should be consulted to determine how to resolve the corruption.
Root Cause
While the MDS is removing deleted CSI volume data, it encounters MDS metadata corruption, causing the MDS pods to enter CrashLoopBackOff (CLBO).
Diagnostic Steps
- Scale down the Ceph-MGR deployment. Because CephFS volume deletions are processed through the Ceph-MGR daemons, the Ceph-MDS should stabilize within a few seconds.
$ oc scale deployment -n openshift-storage -l app=rook-ceph-mgr --replicas=0
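Before proceeding, it is worth confirming that the MGR pods have actually terminated. A minimal sketch of such a wait loop is shown below; the get_mgr_pods function is a stub standing in for `oc get pods -l app=rook-ceph-mgr --no-headers` so the sketch runs without a cluster:

```shell
#!/bin/sh
# Stub standing in for 'oc get pods -l app=rook-ceph-mgr --no-headers';
# it simulates the MGR pod disappearing after the second poll.
get_mgr_pods() {
    if [ "${POLL:-0}" -ge 2 ]; then
        :  # no pods left
    else
        echo "rook-ceph-mgr-a-5f6d8  1/1  Terminating  0  3d"
    fi
}

# Poll until no MGR pods remain.
POLL=0
while [ -n "$(get_mgr_pods)" ]; do
    POLL=$((POLL + 1))
    # sleep 5   # real polling interval on a live cluster
done
echo "mgr pods gone after $POLL polls"
```

On a real cluster, replace the stub with the actual oc command and keep the sleep in the loop body.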
- Debug into an OpenShift worker node and follow "Mount the ODF CephFS volume on an OpenShift Worker Node" to mount the CephFS volume.
- Navigate to the mount point where the CephFS volume is mounted and list its contents:
sh-5.1# cd /mnt/cephfs/
sh-5.1# ls -l
-rwxr-xr-x. 1 root root 0 Nov 22 10:17 _csi:csi-vol-d9400293-6a4e-11ed-9693-0a580a02181f.meta
-rwxr-xr-x. 1 root root 0 Nov 28 14:20 _csi:csi-vol-d97b6574-6f27-11ed-9693-0a580a02181f.meta
-rwxr-xr-x. 1 root root 0 Jan 19 12:15 _csi:csi-vol-f87eb94a-97f2-11ed-92d4-0a580a021821.meta
-rwxr-xr-x. 1 root root 0 Jan 19 12:15 _csi:csi-vol-fba1de4a-97f2-11ed-92d4-0a580a021821.meta
drwx------. 2 root root 1 Jan 19 12:23 _deleting
drwx------. 3 root root 1 Sep 16 13:25 _index
drwx------. 2 root root 0 Jan 19 09:48 _legacy
drwxr-xr-x. 44 root root 42 Jan 19 13:07 csi
- As a temporary workaround, rename the _deleting folder to _deleting.tmp:
sh-5.1# mv _deleting _deleting.tmp
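The workaround itself is a single mv on the mounted volume. The sketch below performs the same rename against a scratch directory so it can run anywhere; on the cluster the path would instead be the CephFS mount point (/mnt/cephfs in the listing above):

```shell
#!/bin/sh
# Stand-in for the CephFS mount point; on the cluster this would be
# /mnt/cephfs from the worker-node debug session.
mnt=$(mktemp -d)
mkdir "$mnt/_deleting"

# The workaround: move _deleting out of the way so the MDS no longer
# hits the corrupted entries while purging deleted volume data.
mv "$mnt/_deleting" "$mnt/_deleting.tmp"
ls "$mnt"
```

Note that this only sidesteps the purge of the corrupted entries; the renamed folder still holds the data and must be dealt with in consultation with Engineering, as stated in the Resolution.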
- Scale the Ceph-MGR deployment back up:
$ oc scale deployment -n openshift-storage -l app=rook-ceph-mgr --replicas=1
- Verify that the Ceph-MDS pods are in a Running state:
$ oc get pods -n openshift-storage -l app=rook-ceph-mds
NAME READY STATUS RESTARTS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69db9946vqcmm 2/2 Running 0 13d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7f6bb8c9z5s8k 2/2 Running 0 13d
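The Running check above can also be automated by flagging any pod whose STATUS column differs from Running. Sample output from the step above is inlined so the sketch runs without a cluster; on a live system the real `oc get pods` output would be piped in instead:

```shell
#!/bin/sh
# Sample 'oc get pods -n openshift-storage -l app=rook-ceph-mds' output
# (taken from the listing above, minus the header line).
pods='rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69db9946vqcmm  2/2  Running  0  13d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7f6bb8c9z5s8k  2/2  Running  0  13d'

# Collect the names of any pods that are not Running.
bad=$(echo "$pods" | awk '$3 != "Running" {print $1}')
if [ -z "$bad" ]; then
    echo "all MDS pods Running"
else
    echo "not Running: $bad"
fi
```

If any pod is still in CrashLoopBackOff at this point, revisit the earlier steps before scaling anything else.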
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.