ODF: MDS pods in CrashLoopBackOff (CLBO) with "EMetaBlob.replay" and "sessionmap" in the traceback.
Environment
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Issue
MDS pods are in CrashLoopBackOff (CLBO) with EMetaBlob.replay and sessionmap in the traceback.
The Ceph MDS service is in CrashLoopBackOff (CLBO). To continue with this article, the signature of the crash must be similar to the example below.
$ oc get pods | grep rook-ceph-mds-ocs
NAME READY STATUS RESTARTS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b9bd569fbdkk5 1/2 Running 384 1d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bbccc9d88zs5 1/2 Running 383 1d
$ oc get events
Pod openshift-storage/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bbccc9d88zs5 (mds) is in waiting state (reason: "CrashLoopBackOff")
Pod openshift-storage/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b9bd569fbdkk5 (mds) is in waiting state (reason: "CrashLoopBackOff")
Crash signature:
Note the EMetaBlob.replay and sessionmap strings in the traceback:
$ ceph crash ls
$ ceph crash info <crash-id>
debug 2023-01-13 01:22:37.619 7fddbca74700 1 mds.0.175006 waiting for osdmap 47157 (which blacklists prior instance)
debug 2023-01-13 01:22:37.638 7fddb6267700 0 mds.0.cache creating system inode with ino:0x100
debug 2023-01-13 01:22:37.638 7fddb6267700 0 mds.0.cache creating system inode with ino:0x1
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)' thread 7fddb4a64700 time 2023-01-13 01:22:37.721236
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: 1551: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
debug 2023-01-13 01:22:37.719 7fddb4a64700 -1 log_channel(cluster) log [ERR] : EMetaBlob.replay sessionmap v 145397255 - 1 > table 0
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7fddc6847308]
2: (()+0x275522) [0x7fddc6847522]
3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x6b54) [0x55b8999231b4]
4: (EUpdate::replay(MDSRank*)+0x40) [0x55b899925740]
5: (MDLog::_replay_thread()+0xbee) [0x55b8998c49ae]
6: (MDLog::ReplayThread::entry()+0x11) [0x55b8996299c1]
7: (()+0x817a) [0x7fddc462717a]
8: (clone()+0x43) [0x7fddc313edc3]
*** Caught signal (Aborted) **
in thread 7fddb4a64700 thread_name:md_log_replay
debug 2023-01-13 01:22:37.720 7fddb4a64700 -1 /builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)' thread 7fddb4a64700 time 2023-01-13 01:22:37.721236
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: 1551: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
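Whether a recorded crash matches this article's signature can be checked with a small helper. This is a sketch: the function name is illustrative, and it assumes plain-text output from ceph crash info as shown above.

```shell
# matches_signature: succeed if a crash log contains both markers from this
# article's signature (function name is illustrative).
matches_signature() {
  grep -q 'EMetaBlob::replay' <<<"$1" && grep -q 'mds_wipe_sessions' <<<"$1"
}

# Example usage from the Ceph CLI session (uncomment on a live cluster):
# for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
#   matches_signature "$(ceph crash info "$id")" && echo "matching crash: $id"
# done
```

If neither marker appears in the crash info, this article does not apply and mds_wipe_sessions should not be set.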
Resolution
The solution is to stop the MDS daemons to break the restart loop and apply mds_wipe_sessions to the Ceph configuration database. After the MDS comes up and is stable, the mds_wipe_sessions parameter must be removed to avoid unintended consequences later.
To get started, open one terminal session to run OpenShift administrative commands. Then, using KCS article #4628891, open a second terminal session to the Ceph CLI. Once both sessions are active, follow these steps.
Step by step:
- [OpenShift]: Scale the mds deployments down:
$ oc scale deployment -l app=rook-ceph-mds --replicas 0 -n openshift-storage
- [OpenShift]: Ensure pods have been terminated (shouldn't return any output):
$ oc get pods -n openshift-storage | grep mds
- [Ceph CLI]: Set mds_wipe_sessions to true and make sure the config option took:
$ ceph config set mds mds_wipe_sessions true
$ ceph config dump | grep mds_wipe
- [OpenShift]: Scale the mds deployments back up:
$ oc scale deployment -l app=rook-ceph-mds --replicas 1 -n openshift-storage
- [OpenShift]: Confirm the mds pods are Running:
$ oc get pods -n openshift-storage | grep mds
- [Ceph CLI]: Once the MDS pods come back online, allow a minute to pass and then check the Ceph Status:
(A status of HEALTH_OK and all PGs being active+clean indicates success)
$ ceph -s
cluster:
id: 1803fxxx-Redacted-Cluster-ID-yyy63e87aaba
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 25h)
mgr: a(active, since 25h)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 25h), 3 in (since 25h)
data:
pools: 3 pools, 96 pgs
objects: 153 objects, 375 MiB
usage: 3.9 GiB used, 3.0 TiB / 3 TiB avail
pgs: 96 active+clean
io:
client: 853 B/s rd, 11 KiB/s wr, 1 op/s rd, 1 op/s wr
- [Ceph CLI]: Once ceph status returns HEALTH_OK and all PGs are active+clean, clear mds_wipe_sessions:
$ ceph config rm mds mds_wipe_sessions
$ ceph config dump | grep mds_wipe
(should be no output)
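The two wait conditions in the steps above (mds pods fully terminated, cluster healthy again) can be scripted. The sketch below uses illustrative function names and timeouts and assumes plain-text oc and ceph -s output as shown above.

```shell
# wait_gone: poll CMD until its output no longer matches PATTERN, up to
# TRIES one-second attempts (names and the timeout are illustrative).
wait_gone() {
  local cmd="$1" pattern="$2" tries="${3:-60}"
  while eval "$cmd" 2>/dev/null | grep -q "$pattern"; do
    tries=$((tries - 1))
    [ "$tries" -le 0 ] && return 1
    sleep 1
  done
}

# cluster_ready: succeed when plain-text `ceph -s` output shows HEALTH_OK
# and a pgs line that is entirely active+clean, as in the example above.
cluster_ready() {
  grep -q 'health: HEALTH_OK' <<<"$1" &&
    grep -Eq 'pgs:[[:space:]]+[0-9]+ active\+clean' <<<"$1"
}

# Usage on a live cluster (uncomment):
# wait_gone "oc get pods -n openshift-storage" mds || echo "timed out"
# cluster_ready "$(ceph -s)" && ceph config rm mds mds_wipe_sessions
```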
Root Cause
If there is a corrupt MDS session entry, both MDS pods will enter CLBO. One of the first steps at MDS startup is to replay the journal and reestablish all client sessions; one or more corrupt client sessions cause the MDS to crash during replay. Red Hat has observed three scenarios where this issue may occur.
- The issue may occur post ODF/OCP upgrade.
- The issue may occur if a database workload is backed by CephFS / Ceph MDS. (See Diagnostic Steps.)
- The issue may occur after resolving the issue detailed in KCS article #6835851.
Diagnostic Steps
- See the OpenShift Container Platform documentation: Post-installation configuration / Storage configuration.
- In the subsection Other specific application storage recommendations, note this statement:
- Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block storage.
- When Ceph is the backing storage, use Ceph RBD devices for database workloads, not CephFS volumes.
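As a hedged illustration, a database PVC on ODF would request the RBD-backed storage class rather than the CephFS one. The storage class name below is the ODF default, and the PVC name and namespace are illustrative; verify the class names on your cluster with oc get storageclass before use.

```yaml
# Sketch of a PVC for a database workload on ODF (names are illustrative).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data            # illustrative name
  namespace: my-database   # illustrative namespace
spec:
  accessModes:
    - ReadWriteOnce        # block storage: single-node access
  resources:
    requests:
      storage: 50Gi
  storageClassName: ocs-storagecluster-ceph-rbd  # RBD, not the cephfs class
```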
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.