ODF: MDS pods in CrashLoopBackOff (CLBO) with "EMetaBlob.replay" and "sessionmap" in the traceback


Environment

Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x

Issue

MDS pods are in CrashLoopBackOff (CLBO), and EMetaBlob.replay and sessionmap appear in the traceback.

The Ceph MDS service is in CrashLoopBackOff (CLBO). Before continuing with this article, confirm that the signature of the crash is similar to the example below.

$ oc get pods | grep rook-ceph-mds-ocs
NAME                                                             READY  STATUS   RESTARTS  AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b9bd569fbdkk5  1/2    Running  384       1d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bbccc9d88zs5  1/2    Running  383       1d

$ oc get events
Pod openshift-storage/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bbccc9d88zs5 (mds) is in waiting state (reason: "CrashLoopBackOff")
Pod openshift-storage/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b9bd569fbdkk5 (mds) is in waiting state (reason: "CrashLoopBackOff")

Crash signature:
Note the EMetaBlob.replay sessionmap entry in the traceback:

$ ceph crash ls
$ ceph crash info <crash-id>

debug 2023-01-13 01:22:37.619 7fddbca74700  1 mds.0.175006  waiting for osdmap 47157 (which blacklists prior instance)
debug 2023-01-13 01:22:37.638 7fddb6267700  0 mds.0.cache creating system inode with ino:0x100
debug 2023-01-13 01:22:37.638 7fddb6267700  0 mds.0.cache creating system inode with ino:0x1
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)' thread 7fddb4a64700 time 2023-01-13 01:22:37.721236
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: 1551: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
debug 2023-01-13 01:22:37.719 7fddb4a64700 -1 log_channel(cluster) log [ERR] : EMetaBlob.replay sessionmap v 145397255 - 1 > table 0
 ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7fddc6847308]
 2: (()+0x275522) [0x7fddc6847522]
 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x6b54) [0x55b8999231b4]
 4: (EUpdate::replay(MDSRank*)+0x40) [0x55b899925740]
 5: (MDLog::_replay_thread()+0xbee) [0x55b8998c49ae]
 6: (MDLog::ReplayThread::entry()+0x11) [0x55b8996299c1]
 7: (()+0x817a) [0x7fddc462717a]
 8: (clone()+0x43) [0x7fddc313edc3]
*** Caught signal (Aborted) **
 in thread 7fddb4a64700 thread_name:md_log_replay
debug 2023-01-13 01:22:37.720 7fddb4a64700 -1 /builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)' thread 7fddb4a64700 time 2023-01-13 01:22:37.721236
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: 1551: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
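Before applying the resolution, it is worth confirming that the traceback really contains both markers. The following is an illustrative sketch, not Red Hat tooling: `crash.txt` is a hypothetical file holding saved `ceph crash info <crash-id>` output, and the sample lines are copied from the traceback above.

```shell
#!/usr/bin/env bash
# Unofficial sketch: check whether saved `ceph crash info <crash-id>`
# output matches this article's signature (EMetaBlob.replay sessionmap
# plus the mds_wipe_sessions assert). "crash.txt" is a hypothetical name.
matches_signature() {
  grep -q 'EMetaBlob.replay sessionmap' "$1" \
    && grep -q 'ceph_assert(g_conf()->mds_wipe_sessions)' "$1"
}

# Sample lines copied from the traceback above.
cat > crash.txt <<'EOF'
log_channel(cluster) log [ERR] : EMetaBlob.replay sessionmap v 145397255 - 1 > table 0
/builddir/build/BUILD/ceph-14.2.11/src/mds/journal.cc: 1551: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
EOF

if matches_signature crash.txt; then
  echo "signature match: this article applies"
else
  echo "different crash signature: do not apply mds_wipe_sessions"
fi
```

If the traceback does not contain both markers, the crash is a different issue and `mds_wipe_sessions` should not be applied.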

Resolution

The solution is to stop the MDS daemons to break the restart loop, set mds_wipe_sessions in the Ceph configuration database, and then scale the MDS back up. After the MDS comes up and is stable, the mds_wipe_sessions parameter must be removed to avoid unintended session loss at a later time.

To get started, open one PuTTY/terminal session to run OpenShift administrative commands. Then, using KCS article #4628891, open a second PuTTY/terminal session to the Ceph CLI. Once these two sessions are active, follow these steps.

Step by step:

  1. [OpenShift]: Scale the mds deployments down:
$ oc scale deployment -l app=rook-ceph-mds --replicas 0 -n openshift-storage
  2. [OpenShift]: Ensure the pods have terminated (the command should return no output):
$ oc get pods -n openshift-storage | grep mds
  3. [Ceph CLI]: Set mds_wipe_sessions to true and confirm the config option took effect:
$ ceph config set mds mds_wipe_sessions true
$ ceph config dump | grep mds_wipe
  4. [OpenShift]: Scale the MDS deployments back up:
$ oc scale deployment -l app=rook-ceph-mds --replicas 1 -n openshift-storage
  5. [OpenShift]: Confirm the MDS pods are Running:
$ oc get pods -n openshift-storage | grep mds
  6. [Ceph CLI]: Once the MDS pods come back online, allow a minute to pass and then check the Ceph status:
    (A status of HEALTH_OK and all PGs being active+clean indicates success.)
$ ceph -s
   cluster:
    id:     1803fxxx-Redacted-Cluster-ID-yyy63e87aaba
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 25h)
    mgr: a(active, since 25h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 25h), 3 in (since 25h)

  data:
    pools:   3 pools, 96 pgs
    objects: 153 objects, 375 MiB
    usage:   3.9 GiB used, 3.0 TiB / 3 TiB avail
    pgs:     96 active+clean  ***

  io:
    client:   853 B/s rd, 11 KiB/s wr, 1 op/s rd, 1 op/s wr
  7. [Ceph CLI]: Once ceph status returns HEALTH_OK and all PGs are active+clean, clear mds_wipe_sessions:
$ ceph config rm mds mds_wipe_sessions
$ ceph config dump | grep mds_wipe
(should be no output)
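The sequence above can be collected into a single script for review. This is an unofficial sketch, not Red Hat tooling: the namespace, label, and ordering come from the steps above, and DRY_RUN defaults to 1 so the oc/ceph commands are only printed, not executed, until DRY_RUN=0 is set deliberately on a live cluster. The verification steps (2, 5, and 6) remain manual.

```shell
#!/usr/bin/env bash
# Unofficial sketch of the recovery sequence above. With DRY_RUN=1 (the
# default) the oc/ceph commands are only printed; set DRY_RUN=0 to execute
# them against a real cluster.
set -euo pipefail
NS=openshift-storage
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

# Steps 1-2: stop the MDS pods to break the restart loop
# (then manually confirm `oc get pods -n $NS | grep mds` returns nothing).
run oc scale deployment -l app=rook-ceph-mds --replicas 0 -n "$NS"

# Step 3: allow the MDS to wipe the corrupt session table on replay.
run ceph config set mds mds_wipe_sessions true

# Steps 4-5: bring the MDS back up and confirm the pods are Running.
run oc scale deployment -l app=rook-ceph-mds --replicas 1 -n "$NS"

# Step 6: manually verify `ceph -s` shows HEALTH_OK and all PGs active+clean.

# Step 7: remove the option so sessions are not wiped on future restarts.
run ceph config rm mds mds_wipe_sessions
```

Leaving mds_wipe_sessions set is what step 7 guards against: the option would silently discard client sessions on every subsequent MDS restart.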

Root Cause

If there is a corrupt MDS session entry, both MDS pods will enter CLBO. At MDS startup, one of the initial steps is to reestablish all sessions; because of one or more corrupt client sessions, the MDS crashes during journal replay. Red Hat has observed three scenarios where this issue may occur.

  • The issue may occur post ODF/OCP upgrade.
  • The issue may occur if a Database workload is backed by Ceph FS / Ceph MDS. (See Diagnostic Steps).
  • The issue may occur after resolving the issue detailed in KCS article #6835851.

Diagnostic Steps


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.