OCS / ODF Database Workloads Must Not Use CephFS PVs/PVCs (RDBMSs, NoSQL, PostgreSQL, Mongo DBs, etc.)

Solution Verified - Updated

Environment

  • Red Hat OpenShift Data Foundation (RHODF) Version 4.x
  • Red Hat OpenShift Container Storage (RHOCS) Version 4.x

Issue

CephFS is intended to serve as a distributed filesystem for certain workloads. Many issues have been reported by customers running database applications on PVs backed by CephFS.

In some scenarios, administrators were also taking snapshots of CephFS PVs running databases. This is a corner case that is difficult to hit, but when it occurs, it can have severe impact (a complete CephFS service outage).

The impact includes damage to the metadata of files used by the database application (only file metadata is damaged, not the data itself). This can cause both Ceph MDS pods to crash when the database application deletes a file with damaged metadata. In that scenario, Ceph can no longer serve any I/O for CephFS volumes, causing loss of data access to CephFS (at least one MDS (Metadata Server) pod must be up and running to serve CephFS I/O).

Resolution

It is not supported to use CephFS PVs with database application workloads.

Instead, use ocs-storagecluster-ceph-rbd PVs, which support both block and filesystem volume modes.
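As an illustration, the following sketch builds a minimal PVC manifest requesting an RBD-backed volume from the ocs-storagecluster-ceph-rbd storage class. It is printed as JSON, which oc/kubectl also accepts; the PVC name, namespace, and requested size are hypothetical examples, not values from this solution.

```python
import json

# Minimal PersistentVolumeClaim manifest for an RBD-backed volume.
# Only the storage class name comes from this article; the PVC name,
# namespace, and size below are hypothetical.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "postgres-data", "namespace": "my-db"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],   # RBD volumes are typically RWO
        "volumeMode": "Filesystem",         # or "Block" for a raw device
        "storageClassName": "ocs-storagecluster-ceph-rbd",
        "resources": {"requests": {"storage": "50Gi"}},
    },
}

# Emit the manifest; it could be applied with: oc apply -f pvc.json
print(json.dumps(pvc, indent=2))
```

Saving this output to a file and applying it with oc would create the claim; the key point is that the storage class, not the application, decides whether the volume is CephFS- or RBD-backed.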

If you are already using CephFS PVs for database application workloads, migrate the data to an ocs-storagecluster-ceph-rbd PV. The following document will help: How to migrate data between PVs in OpenShift 4.

In the scenario above, the database was already configured with a mount point (that is, it uses a filesystem), so migrate to a ceph-rbd volume with volumeMode=filesystem. With volumeMode=block you would instead be creating a raw block device, and you would have to use database/application tooling to dump and restore the data onto that raw block device.
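The practical difference between the two volume modes shows up in the pod spec that consumes the claim. The sketch below contrasts them: with volumeMode=Filesystem the container mounts the volume at a path (volumeMounts), while with volumeMode=Block it receives a raw device node (volumeDevices). Container names, image, and paths are hypothetical.

```python
import json

# With volumeMode=Filesystem the container gets a mount point:
filesystem_container = {
    "name": "db",
    "image": "registry.example.com/postgres:latest",
    "volumeMounts": [{"name": "data", "mountPath": "/var/lib/pgsql/data"}],
}

# With volumeMode=Block the container gets a raw device node instead,
# and the database must be able to write to it directly:
block_container = {
    "name": "db",
    "image": "registry.example.com/postgres:latest",
    "volumeDevices": [{"name": "data", "devicePath": "/dev/xvda"}],
}

print(json.dumps({"filesystem": filesystem_container,
                  "block": block_container}, indent=2))
```

Because the database in this solution already expects a filesystem mount point, the volumeMounts variant is the one that applies to the migration described above.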

In the case of damaged files (remember, only the metadata is damaged), the files can still be read or copied from the problematic CephFS PV. This is safe: the read action will not trigger an MDS pod crash (only a delete operation on a damaged file triggers the crash).

Root Cause

The actual root cause is still under investigation, and it is not yet fully clear what triggers the metadata corruption. Our storage engineering team is actively working to identify and correct the root cause.

This issue has not been seen so far with other applications, or when the database application uses an ocs-storagecluster-ceph-rbd PV.

Here are some details about this issue:

  • This is a very low-probability event wherein internal CephFS metadata gets corrupted. Despite our best efforts it’s never been replicated in a controlled environment and the only commonality we have identified is that it occurs exclusively with database application log files. We consider it unlikely (but not impossible) that taking snapshots or increasing the load on the system contributes to this situation.

  • The corruption does not immediately cause the MDS crash. Rather, when the log file is unlinked (i.e., deleted), the corrupted metadata is consulted and detected as invalid, which breaks the system.

  • Specifically, every inode in the system records the snapshot range it is a part of, by means of a “first” and “last” member. This corruption involved the “first” snapid being set to a nonsensical value — frequently to the value of another inode in the system. So we believe there’s invalid memory access happening somewhere we can’t find, but only under a very specific workload that database applications somehow trigger with very low probability.

  • Patches are in progress that will prevent the MDS from persisting the corrupted metadata to durable storage. Instead, it will detect the invalid value, crash, and log data that we hope will let us track down the circumstances leading to this bug. More importantly for impacted users, this means the MDS will restart and I/O will continue: because this is a low-probability race of some kind, it will not recur when the operation is retried, and corrupted metadata will not be persisted in a way that breaks the system.

Related internal bugs

  • Bug 2071592 - [GSS][ceph-fs] Fail to mount any cephfs volume : 1 filesystem is offline - data access recovery

  • Bug 2111352 - cephfs: tooling to identify inode (metadata) corruption

  • Bug 2162312 - [GSS][ODF 4.10.9] mds pods in constant CrashLoopBackOff with FAILED ceph_assert - /src/mds/Server.cc: In function 'void Server::_unlink_local


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.