Cassandra pod failing with corruption errors in OCP 3
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3.x
Issue
- OpenShift metrics is not working.
- The Cassandra and Hawkular pods are throwing errors.
- Cassandra reports errors like the following:
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/cassandra_data/data/hawkular_metrics/data_compressed-[HASH]/mc-113-big-Data.db): corruption detected, chunk at 1708577 of length 21043.
- Cassandra fails with a CorruptSSTableException:
INFO 03:47:25 Opening /cassandra_data/data/hawkular_metrics/data-RANDOM-STRING/FILENAME (841306 bytes)
ERROR 03:47:25 Exiting forcefully due to file system exception on startup, disk failure policy "stop"
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.UTFDataFormatException: malformed input around byte 3
	at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:125) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:86) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:142) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:101) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:186) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:178) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:701) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:662) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:456) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:361) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at org.apache.cassandra.io.sstable.format.SSTableReader$4.run(SSTableReader.java:499) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.io.UTFDataFormatException: malformed input around byte 3
	at java.io.DataInputStream.readUTF(DataInputStream.java:634) ~[na:1.8.0_121]
	at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.8.0_121]
	at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:101) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
	... 15 common frames omitted
Resolution
This is an issue within Cassandra itself, so try the following steps.
Preparation
- Switch to the openshift-infra project:
$ oc project openshift-infra
- Scale down all the replication controllers:
$ oc scale $(oc get rc -o name) --replicas=0
- Wait until the pods are scaled down.
- Open a debug pod for the Cassandra replication controller so that you have full access to both the Cassandra tools and the persistent volume (PV):
$ oc debug rc/hawkular-cassandra-1
Once you are in the debug pod shell, there are two possible solutions. Try Solution 1 first; if it is not enough, continue with Solution 2.
Solution 1 - sstablescrub
This first solution uses Cassandra's sstablescrub command. It attempts to rebuild the SSTable and discards any data that cannot be recovered. For the pod to start, the corrupt file must be either repaired or removed.
-
In the debug pod, run the following command, based on the error messages that indicate which data is corrupted. The arguments are the keyspace and the table name (the data directory name without its trailing -[HASH] suffix):
> sstablescrub hawkular_metrics data_compressed
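The keyspace and table arguments can be read mechanically from the corrupted file path in the log. A minimal sketch, assuming the usual path layout /cassandra_data/data/&lt;keyspace&gt;/&lt;table&gt;-&lt;hash&gt;/&lt;file&gt; (the hash abc123 below is a hypothetical stand-in for the real value from your log):

```shell
# Derive the keyspace and table for sstablescrub from the corrupted path.
# Assumed layout: /cassandra_data/data/<keyspace>/<table>-<hash>/<file>
path="/cassandra_data/data/hawkular_metrics/data_compressed-abc123/mc-113-big-Data.db"
keyspace=$(echo "$path" | awk -F/ '{print $(NF-2)}')
table=$(echo "$path" | awk -F/ '{print $(NF-1)}' | sed 's/-[^-]*$//')
echo "sstablescrub $keyspace $table"
```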
Solution 2 - remove corrupted sstables files
If the above did not correct the issue, you should next try to manually delete the corrupted data.
-
In the debug pod, remove the corrupted file. See the Diagnostic Steps section to find the corrupted file name. The command looks something like the following:
> rm /cassandra_data/data/hawkular_metrics/data_compressed-[HASH]/mc-[NUMBER]-big-Data.db
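Note that an SSTable consists of several component files sharing the same generation prefix (Data.db, Index.db, Statistics.db, and so on), so removing only the Data file leaves orphaned components behind. A hedged sketch that removes the whole generation; the helper function and review-then-delete flow are this article's suggestion, not a Cassandra tool:

```shell
# Remove every component file of one corrupted SSTable generation
# (e.g. mc-113-big-Data.db, mc-113-big-Index.db, mc-113-big-Statistics.db).
remove_sstable_generation() {
  local dir="$1" gen="$2"
  ls "$dir/$gen"-*    # list the component files before deleting
  rm "$dir/$gen"-*
}
# Usage with the placeholders from the log output:
# remove_sstable_generation /cassandra_data/data/hawkular_metrics/data_compressed-[HASH] mc-113-big
```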
Afterwards
After applying either solution, scale the pods back up:
$ oc scale $(oc get rc -o name) --replicas=1
Root Cause
Cassandra data files can become corrupted by a forced termination, such as a power-off or a forced shutdown (kill) issued after the graceful termination period specified in terminationGracePeriodSeconds expires.
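To reduce the risk of this recurring, the grace period can be raised in the pod template of the Cassandra replication controller so the database has time to flush and stop cleanly. A minimal illustrative fragment; the value 1800 is an example, not a Red Hat recommendation (the Kubernetes default is 30 seconds):

```yaml
spec:
  template:
    spec:
      # Example value only: allow Cassandra up to 30 minutes to shut
      # down cleanly before the kubelet sends SIGKILL (default: 30s).
      terminationGracePeriodSeconds: 1800
```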
Diagnostic Steps
Check the Cassandra pods:
$ oc project openshift-infra
$ oc get pods -n openshift-infra
[...]
Check the logs from the Cassandra pod:
$ oc logs [hawkular-cassandra-pod_name]
Search for errors like the following and get the name of the corrupted file:
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/cassandra_data/data/hawkular_metrics/data_compressed-[HASH]/mc-[NUMBER]-big-Data.db): corruption detected, chunk at 1708577 of length 21043.
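When the log is long, the corrupted file paths can be pulled out automatically. A small sketch, assuming the log was saved to a file and that the CorruptBlockException lines wrap the path in parentheses as shown above (the function name is made up for this article):

```shell
# Print the unique corrupted SSTable paths found in a saved Cassandra log.
find_corrupt_files() {
  grep -oE '\(/cassandra_data/[^)]*\)' "$1" | tr -d '()' | sort -u
}
# Usage:
# oc logs [hawkular-cassandra-pod_name] > cassandra.log
# find_corrupt_files cassandra.log
```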
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.