Cassandra sees extra nodes
Environment
- Red Hat OpenShift Container Platform
- 3.6
- 3.9
Issue
- Hawkular Metrics is down.
- Metrics are not working and the URL is not accessible; pod logs are attached.
- We are currently facing an issue with Cassandra and hawkular-metrics; we expect both Cassandra and metrics to be working properly.
- First we saw that the metrics pod was not running, and then we checked the logs from the Cassandra pod.
- Cassandra sees nodes that don't exist.
- hawkular-metrics tries an invalid IP address when connecting to Cassandra.
Resolution
- First, attempt to scale down all of the Cassandra node(s), let all of the pods stop completely, and then scale them back up:
# Do this for all of the Cassandra instances you have:
oc scale rc hawkular-cassandra-1 --replicas=0
oc scale rc hawkular-cassandra-2 --replicas=0
oc scale rc hawkular-cassandra-3 --replicas=0
# Wait for all three pods to stop completely
oc scale rc hawkular-cassandra-1 --replicas=1
oc scale rc hawkular-cassandra-2 --replicas=1
oc scale rc hawkular-cassandra-3 --replicas=1
- This should help, but if there are still extra node(s), proceed with the steps below.
- Identify the IP address(es) of the invalid Cassandra node(s) by checking the health of the Cassandra cluster.
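The cluster health can be checked from inside any running Cassandra pod with nodetool status. The awk filter is a sketch that assumes the column layout shown in the Diagnostic Steps below (first column is the node state, second column is the address); `<cassandra pod>` is a placeholder for one of your real pod names:

```shell
# Print cluster status; nodes reported as "DN" (Down/Normal) are the stale entries:
oc -n openshift-infra exec <cassandra pod> -- nodetool status

# Optionally, pull out just the addresses of the DN nodes:
oc -n openshift-infra exec <cassandra pod> -- nodetool status | awk '$1 == "DN" { print $2 }'
```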
- Using the built-in assassinate command, remove the invalid Cassandra node(s):
# Run this against any valid Cassandra pod name, once per bad IP address:
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
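To confirm the removal worked, re-check the cluster status and verify that no DN entries remain (same `<cassandra pod>` placeholder and column layout assumed as above):

```shell
# Should print 0 once all stale nodes have been assassinated:
oc -n openshift-infra exec <cassandra pod> -- nodetool status | awk '$1 == "DN"' | wc -l
```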
- Finally, after forcibly removing the invalid Cassandra node(s), scale hawkular-metrics down, wait for it to stop completely, and then scale back up:
oc scale rc hawkular-metrics --replicas=0
# Wait for the existing pod to stop completely
oc scale rc hawkular-metrics --replicas=1
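The restart can be watched until the pod reports Running and ready; this sketch assumes the metrics components run in the openshift-infra project, as in the commands above:

```shell
# Watch the metrics pods come back up; Ctrl+C once hawkular-metrics is Running and ready:
oc -n openshift-infra get pods -w
```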
Please note that after the hawkular-metrics pod(s) start up following the corrective steps above, allow some time for everything to come back up properly.
Root Cause
- For some as-yet-undetermined reason, Cassandra believes there are more than the expected number of Cassandra pods in its "cluster".
- This means there is at least one extra IP address that hawkular-metrics will try, causing it to fail and forcing hawkular-metrics into a restart loop.
- Removing these extra Cassandra node(s) and forcing a restart of hawkular-metrics (both described above) forces Cassandra to acknowledge only real nodes and hawkular-metrics to try only valid endpoints.
Diagnostic Steps
- When checking the health of Cassandra, there will be additional nodes and IP addresses in the list.
- These will commonly be listed as DN, or Down and Normal.
- For example, the cluster below only has 3 Cassandra pods, but it believes there are 6 nodes:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <IP> 286.99 KB 256 15.5% <HOST_ID> rack1
DN <IP> 299.33 KB 256 15.9% <HOST_ID> rack1
UN <IP> 100.86 MB 256 18.5% <HOST_ID> rack1
DN <IP> 249.1 KB 256 17.6% <HOST_ID> rack1
UN <IP> 76.98 MB 256 15.9% <HOST_ID> rack1
UN <IP> 91.86 MB 256 16.6% <HOST_ID> rack1
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.