Cassandra sees extra nodes
Environment
- Red Hat OpenShift Container Platform
- 3.6
- 3.9
Issue
- Hawkular Metrics is down.
- Metrics are not working and the URL is not accessible; pod logs are attached.
- We are currently facing an issue with Cassandra and hawkular-metrics; we expect both Cassandra and metrics to be working properly.
- First we saw that the metrics pod was not running, and then we checked the logs from the Cassandra pod.
- Cassandra sees nodes that don't exist.
- hawkular-metrics tries an invalid IP address when connecting to Cassandra.
Resolution
- First, attempt to scale down all of the Cassandra node(s), let all of the pods stop completely, and then scale them back up:
# Do this for all of the Cassandra instances you have:
oc scale rc hawkular-cassandra-1 --replicas=0
oc scale rc hawkular-cassandra-2 --replicas=0
oc scale rc hawkular-cassandra-3 --replicas=0
# Wait for all three pods to stop completely
oc scale rc hawkular-cassandra-1 --replicas=1
oc scale rc hawkular-cassandra-2 --replicas=1
oc scale rc hawkular-cassandra-3 --replicas=1
- This should help, but if there are still extra node(s), proceed with the steps below.
- Identify the IP address(es) of the invalid Cassandra node(s) by checking the health of the Cassandra cluster.
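The cluster health can be checked from inside any running Cassandra pod with nodetool status. The awk filter is a sketch that assumes the column layout shown in the Diagnostic Steps below (first column is the node state, second column is the address); `<cassandra pod>` is a placeholder for one of your real pod names:

```shell
# Print cluster status; nodes reported as "DN" (Down/Normal) are the stale entries:
oc -n openshift-infra exec <cassandra pod> -- nodetool status

# Optionally, pull out just the addresses of the DN nodes:
oc -n openshift-infra exec <cassandra pod> -- nodetool status | awk '$1 == "DN" { print $2 }'
```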
- Using the built-in assassinate command, remove the invalid Cassandra node(s):
# Run this against any valid Cassandra pod name, once per bad IP address:
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
oc -n openshift-infra exec <cassandra pod> -- nodetool assassinate <bad IP address>
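To confirm the removal worked, re-check the cluster status and verify that no DN entries remain (same `<cassandra pod>` placeholder and column layout assumed as above):

```shell
# Should print 0 once all stale nodes have been assassinated:
oc -n openshift-infra exec <cassandra pod> -- nodetool status | awk '$1 == "DN"' | wc -l
```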
- Finally, after forcibly removing the invalid Cassandra node(s), scale hawkular-metrics down, wait for it to stop completely, and then scale back up:
oc scale rc hawkular-metrics --replicas=0
# Wait for the existing pod to stop completely
oc scale rc hawkular-metrics --replicas=1
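The restart can be watched until the pod reports Running and ready; this sketch assumes the metrics components run in the openshift-infra project, as in the commands above:

```shell
# Watch the metrics pods come back up; Ctrl+C once hawkular-metrics is Running and ready:
oc -n openshift-infra get pods -w
```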
Please note that after the hawkular-metrics pod(s) start up following the corrective steps above, allow some time for everything to come back up properly.
Root Cause
- For some as-yet-undetermined reason, Cassandra believes there are more than the expected number of Cassandra pods in its "cluster".
- This means there is at least one extra IP address that hawkular-metrics will try, causing it to fail and forcing hawkular-metrics into a restart loop.
- Removing these extra Cassandra node(s) and forcing a restart of hawkular-metrics (both described above) forces Cassandra to acknowledge only real nodes and hawkular-metrics to try only valid endpoints.
Diagnostic Steps
- When checking the health of Cassandra, there will be additional nodes and IP addresses in the list.
- These will commonly be listed as DN, or Down and Normal.
- For example, the cluster below only has 3 Cassandra pods, but it believes there are 6 nodes:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <IP> 286.99 KB 256 15.5% <HOST_ID> rack1
DN <IP> 299.33 KB 256 15.9% <HOST_ID> rack1
UN <IP> 100.86 MB 256 18.5% <HOST_ID> rack1
DN <IP> 249.1 KB 256 17.6% <HOST_ID> rack1
UN <IP> 76.98 MB 256 15.9% <HOST_ID> rack1
UN <IP> 91.86 MB 256 16.6% <HOST_ID> rack1
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.