Hawkular-Metrics fails to connect to Cassandra with a NullPointerException

Solution Verified - Updated 14 Jun 2024

Environment

Red Hat OpenShift Container Platform
- 3.6
- 3.9

Issue

Cassandra starts normally, but hawkular-metrics fails to connect to it and logs the following error:

Error: FATAL [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200006: An error occurred trying to connect to the Cassandra cluster: java.lang.RuntimeException: java.lang.NullPointerException
	at org.hawkular.metrics.api.jaxrs.DistributedLock.lockAndThen(DistributedLock.java:111)
	at org.hawkular.metrics.api.jaxrs.DistributedLock.lockAndThen(DistributedLock.java:95)
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.initJobsService(MetricsServiceLifecycle.java:711)

hawkular-metrics then crashes, even though Cassandra itself appears healthy.

Resolution

The first step is to identify the broken rows in the Cassandra database by querying both job index tables:

    $ oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"

    $ oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"

These commands will return something along the lines of:

    # oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"
    
    time_slice               | job_id                               | job_name               | job_params | job_type               | status | trigger
    --------------------------+--------------------------------------+------------------------+------------+------------------------+--------+----------------------------------------------------------------------------------------------------------------------
    <DATETIME_1> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c |                   null |       null |                   null |      1 |                                                                                                                 null
    <DATETIME_2> | 37f2e65a-f40e-461b-ab0a-d19091c9235e |     TEMP_TABLE_CREATOR |           {} |     TEMP_TABLE_CREATOR |   null |   {type: 1, trigger_time: 1523387340000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
    <DATETIME_3> | 20c4d9ff-bbc9-4f55-bd32-32dac26cbe00 |   TEMP_DATA_COMPRESSOR |           {} |   TEMP_DATA_COMPRESSOR |   null |   {type: 1, trigger_time: 1523394000000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
    <DATETIME_4> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c |          COMPRESS_DATA |           {} |          COMPRESS_DATA |      1 |   {type: 1, trigger_time: 1494601200000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
    <DATETIME_5> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c |          COMPRESS_DATA |           {} |          COMPRESS_DATA |      1 |   {type: 1, trigger_time: 1494608400000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
    <DATETIME_6> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c |          COMPRESS_DATA |           {} |          COMPRESS_DATA |   null |   {type: 1, trigger_time: 1494615600000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
    <DATETIME_7> | bd09bfc5-753e-4369-9079-94cfb9cb0ebe | DELETE_EXPIRED_METRICS |           {} | DELETE_EXPIRED_METRICS |   null | {type: 1, trigger_time: 1509580800000, delay: 60000, interval: 604800000, repeat_count: null, execution_count: null}
    
    (7 rows)

    # oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"

    time_slice               | job_id
    --------------------------+--------------------------------------
    <DATETIME_8> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c
    <DATETIME_9> | 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c
    
    (2 rows)

In the scheduled_jobs_idx output, one row has null in job_name, job_params, job_type, and trigger. This is the broken row. The next step is to delete its entries from both tables, substituting the time_slice and job_id values from your own output:

    $ oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "delete from hawkular_metrics.scheduled_jobs_idx where time_slice = '<DATETIME_1>' and job_id = 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c"

    $ oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl -e "delete from hawkular_metrics.finished_jobs_idx where time_slice = '<DATETIME_1>' and job_id = 0633a5ef-d0ac-480d-9c8c-4d08fd566d4c"
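
Spotting the broken row by eye is error-prone when the table is large. The following is a minimal sketch of a filter that reads the cqlsh output for scheduled_jobs_idx on standard input and prints a candidate delete statement for each row whose job_name and job_type are both null. The function name print_null_job_deletes is hypothetical, and the sketch assumes the seven-column layout shown above. Review the printed statements before running any of them.

```shell
# Hypothetical helper, not part of oc or cqlsh.
# Reads cqlsh tabular output for hawkular_metrics.scheduled_jobs_idx on stdin
# and prints one CQL delete statement per row whose job_name (column 3)
# and job_type (column 5) are both null.
# Assumes the 7-column, pipe-separated layout shown in this article.
print_null_job_deletes() {
  awk -F'|' -v q="'" '
    NF == 7 {
      # Trim leading and trailing whitespace from every column.
      for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($3 == "null" && $5 == "null") {
        printf "delete from hawkular_metrics.scheduled_jobs_idx where time_slice = %s%s%s and job_id = %s;\n", q, $1, q, $2
      }
    }'
}

# Possible usage (prints statements only, deletes nothing):
#   oc -n openshift-infra exec <cassandra_pod> -- cqlsh --ssl \
#     -e "select * from hawkular_metrics.scheduled_jobs_idx" | print_null_job_deletes
```

The header and separator lines are skipped because the header has no null columns and the separator contains no pipe characters. The sketch only covers scheduled_jobs_idx; the matching finished_jobs_idx entries still have to be removed as shown above.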

Finally, scale hawkular-metrics down and then back up so that it reconnects to Cassandra and starts cleanly:
```
# oc -n openshift-infra scale rc hawkular-metrics --replicas=0
# oc -n openshift-infra scale rc hawkular-metrics --replicas=1
```
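
After the restart, one way to confirm the workaround took effect is to check whether the new pod's log still contains the HAWKMETRICS200006 error code from the Issue section. The sketch below assumes the log is piped in on standard input. The function name has_connect_error and the <hawkular_metrics_pod> placeholder are hypothetical.

```shell
# Hypothetical helper: succeeds (exit status 0) if the log text on stdin
# contains the fatal Cassandra connection error code, fails otherwise.
has_connect_error() {
  grep -q 'HAWKMETRICS200006'
}

# Possible usage:
#   if oc -n openshift-infra logs <hawkular_metrics_pod> | has_connect_error; then
#     echo "hawkular-metrics still cannot connect to Cassandra"
#   fi
```

An absence of the error code only shows that this specific failure was not logged; it does not by itself prove that hawkular-metrics finished starting.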

Note:

This is only a workaround, and the problem might occur again. There has been little to no progress on a long-term fix through the known bug, so you might need to repeat these steps.
