Redis failover causes Event-Driven Ansible activation failures

Updated 30 Sept 2024

Issue description

When a primary Redis node enters a 'failed' state and a new primary node is promoted, the Event-Driven Ansible workers and scheduler cannot reconnect to the Redis cluster. As a result, activations fail until the containers or pods are recycled.

Use the procedures in this article to manually recycle the containers or pods.

IMPORTANT: The scheduler container or pod must be restarted before the workers.

Containerized deployment workaround

To enable Event-Driven Ansible activations to resume in a containerized deployment, the following containers must be manually stopped and restarted.

NOTE: The podman restart command does not resolve this issue because it stops the containers but does not perform the restart action. Use separate podman stop and podman start commands, as shown in the following steps.

Stop and restart the Event-Driven Ansible scheduler container by entering the following commands:
NOTE: The scheduler container must be restarted first.

$ podman stop automation-eda-scheduler
$ podman start automation-eda-scheduler

Stop and restart the Event-Driven Ansible worker containers by entering the following commands:

$ podman stop automation-eda-worker-1
$ podman start automation-eda-worker-1
$ podman stop automation-eda-worker-2
$ podman start automation-eda-worker-2

Stop and restart the Event-Driven Ansible activation worker containers by entering the following commands:

$ podman stop automation-eda-activation-worker-1
$ podman start automation-eda-activation-worker-1
$ podman stop automation-eda-activation-worker-2
$ podman start automation-eda-activation-worker-2
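The three container steps can be combined into a small shell script. This is a minimal sketch, assuming the default container names used in this article (one scheduler, two workers, two activation workers); `EDA_CONTAINERS` and `restart_eda_containers` are hypothetical names chosen for illustration. It recycles the containers in list order, scheduler first, and stops at the first command that fails.

```shell
#!/bin/sh
# Minimal sketch: recycle the Event-Driven Ansible containers with an explicit
# stop/start, scheduler first, then workers, then activation workers.
# Assumption: default container names; edit the list to match your installation.
# EDA_CONTAINERS and restart_eda_containers are hypothetical names.
EDA_CONTAINERS="automation-eda-scheduler
automation-eda-worker-1
automation-eda-worker-2
automation-eda-activation-worker-1
automation-eda-activation-worker-2"

restart_eda_containers() {
  # Unquoted on purpose: split the newline-separated list into container names.
  for name in $EDA_CONTAINERS; do
    echo "Recycling $name"
    # Separate stop and start; podman restart does not resolve the issue.
    podman stop "$name" || return 1
    podman start "$name" || return 1
  done
}
```

Source the file and run `restart_eda_containers`. The order of the list is what enforces the scheduler-first requirement, so keep the scheduler entry at the top if you edit it.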

Operator-based deployment workaround

In an OpenShift Container Platform deployment, pods are automatically re-created when they are deleted. To enable Event-Driven Ansible activations to resume in an operator-based deployment, identify the scheduler, worker, and activation worker pods and delete them.

Identify the scheduler, worker, and activation worker pods in the OpenShift Container Platform console by navigating to Workloads → Pods and locating the Event-Driven Ansible pods, for example:

aap-eda-scheduler-<UID>
aap-eda-scheduler-<UID>
aap-eda-worker-<UID>
aap-eda-worker-<UID>
aap-eda-activation-worker-<UID>
aap-eda-activation-worker-<UID>

where <UID> is the unique identifier automatically generated for each pod.
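If you prefer the command line to the console, the same pods can be listed with oc. This is a minimal sketch, assuming you are logged in with oc, the current project is the namespace where Ansible Automation Platform is installed, and the pod names start with `aap-eda` as in the example above; `EDA_POD_PREFIX` and `list_eda_pods` are hypothetical names.

```shell
#!/bin/sh
# Minimal sketch: list the Event-Driven Ansible scheduler, worker, and
# activation worker pods from the command line instead of the console.
# Assumptions: logged in with oc, current project is the Ansible Automation
# Platform namespace, pod names start with "aap-eda" as in this article.
# EDA_POD_PREFIX and list_eda_pods are hypothetical names.
EDA_POD_PREFIX="${EDA_POD_PREFIX:-aap-eda}"

list_eda_pods() {
  # "oc get pods -o name" prints one pod/<name> entry per line.
  oc get pods -o name |
    grep -E "^pod/${EDA_POD_PREFIX}-(scheduler|worker|activation-worker)-"
}
```

If your instance uses a different name, set `EDA_POD_PREFIX` before calling `list_eda_pods`.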

Delete the Event-Driven Ansible scheduler pods by entering the following commands:
NOTE: The scheduler pod must be restarted first.

$ oc delete pod aap-eda-scheduler-<UID>
$ oc delete pod aap-eda-scheduler-<UID>

Delete the Event-Driven Ansible worker pods by entering the following commands:

$ oc delete pod aap-eda-worker-<UID>
$ oc delete pod aap-eda-worker-<UID>

Delete the Event-Driven Ansible activation worker pods by entering the following commands:

$ oc delete pod aap-eda-activation-worker-<UID>
$ oc delete pod aap-eda-activation-worker-<UID>
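The three deletion steps can be scripted so that the pod names do not have to be copied by hand. This is a minimal sketch, assuming you are logged in with oc, the current project is the Ansible Automation Platform namespace, and the pod names start with `aap-eda` as in this article; `EDA_POD_PREFIX` and `delete_eda_pods` are hypothetical names. It deletes the scheduler pods first, then the worker pods, then the activation worker pods.

```shell
#!/bin/sh
# Minimal sketch: delete the Event-Driven Ansible pods so the operator
# re-creates them, scheduler pods first, then worker and activation worker pods.
# Assumptions: logged in with oc, current project is the Ansible Automation
# Platform namespace, pod names start with "aap-eda" as in this article.
# EDA_POD_PREFIX and delete_eda_pods are hypothetical names.
EDA_POD_PREFIX="${EDA_POD_PREFIX:-aap-eda}"

delete_eda_pods() {
  for role in scheduler worker activation-worker; do
    # Anchored match, so "worker" does not also select activation worker pods.
    pods=$(oc get pods -o name | grep "^pod/${EDA_POD_PREFIX}-${role}-")
    if [ -z "$pods" ]; then
      echo "No ${EDA_POD_PREFIX}-${role} pods found; skipping" >&2
      continue
    fi
    # Unquoted on purpose: pass each pod/<name> entry as its own argument.
    oc delete $pods || return 1
  done
}
```

This sketch does not wait for the re-created scheduler pod to become ready before it deletes the worker pods; if you want that guarantee, run the steps separately and check the pod status with `oc get pods` in between.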

Product(s): Red Hat Ansible Automation Platform
Component: redis
Category: Troubleshoot
Article Type: General
SBR: Ansible