Azure VMs running as RHEL High Availability cluster members take a very long time to be fenced, or fencing fails or times out before the VM shuts down

Solution Verified - Updated 14 Jun 2024

Environment

Red Hat Enterprise Linux Server 7, 8 (with the High Availability Add On)
pacemaker
Microsoft Azure VMs as cluster members

Issue

I am running a RHEL HA cluster on Microsoft Azure VMs. When a node gets fenced it takes a very long time to complete.
Fencing of RHEL cluster nodes on Azure with fence_azure_arm times out or takes a very long time.

Resolution

With the errata below, the fence_azure_arm fencing agent no longer incurs a long delay when fencing a VM. Previously the VM was shut down gracefully, which could take long enough for fencing to fail. The VM is no longer shut down gracefully, but is hard killed. For more information, see the following article: Azure Virtual Machine PowerOff now available with fast shutdown

Red Hat Enterprise Linux 7

The issue (bz1709110) has been resolved with errata RHBA-2019:1345 with the following package(s): fence-agents-4.2.1-11.el7_6.8 or later.
The issue (bz1709109) has been resolved with errata RHEA-2019:1320 with the following package(s): fence-agents-4.0.11-86.el7_5.8 or later for RHEL 7.5.z releases.
The issue (bz1709108) has been resolved with errata RHEA-2019:1501 with the following package(s): fence-agents-4.0.11-66.el7_4.12 or later for RHEL 7.4.z releases.

Red Hat Enterprise Linux 8

The issue (bz1700546) has been resolved with errata RHBA-2019:3326 with the following package(s): fence-agents-4.2.1-30.el8 or later for RHEL 8.1.z releases.
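
To check whether a node already carries a fixed build, the installed version-release string can be compared with the one listed for its release stream. The following is a minimal sketch, not a Red Hat tool: `version_ge` and `check_fence_agents` are hypothetical helper names, and GNU `sort -V` only approximates RPM's own version ordering, which is adequate for builds from the same stream.

```shell
#!/bin/bash
# Return 0 if version string $1 is greater than or equal to $2.
# Assumption: GNU sort -V is close enough to RPM ordering for same-stream builds.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether an installed version-release
# is at or above the fixed version-release from the errata list.
check_fence_agents() {
    local installed="$1" fixed="$2"
    if version_ge "$installed" "$fixed"; then
        echo "fixed"
    else
        echo "needs update"
    fi
}

# Example with literal strings only (no rpm query is run here).
check_fence_agents "4.2.1-11.el7_6.7" "4.2.1-11.el7_6.8"
```

On a cluster node, the first argument could come from a query such as `rpm -q --qf '%{VERSION}-%{RELEASE}\n' fence-agents-azure-arm`, assuming the agent is installed from the fence-agents-azure-arm subpackage.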

Workaround

Adjust the stonith device's timeout and retry attributes to values that accommodate a longer shutdown and response time for a VM. Red Hat recommends allowing for a reboot time of at least 3 minutes, and more if longer times are observed in the environment in question.

In the stonith device's settings, set pcmk_reboot_timeout higher than the total time a reboot has been observed to take; set power_timeout to at least half of that observed time; and consider setting pcmk_reboot_retries to a value greater than 0 so that a failed operation is retried. For example, to update an existing device to handle an observed 3-minute reboot time:
```
# pcs stonith update azure-stonith pcmk_reboot_timeout=480 power_timeout=240 pcmk_reboot_retries=4 
```
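
The arithmetic behind these values can be sketched as a small helper that derives both timeouts from an observed reboot time in seconds. This is an illustration only: `suggest_timeouts` is a hypothetical function, and the default safety factor of 2 is an assumption rather than a Red Hat recommendation (the example above uses a larger margin).

```shell
#!/bin/bash
# Hypothetical helper: derive stonith timeouts from an observed reboot time.
# $1 = observed reboot time in seconds
# $2 = safety factor (assumed default of 2, for illustration only)
suggest_timeouts() {
    local observed="$1" factor="${2:-2}"
    # pcmk_reboot_timeout must exceed the observed reboot time.
    local reboot_timeout=$(( observed * factor ))
    # Half of the reboot timeout is at least half the observed time
    # whenever the factor is 1 or more.
    local power_timeout=$(( reboot_timeout / 2 ))
    echo "pcmk_reboot_timeout=${reboot_timeout} power_timeout=${power_timeout}"
}

# Example: an observed 3-minute (180-second) reboot.
suggest_timeouts 180
```

The output can be passed straight to the update command, for example `pcs stonith update azure-stonith $(suggest_timeouts 180) pcmk_reboot_retries=4`.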
Consider running test scenarios that match real-world load conditions and measuring how long the fence agent takes to reboot the VM completely. An easy way to do this is to run the fence_azure_arm agent manually with a very high timeout setting and time how long it takes to complete. NOTE: This will reboot the VM in question. Example:
```
# fence_azure_arm -l e04a6a49-9f00-xxxx-xxxx-a8bdda4af447 -p z/a05AwCN0IzAjVwXXXXXXXEWIoeVp0xg7QT//JE= --resourceGroup azrhelclirsgrp --tenantId 77ecefb6-cff0-XXXX-XXXX-757XXXX9485 --subscriptionId XXXXXXXX-38b4-4527-XXXX-012d49dfc02c -n <vm_name> -o reboot --power-timeout=6000
```
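
One way to capture the duration of such a test run is a small wrapper that records wall-clock time around any command. This is a minimal sketch: `time_command` is a hypothetical helper, not part of the fence agents, and it measures whole seconds only.

```shell
#!/bin/bash
# Hypothetical helper: run a command and report how long it took.
# The summary goes to stderr so it stays separate from the command's output.
time_command() {
    local start end rc
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "elapsed_seconds=$(( end - start )) exit_status=${rc}" >&2
    # Preserve the wrapped command's exit status for the caller.
    return "$rc"
}

# Harmless demonstration; on a cluster node the fence_azure_arm
# command line from the example above would take the place of 'sleep 1'.
time_command sleep 1
```

The reported elapsed time, measured under realistic load and repeated a few times, gives the observed reboot time to base pcmk_reboot_timeout and power_timeout on.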

Root Cause

The Azure API method that the RHEL High Availability fence_azure_arm agent uses to shut down a VM may take an excessive amount of time to complete under some circumstances.

This is not observed in all deployments, but it does occur in some situations. Even longer runtimes may be tied to load in the environment and other unpredictable conditions.

Red Hat is working with Microsoft on potential solutions to address this problem. Please contact Red Hat Support if you are or may be affected by this problem.

SBR

Clusterha

Product(s)

Red Hat Enterprise Linux

Components

cluster

Category

Troubleshoot
