RHEL High Availability cluster nodes on IBM z Systems experience STONITH-device timeouts around midnight on a nightly basis

Solution Verified - Updated 14 Jun 2024

Environment

Red Hat Enterprise Linux (RHEL) 7, 8 with the High Availability Add-On
Cluster members are z/VM guests using fence_zvmip SMAPI fencing

Issue

Cluster nodes running as z/VM guests report STONITH-device monitor timeouts around midnight
IBM z Systems guests fail to fence, and the fence device reports timeouts
z/VM guest fencing through SMAPI is slow or fails around midnight every night

Resolution

Recommended resolution: Change z/VM DirMaint sleep duration

Summary: Reduce the nightly sleep duration of the z/VM directory management system (DirMaint) to a value that is less likely to cause STONITH operation timeouts.

Steps:

Log on to z/VM as MAINT and issue the following command:
```
DIRM SEND DIRMAINT DATADVH
```
This sends a copy of the DIRMAINT DATADVH configuration file to your reader.
Receive the file to your A disk and edit it with
```
XEDIT DIRMAINT DATADVH A
```
Change the line that looks like
```
==/==/== 23:59:00 00/00/00 CP SLEEP 2 MIN
```
to
```
==/==/== 23:59:55 00/00/00 CP SLEEP 10 SEC
```
NOTE: There are two changes here - the start time (23:59:00 -> 23:59:55) and the duration of the sleep command (2 MIN -> 10 SEC).
Enter FILE to save and exit XEDIT.
After editing the file, use the DIRM FILE DIRMAINT DATADVH command to send the updated file back to the DIRMAINT service machine.

Alternative resolution: Adjust pacemaker STONITH-device op timeouts

Summary: Extend the operation timeouts on the fence_zvmip STONITH devices, usually at least for the reboot and monitor operations.

Steps: See the related solution: A stonith device is failing to start and/or reporting "Timed Out" errors in a RHEL 6 or 7 High Availability cluster with pacemaker
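
As a rough illustration of what such a change can look like, the sketch below builds a pcs command that raises the device-level monitor and reboot timeouts using the Pacemaker fencing attributes pcmk_monitor_timeout and pcmk_reboot_timeout. The device name zvm_fence and the 180s value are hypothetical placeholders, not values from this solution; the linked solution remains the authoritative procedure and may recommend different settings.

```shell
#!/bin/sh
# Assumptions: "zvm_fence" is a hypothetical STONITH device ID and 180s is an
# example value chosen to exceed the default 2-minute DirMaint sleep.
STONITH_ID="${STONITH_ID:-zvm_fence}"
TIMEOUT="${TIMEOUT:-180s}"

# Build the command as a string so it can be reviewed before it is run.
CMD="pcs stonith update ${STONITH_ID} pcmk_monitor_timeout=${TIMEOUT} pcmk_reboot_timeout=${TIMEOUT}"
echo "$CMD"

# After reviewing the output, run the printed command as root on one cluster node.
# To inspect the resulting device configuration:
#   RHEL 7:  pcs stonith show --full
#   RHEL 8:  pcs stonith config
```

The script only prints the command, so it is safe to run on any host; nothing is changed until the printed command is executed on a cluster node.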

Root Cause

By default, the DirMaint user directory management system in z/VM sleeps for 2 minutes around midnight in order to satisfy limitations of z/VM's job scheduling system. SMAPI requests issued during this sleep can time out, so STONITH operations that run around midnight may time out as well.

If the DirMaint sleep duration must remain as it is, the cluster's STONITH devices can instead be configured with higher timeouts (see the alternative resolution above).
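
To make the timing concrete, the following sketch computes when the sleep ends for the default schedule entry and for the recommended one. The helper names to_secs and sleep_end are hypothetical and exist only for this illustration; the start times and durations are the ones shown in the DIRMAINT DATADVH lines above.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# Leading zeros are stripped so that values such as 09 are not parsed as octal.
to_secs() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the time of day at which a sleep starting at $1 and lasting $2 seconds ends.
sleep_end() {
    end=$(( ( $(to_secs "$1") + $2 ) % 86400 ))
    printf '%02d:%02d:%02d\n' $(( end / 3600 )) $(( end % 3600 / 60 )) $(( end % 60 ))
}

sleep_end 23:59:00 120   # default entry (2 MIN):      prints 00:01:00
sleep_end 23:59:55 10    # recommended entry (10 SEC): prints 00:00:05
```

With the default entry, a request that arrives just as the sleep begins can wait up to 120 seconds before DirMaint responds; with the recommended entry the worst-case wait drops to about 10 seconds.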

SBR

Clusterha

Product(s)

Red Hat Enterprise Linux

Components

cluster

Category

Troubleshoot
