RHEL High Availability cluster nodes on IBM z Systems experience STONITH-device timeouts around midnight on a nightly basis
Environment
- Red Hat Enterprise Linux (RHEL) 7, 8 with the High Availability Add-On
- Cluster members are z/VM guests using fence_zvmip SMAPI fencing
Issue
- My z guest cluster-nodes report STONITH-device monitor timeouts around midnight
- IBM z Systems guests are failing to fence - showing timeouts from the fence-device
- z/VM guest fencing through SMAPI is slow or failing around midnight on a nightly basis
Resolution
Recommended resolution: Change z/VM Dirmaint sleep duration
Summary: Reduce the nightly sleep duration of the z/VM directory management system (DirMaint) so that it is less likely to cause STONITH operation timeouts.
Steps:
- Log on to z/VM as MAINT and issue the following command: DIRM SEND DIRMAINT DATADVH
  This will send a copy of the DIRMAINT DATADVH configuration file to your reader.
- Receive the file to your A disk and edit the file with XEDIT DIRMAINT DATADVH A
- Change the line that looks like
  ==/==/== 23:59:00 00/00/00 CP SLEEP 2 MIN
  to
  ==/==/== 23:59:55 00/00/00 CP SLEEP 10 SEC
  NOTE: There are two changes here - the start time (23:59:00 -> 23:59:55) and the duration of the sleep command (2 MIN -> 10 SEC).
- Enter FILE to save and exit XEDIT.
- After editing the file, use the DIRM FILE DIRMAINT DATADVH command to send the updated file back to the DIRMAINT service machine.
Alternative resolution: Adjust pacemaker STONITH-device op timeouts
Summary: Extend the operation timeouts on the fence_zvmip STONITH devices, usually at least for the reboot and monitor operations.
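The article does not give exact commands for this alternative. As a minimal sketch, one possible approach is to raise Pacemaker's per-device fencing attributes pcmk_reboot_timeout and pcmk_monitor_timeout with pcs stonith update. Note that these are device attributes rather than the operation timeouts named in the summary, so treat this as a stand-in technique. The device ID zvm_fence and the 180s value are assumptions chosen for illustration (the value is only picked to exceed the 2-minute sleep). The helper build_timeout_cmd is hypothetical and only prints the command so it can be reviewed before being run on a cluster node.

```shell
#!/bin/sh
# Sketch only: prints a pcs command that raises fencing timeouts on one
# STONITH device. Nothing is changed on the cluster by this script.
# ASSUMPTIONS: "zvm_fence" is a hypothetical device ID and "180s" is an
# illustrative value chosen to be longer than the 2-minute DirMaint sleep.
STONITH_ID="${STONITH_ID:-zvm_fence}"
TIMEOUT="${TIMEOUT:-180s}"

# build_timeout_cmd <stonith-id> <timeout>
# Emits a "pcs stonith update" command that sets pcmk_reboot_timeout and
# pcmk_monitor_timeout on the given device.
build_timeout_cmd() {
    printf 'pcs stonith update %s pcmk_reboot_timeout=%s pcmk_monitor_timeout=%s\n' \
        "$1" "$2" "$2"
}

# Print the command for review; run it manually on a cluster node if it fits.
build_timeout_cmd "$STONITH_ID" "$TIMEOUT"
```

Whether the monitor operation's own timeout on the STONITH resource also needs to be raised depends on how the device was created, so check the existing configuration before applying any change.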
Root Cause
By default, the Dirmaint user directory management system in z/VM sleeps for 2 minutes around midnight in order to satisfy limitations of z/VM's job scheduling system. This means that SMAPI requests issued around midnight can time out - and thus STONITH operations at that time may experience these timeouts.
Alternatively, if the DirMaint sleep duration must remain as it is, the cluster's STONITH devices can be configured with longer STONITH timeouts (the alternative resolution above).
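To help confirm that observed timeouts match this root cause, the following sketch checks whether a timestamp falls inside the default sleep window. It assumes the default DIRMAINT DATADVH entry (a 2-minute sleep starting at 23:59:00, so 23:59:00 up to 00:01:00) and that the timestamp is in the same time zone as the z/VM host. The function name in_default_sleep_window is hypothetical and not part of any Red Hat or IBM tooling.

```shell
#!/bin/sh
# Sketch only: in_default_sleep_window is a hypothetical helper.
# ASSUMPTIONS: the default DIRMAINT DATADVH entry is in effect
# (CP SLEEP 2 MIN starting at 23:59:00) and the HH:MM:SS timestamp is in
# the same time zone as the z/VM host.
in_default_sleep_window() {
    # The window covers 23:59:00 up to, but not including, 00:01:00.
    case "$1" in
        23:59:[0-5][0-9] | 00:00:[0-5][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: classify a timestamp copied from a STONITH timeout log entry.
ts="${1:-23:59:30}"
if in_default_sleep_window "$ts"; then
    echo "$ts falls inside the default DirMaint sleep window"
else
    echo "$ts falls outside the default DirMaint sleep window"
fi
```

Timeouts that consistently fall outside this window would suggest a cause other than the DirMaint sleep.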
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.