RHEL High Availability cluster nodes on IBM z Systems experience STONITH-device timeouts around midnight on a nightly basis

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 7, 8 with the High Availability Add-On
  • Cluster members are z/VM guests using fence_zvmip SMAPI fencing

Issue

  • My z guest cluster-nodes report STONITH-device monitor timeouts around midnight
  • IBM z Systems guests are failing to fence - showing timeouts from the fence-device
  • z/VM guest fencing through SMAPI is slow or failing around midnight on a nightly basis

Resolution

Summary: Reduce the nightly sleep duration of the z/VM directory management system (DirMaint) to a value that is less likely to cause STONITH operation timeouts.

Steps:

  • Log on to z/VM as MAINT and issue the following command:

    DIRM SEND DIRMAINT DATADVH
    

    This will send a copy of the DIRMAINT DATADVH configuration file to your
    reader.

  • Receive the file to your A disk and edit it with:

    XEDIT DIRMAINT DATADVH A
    
  • Change the line that looks like

    ==/==/== 23:59:00 00/00/00 CP SLEEP 2 MIN
    

    to

    ==/==/== 23:59:55 00/00/00 CP SLEEP 10 SEC
    

    NOTE: There are two changes here - the start time (23:59:00 -> 23:59:55) and the duration of the sleep command (2 MIN -> 10 SEC).

  • Enter FILE to save and exit XEDIT.

  • After editing the file, use the DIRM FILE DIRMAINT DATADVH command to send the updated file back to the DIRMAINT service machine.
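
The effect of the edited entry can be checked with simple arithmetic. The following is a minimal sketch, assuming the entry's time field is the moment the CP SLEEP begins and that the sleep lasts exactly the stated duration; the helper name sleep_window is hypothetical and is not part of DirMaint.

```python
from datetime import datetime, timedelta

def sleep_window(start_hms, seconds):
    """Return (start, end) datetimes for a sleep that begins at start_hms."""
    # Arbitrary reference date; only the time of day matters here.
    start = datetime.strptime("2000-01-01 " + start_hms, "%Y-%m-%d %H:%M:%S")
    return start, start + timedelta(seconds=seconds)

# Assumption: values taken from the two DIRMAINT DATADVH lines shown above.
default_window = sleep_window("23:59:00", 120)  # CP SLEEP 2 MIN
reduced_window = sleep_window("23:59:55", 10)   # CP SLEEP 10 SEC

for label, (start, end) in (("default", default_window),
                            ("reduced", reduced_window)):
    length = int((end - start).total_seconds())
    print(f"{label}: {start:%H:%M:%S} -> {end:%H:%M:%S} ({length}s)")
```

Under these assumptions both windows still span midnight, but the period during which requests may be delayed shrinks from 120 seconds to 10 seconds.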

Alternative resolution: Adjust pacemaker STONITH-device op timeouts

Summary: Extend the operation timeouts on the fence_zvmip STONITH devices. In most cases this means at least the reboot and monitor operations.

Steps: See the solution "A stonith device is failing to start and/or reporting "Timed Out" errors in a RHEL 6 or 7 High Availability cluster with pacemaker".

Root Cause

By default, the DirMaint user directory management system in z/VM sleeps for 2 minutes around midnight (starting at 23:59:00 with the default entry shown above) in order to satisfy limitations of z/VM's job scheduling system. SMAPI requests issued during this window can time out, and so STONITH operations that run at that time may report timeouts.

If the DirMaint sleep duration must remain as it is, the cluster's STONITH devices can instead be configured with higher operation timeouts (see the alternative resolution above).
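
As a rough guide to sizing such a timeout, the following minimal sketch assumes that a request arriving during the sleep is simply held until the sleep ends and then completes in its normal time. That behavior is an assumption, not something confirmed here. The function names, the 5-second normal operation time, and the 10-second margin are all hypothetical; substitute values measured in your own environment.

```python
def worst_case_duration(sleep_seconds, normal_op_seconds):
    """Longest expected duration if the request arrives just as the sleep starts."""
    # Assumption: the request waits out the full sleep, then runs normally.
    return sleep_seconds + normal_op_seconds

def minimum_timeout(sleep_seconds, normal_op_seconds, margin_seconds=10):
    """Smallest timeout that covers the worst case plus a safety margin."""
    return worst_case_duration(sleep_seconds, normal_op_seconds) + margin_seconds

# Hypothetical figure: a fence operation that normally completes in 5 seconds.
NORMAL_OP_SECONDS = 5

timeout_with_default_sleep = minimum_timeout(120, NORMAL_OP_SECONDS)  # 2 MIN sleep
timeout_with_reduced_sleep = minimum_timeout(10, NORMAL_OP_SECONDS)   # 10 SEC sleep

print("default sleep:", timeout_with_default_sleep, "seconds")
print("reduced sleep:", timeout_with_reduced_sleep, "seconds")
```

With these illustrative numbers, the default 2-minute sleep calls for a timeout of at least 135 seconds, while the reduced 10-second sleep needs only about 25 seconds.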


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.