What are the options post_join_delay and post_fail_delay used for in a RHEL cluster?

Environment

  • Red Hat Cluster Suite 4+
  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On

Issue

  • Cluster nodes are fenced when a node joins the cluster.
  • Can the fencing of a cluster node that is about to be evicted be delayed?
  • Why does my cluster take a long time to fence a node when post_fail_delay is set?

Resolution

To change these two parameters without restarting cman, see the article "How do I update post_fail_delay or post_join_delay without restarting my cluster?"

For more information on these options, see the man page for fenced.

Root Cause

The post_join_delay and post_fail_delay attributes are described in our cluster administration guide. Each attribute is described below.

  • post_join_delay: The number of seconds that fenced waits before actually fencing any victims after nodes join the fence domain. This delay gives nodes that have been tagged for fencing a chance to join the cluster and avoid being fenced. A delay of -1 causes the daemon to wait indefinitely for all nodes to join the cluster, so no nodes are actually fenced on startup. This attribute applies only when a node is joining a cluster; existing cluster members do not trigger the post_join_delay timer.

  • post_fail_delay: The number of seconds that fenced waits before actually fencing a domain member that has failed. A failed node will not be fenced if it has been rebooted and tries to rejoin the cluster before post_fail_delay completes. The post_fail_delay is 0 by default to minimize the time that other systems are blocked while waiting for fencing. Increasing the value of post_fail_delay is useful when a vmcore needs to be captured. If this attribute must be increased from the default, set it as low as possible while still achieving the targeted behavior, because shared resources cannot be managed and critical functions cannot be carried out while the delay is ongoing.

    All cluster operations, such as fencing a cluster node, handing out new locks for GFS or GFS2, and relocating services, are blocked until the post_fail_delay timer has completed. There is no risk of GFS or GFS2 corruption, because new locks are not granted until fencing is complete, and fencing occurs only after the post_fail_delay timer has completed.
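
In cluster.conf, both attributes are set on the fence_daemon element. The following is a minimal sketch, not an official tool, that reads the two values from a configuration fragment so they can be checked before the change is rolled out. The cluster name, config_version, the delay values, and the helper names read_fence_delays and describe_join_delay are assumptions made for illustration. Attributes that are missing from the fragment are left out of the result, since the defaults that fenced applies are not guessed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf fragment. The cluster name, config_version,
# and both delay values are placeholders chosen for illustration only.
CLUSTER_CONF = """\
<cluster name="example" config_version="2">
  <fence_daemon post_join_delay="30" post_fail_delay="0"/>
</cluster>
"""


def read_fence_delays(conf_text):
    """Return the delay attributes found on <fence_daemon>, as integers.

    Attributes absent from the XML are omitted from the result; this
    sketch does not substitute fenced's built-in defaults.
    """
    root = ET.fromstring(conf_text)
    fence_daemon = root.find("fence_daemon")
    delays = {}
    if fence_daemon is None:
        return delays
    for name in ("post_join_delay", "post_fail_delay"):
        value = fence_daemon.get(name)
        if value is not None:
            delays[name] = int(value)
    return delays


def describe_join_delay(seconds):
    """Describe a post_join_delay value; -1 means wait indefinitely."""
    if seconds == -1:
        return "wait indefinitely for all nodes to join"
    return f"wait {seconds} second(s) before fencing"


if __name__ == "__main__":
    delays = read_fence_delays(CLUSTER_CONF)
    if "post_join_delay" in delays:
        print("post_join_delay:", describe_join_delay(delays["post_join_delay"]))
    if "post_fail_delay" in delays:
        print("post_fail_delay:", delays["post_fail_delay"], "second(s)")
```

To inspect a real configuration, the text of the cluster's own cluster.conf could be passed to read_fence_delays in place of the embedded fragment.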
