What are the options post_join_delay and post_fail_delay used for in a RHEL cluster?
Environment
- Red Hat Cluster Suite 4+
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
Issue
- Cluster nodes are fenced when a node joins the cluster.
- Can a cluster node that will be evicted be delayed from fencing?
- Why does my cluster take a long time to fence a node when post_fail_delay is set?
Resolution
To change these two parameters without restarting cman, you can refer to the following article: How do I update post_fail_delay or post_join_delay without restarting my cluster?
For more information on these options see the man page for fenced.
Root Cause
The post_join_delay and post_fail_delay attributes are described in our cluster administration guide. Below is a description of each attribute.
- post_join_delay: This is the number of seconds fenced will delay before actually fencing any victims after nodes join the domain. This delay gives nodes that have been tagged for fencing a chance to join the cluster and avoid being fenced. A delay of -1 causes the daemon to wait indefinitely for all nodes to join the cluster, so no nodes are actually fenced on startup. This attribute only applies when a node is joining a cluster; existing cluster members do not trigger the post_join_delay timer.
- post_fail_delay: This is the number of seconds fenced will delay before actually fencing a domain member that has failed. A cluster node will not be fenced if it has been rebooted and tries to rejoin the cluster before post_fail_delay completes. The post_fail_delay is 0 by default to minimize the time that other systems are blocked from fencing. Increasing the value of post_fail_delay is useful when a vmcore needs to be captured. If this attribute must be increased from the default, it is recommended to set it as low as possible while still achieving the targeted behavior, because the setting results in blocked activity and an inability to manage shared resources or carry out critical functions while the delay is ongoing. All cluster operations, such as fencing a cluster node, handing out new locks for GFS or GFS2, and relocating services, are blocked until the post_fail_delay timer has completed. There is no risk of GFS or GFS2 corruption, because new locks are not granted until fencing is complete, which occurs after the post_fail_delay timer has completed.
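As an illustration of where these attributes live, both are set on the fence_daemon element in /etc/cluster/cluster.conf. The fragment below is only a minimal sketch: the cluster name, the config_version number, and the post_join_delay value of 20 are placeholder assumptions rather than recommendations from this article, and post_fail_delay is shown at its default of 0.

```xml
<?xml version="1.0"?>
<!-- Hypothetical cluster name and config_version, for illustration only -->
<cluster name="examplecluster" config_version="2">
  <!-- post_fail_delay="0" is the default; post_join_delay="20" is an example value -->
  <fence_daemon post_fail_delay="0" post_join_delay="20"/>
  <!-- clusternodes, fencedevices, and rm sections omitted -->
</cluster>
```

When editing cluster.conf, the config_version value has to be incremented so the change is recognized as a newer configuration. Choose the actual delay values for your environment with the guidance above and the fenced man page in mind.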