Why does a cluster node fence other nodes that haven't started cman yet when it initially joins the cluster in RHEL 4, 5, or 6?

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add-On
  • Red Hat Cluster Suite (RHCS) 4
  • Fence devices defined in /etc/cluster/cluster.conf, and fenced enabled for use by the cluster (FENCE_JOIN="yes" or left unset in /etc/sysconfig/cman)
  • <fence_daemon clean_start="0"> set, or clean_start left unset, in /etc/cluster/cluster.conf

Issue

  • When starting cman on node 1, it fences node 2 (which has not yet started cman).
  • Nodes that are not in the cluster are rebooted when other nodes join the cluster for the first time.
  • Both nodes in a two-node cluster were rebooted. Each node fenced the other: the first node to start fenced its peer, and the peer then fenced the first node as it was coming back up.
  • Why is my cluster node powering off a node that isn't in the cluster when it is booted up?

Resolution

  • If possible, start cman on all nodes at the same time, or boot all nodes at the same time if cman is enabled at boot with chkconfig.

  • Consider increasing post_join_delay in /etc/cluster/cluster.conf if more time is needed before startup fencing is carried out, for example when nodes are started in a staggered fashion or take different amounts of time to boot.
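
As an illustration of the second point, the fragment below is a minimal sketch of where post_join_delay is set. The cluster name, the config_version value, and the 60-second delay are assumptions chosen for the example, not recommendations; pick a delay that covers the longest expected gap between the first and last node starting cman.

```xml
<!-- Hypothetical example: cluster name, config_version, and the delay value are placeholders -->
<cluster name="example_cluster" config_version="2">
  <!-- Wait 60 seconds after the first quorate membership forms
       before fencing nodes that have not yet joined -->
  <fence_daemon clean_start="0" post_join_delay="60"/>
  <!-- clusternodes, fencedevices, and other sections omitted from this sketch -->
</cluster>
```

Remember to increment config_version whenever cluster.conf is edited, and make sure every node has the updated file before the cluster is next started.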

Root Cause

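When a cluster membership forms for the first time and gains quorum, the nodes in that membership wait post_join_delay seconds and then fence any nodes defined in /etc/cluster/cluster.conf that have still not joined. This is necessary to ensure that any nodes which are not in communication have released any shared resources and are reset to a "known" state. In a two-node cluster where a single node can be quorate on its own, this means the first node to start can fence its peer if the peer does not join within that time.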


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.