When trying to capture a vmcore from a RHEL cluster node it is fenced off before it completes the dump
Environment
- Red Hat Enterprise Linux (RHEL), including:
- Red Hat Cluster Suite 4+ and Red Hat GFS 6.1
- Red Hat Enterprise Linux Server 5 (with the High Availability and Resilient Storage Add Ons)
- Red Hat Enterprise Linux Server 6 (with the High Availability and Resilient Storage Add Ons)
- Red Hat High Availability Cluster with 2 or more nodes
- Most common with power-fencing agents, but can affect fabric fencing if kdump or Diskdump is configured to dump the vmcore to SAN-based storage.
- Kdump (RHEL5/6) or Netdump/Diskdump (RHEL4) is configured on the cluster nodes.
Issue
- When trying to use Netdump or Diskdump, the cluster node is fenced before the capture has completed.
- When trying to use kdump, the cluster node is fenced before the capture has completed.
Resolution
- On RHEL6.2+, it is recommended that the fence_kdump fence agent is used as the primary fencing agent, and the normal fencing agent is used as a secondary fencing device.
  - If fence_kdump hears that kdump is preparing to dump a vmcore from the to-be-fenced node, it considers that fencing has succeeded and begins recovering services immediately.
  - This typically allows services to recover faster than using the post_fail_delay method below.
- On RHEL4, RHEL5 and RHEL6 (especially prior to RHEL6.2), increase post_fail_delay to a value large enough so that the vmcore will be completely captured before the node is fenced.
  - Although not mutually exclusive, post_fail_delay usually should not be set when fence_kdump is used if it is solely to assist with vmcore capture.
  - The value of post_fail_delay can be adjusted depending on the time required to capture a complete vmcore. This depends on the kdump configuration and will be different for each cluster configuration. The value is in seconds.

        <cluster>
          <fence_daemon post_fail_delay="300" post_join_delay="3"/>
          ....
        </cluster>

  - With cman-2.0.115-34.el5 and later, post_fail_delay can be adjusted without restarting the cluster.
    - For RHEL4 and cman prior to cman-2.0.115-34.el5, a cluster restart will be required to pick up the new changes.
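The sizing advice for post_fail_delay can be sketched in Python. This is only an illustration under stated assumptions: the 200-second capture time, the 1.5 safety factor, the sample configuration, and both helper names are hypothetical, so measure a real vmcore capture on your own hardware before choosing a value. The sketch also assumes the usual practice of incrementing config_version whenever cluster.conf is changed.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed cluster.conf used only for this sketch.
SAMPLE_CONF = """<cluster name="example" config_version="7">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
</cluster>"""


def suggested_post_fail_delay(measured_dump_seconds, safety_factor=1.5):
    """Pad a measured vmcore capture time and round up to whole seconds."""
    return math.ceil(measured_dump_seconds * safety_factor)


def set_post_fail_delay(conf_xml, delay_seconds):
    """Return the configuration text with post_fail_delay set to delay_seconds."""
    root = ET.fromstring(conf_xml)
    daemon = root.find("fence_daemon")
    if daemon is None:
        # No fence_daemon element yet: create one directly under <cluster>.
        daemon = ET.SubElement(root, "fence_daemon")
    daemon.set("post_fail_delay", str(delay_seconds))
    # Assumption: each configuration change is paired with a version bump.
    root.set("config_version", str(int(root.get("config_version", "0")) + 1))
    return ET.tostring(root, encoding="unicode")


# Assumed measurement: a full capture took 200 seconds on a test panic.
delay = suggested_post_fail_delay(200)
updated = set_post_fail_delay(SAMPLE_CONF, delay)
print(updated)
```

With these assumed inputs the suggested delay comes out at 300 seconds, matching the example value shown above. The padding factor is a judgment call: a larger factor protects slow dumps but also lengthens the time during which cluster recovery is blocked.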
Root Cause
- It is known and expected behavior that cluster fencing can interrupt Kdump/Diskdump/Netdump before they finish exporting the vmcore after a kernel panic.
- When a cluster node crashes, the cluster node will be evicted by the cluster and fenced. This behavior can interfere with Kdump, Netdump and Diskdump, which typically need more than the few seconds that the cluster provides (by default) to dump the vmcore before fencing.
- All cluster activities (such as fencing a cluster node, handing out new locks for GFS or GFS2, and relocating cluster services) will be blocked until the post_fail_delay timer has completed, or fence_kdump sends an acknowledgment to the cluster.
  - There is no risk of GFS or GFS2 corruption, since new locks will not be granted until fencing is complete, which occurs after the post_fail_delay timer has completed.
  - fence_kdump is considered superior to setting a large post_fail_delay because kdump should notify the fencing node as soon as kdump begins, which should be much earlier than running the full course of post_fail_delay (which needs to allow for dumping the whole vmcore).
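The difference in recovery time between the two approaches can be illustrated with a small, deliberately simplified Python model. Every timing here is a hypothetical input rather than a measured or documented value, and the function names and the fence_kdump_wait_seconds parameter are assumptions made for the sketch: it stands for however long the primary fence_kdump method is allowed to wait before the secondary fencing device takes over.

```python
def blocked_with_post_fail_delay(post_fail_delay, fence_action_seconds):
    """Seconds recovery stays blocked: the full delay, then the fence action."""
    return post_fail_delay + fence_action_seconds


def blocked_with_fence_kdump(kdump_notify_seconds, fence_kdump_wait_seconds,
                             fence_action_seconds):
    """Seconds recovery stays blocked when fence_kdump is the primary method.

    kdump_notify_seconds is None when the failed node never reports that
    kdump has started (for example, a hang rather than a panic).
    """
    if (kdump_notify_seconds is not None
            and kdump_notify_seconds <= fence_kdump_wait_seconds):
        # The notification counts as successful fencing; recovery starts now.
        return kdump_notify_seconds
    # No notification in time: fall through to the secondary fencing device.
    return fence_kdump_wait_seconds + fence_action_seconds


# Hypothetical timings, in seconds.
delay_only = blocked_with_post_fail_delay(300, 20)
with_kdump = blocked_with_fence_kdump(45, 60, 20)
fallback = blocked_with_fence_kdump(None, 60, 20)
print(delay_only, with_kdump, fallback)
```

Under these assumed numbers, recovery is blocked for 320 seconds with post_fail_delay alone, 45 seconds when kdump notifies the cluster, and 80 seconds when no notification arrives and the secondary device has to fence the node.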
Diagnostic Steps
This solution may apply if the following symptoms are present:
- You have a Red Hat High Availability cluster with 2 or more nodes.
- Your cluster uses a power-fencing agent (this can also occur with fabric or SCSI fencing if kdump uses FC-attached storage).
- One of the cluster nodes suffers a kernel panic, but only a partial vmcore, or no vmcore at all, is captured.
- The cluster node is fenced before Kdump/Diskdump/Netdump finishes dumping the vmcore.
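As a rough aid for checking these symptoms against a configuration, the following Python sketch reads a cluster.conf and reports the post_fail_delay value along with whether each node's first fence method uses a fence_kdump device. The sample configuration, node names, device names, and the audit_fencing helper are all hypothetical, and the real fence devices would carry connection attributes that are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; real fencedevice entries need more attributes.
SAMPLE_CONF = """<cluster name="example" config_version="1">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1">
      <fence>
        <method name="kdump"><device name="kdump"/></method>
        <method name="power"><device name="ipmi-node1"/></method>
      </fence>
    </clusternode>
    <clusternode name="node2.example.com" nodeid="2">
      <fence>
        <method name="power"><device name="ipmi-node2"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_kdump" name="kdump"/>
    <fencedevice agent="fence_ipmilan" name="ipmi-node1"/>
    <fencedevice agent="fence_ipmilan" name="ipmi-node2"/>
  </fencedevices>
</cluster>"""


def audit_fencing(conf_xml):
    """Return (post_fail_delay, {node name: first method uses fence_kdump})."""
    root = ET.fromstring(conf_xml)
    # Map each fence device name to the agent that implements it.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    daemon = root.find("fence_daemon")
    delay = int(daemon.get("post_fail_delay", "0")) if daemon is not None else 0
    report = {}
    for node in root.iter("clusternode"):
        first_method = node.find("fence/method")  # first method in file order
        devices = first_method.findall("device") if first_method is not None else []
        report[node.get("name")] = any(
            agents.get(dev.get("name")) == "fence_kdump" for dev in devices
        )
    return delay, report


delay, report = audit_fencing(SAMPLE_CONF)
# Nodes with neither protection are the ones likely to be fenced mid-dump.
at_risk = [name for name, uses_kdump in report.items()
           if not uses_kdump and delay == 0]
print(delay, report, at_risk)
```

In this sample, node2.example.com has no fence_kdump method and the cluster has no post_fail_delay, so it is the node most likely to be fenced before its vmcore is complete.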