fence_kdump fails with "timeout after X seconds" in a RHEL 6 or 7 High Availability cluster with kexec-tools versions older than 2.0.14


Environment

  • Red Hat Enterprise Linux 6 or 7 (with the High Availability Add-on)
  • kexec-tools package versions prior to 2.0.14-17.el7
  • fence_kdump

Issue

  • I need to capture a vmcore from a cluster node, but fence_kdump times out every time that node crashes.
  • fence_kdump is failing with "timeout after 60 seconds" and the node gets fenced before the core dump completes.
  • If I test fence_kdump by panicking a node, fence_kdump fails with a timeout error. If I take the node out of the cluster and panic it, it dumps a core successfully.

Resolution

A fix was included in the errata below: when fence_kdump_nodes is not specified in /etc/kdump.conf, the cluster node list is now fetched from the pacemaker configuration.

Red Hat Enterprise Linux 7

Upgrade to [`kexec-tools-2.0.14-17.el7`](/errata/RHBA-2017:2300) or newer.

The fix was also backported to kexec-tools-2.0.7-50.el7_3.2 and kexec-tools-2.0.7-38.el7_2.2.
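On a RHEL 7 system this is the standard package-update workflow (shown as a sketch; run as root):

```shell
# Update kexec-tools to the fixed build and confirm the installed version.
yum update kexec-tools
rpm -q kexec-tools    # expect 2.0.14-17.el7 or later
```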

Workaround for prior versions of kexec-tools

The fence_kdump_nodes option should be specified in /etc/kdump.conf as shown below. It takes as its argument a space-separated list of the IP addresses or hostnames of the other cluster nodes, excluding the local node.

Example fence_kdump_nodes configuration in /etc/kdump.conf for a three-node cluster consisting of node1, node2, and node3:
On node1:

fence_kdump_nodes node2 node3

On node2:

fence_kdump_nodes node1 node3

On node3:

fence_kdump_nodes node1 node2
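The exclusion rule above (each node lists every peer except itself) can be sketched generically; the node names are the example ones used in this article:

```shell
# Sketch: for each node in an example cluster, print the fence_kdump_nodes
# line it would need (each node lists every peer except itself).
nodes="node1 node2 node3"
for me in $nodes; do
  peers=$(printf '%s\n' $nodes | grep -vx "$me" | tr '\n' ' ')
  echo "$me: fence_kdump_nodes ${peers% }"
done
```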

After changing /etc/kdump.conf, the kdump image must be regenerated. This can be done by restarting the kdump service with the command below (this will take some time). On RHEL 6, use service kdump restart instead.

systemctl restart kdump

You can verify that a new kdump image was generated by checking the timestamp of the image file in the /boot directory.

# ls -l /boot/*kdump*

If the kdump image was not recreated after restarting the kdump service, add the option force_rebuild 1 to /etc/kdump.conf and restart the kdump service again. Once you have verified that a new kdump image was generated, comment out the force_rebuild 1 option in /etc/kdump.conf to prevent the image from being rebuilt on every kdump service start-up.
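The force_rebuild cycle can be dry-run on a scratch copy of the file, as sketched below; on a real system you would edit /etc/kdump.conf itself and restart the kdump service between the two steps:

```shell
# Dry-run of the force_rebuild toggle on a scratch copy of kdump.conf
# (on the real system, operate on /etc/kdump.conf instead).
conf=$(mktemp)
printf 'path /var/crash\n' > "$conf"
echo 'force_rebuild 1' >> "$conf"                       # step 1: force the next rebuild
# (restart the kdump service here and confirm a new image appeared in /boot)
sed -i 's/^force_rebuild 1/# force_rebuild 1/' "$conf"  # step 2: comment it back out
grep '^# force_rebuild' "$conf"                         # prints: # force_rebuild 1
rm -f "$conf"
```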

Root Cause

  • This knowledge solution is intended as a guide to diagnosing this problem with fence_kdump for the purposes of resolving it. The cause of such an issue can vary, so please follow the Diagnostic Steps below to identify the cause of the problem, then follow the suggestions given, search for additional content in the Red Hat Customer Portal relating to those findings, or consult with Red Hat Support for additional guidance.
  • Bug 1444688 was opened for RHEL 7 to address the cluster node list not being obtained from the pacemaker configuration.

Diagnostic Steps

  • Determine if kdump is configured on the server in question. If not, then configure it.

    • Look for crashkernel on the kernel command line with cat /proc/cmdline. This should be present for kexec to work properly.
    • Check that the kdump service is enabled to start on boot, and has successfully started.
    • Consult the setup guide for kdump to confirm all necessary aspects have been configured.
  • Capture the kdump initrd from the server that is panicking, and analyze its contents to see whether it includes fence_kdump_send (and, if needed, dig into its scripts to confirm fence_kdump_send is being called to communicate with the correct nodes).

  • Stop the cluster on one node, then test panicking that node to see whether kexec boots and dumps a core on it. The goal is to determine whether kdump works when the cluster is not interfering by fencing the node. If kdump does not work to begin with, fence_kdump should be expected to continue to fail, and more work is needed to get kdump operational.

    • Prior to panicking the node, it is useful to connect to the console of that server so the output can be watched. If there is a problem with kexec loading or dumping a core, then it should be reflected here. Consult with other content in the Red Hat Customer Portal, or with Red Hat Support, if you are unsure about how to connect to the console.
    • Test kdump by panicking the node with SysRq-C (for example, echo c > /proc/sysrq-trigger).
    • If the server shows a kernel panic message and backtrace on the screen but nothing else happens for several minutes, then kexec did not work and the server should be rebooted by power-cycling it.
    • After the panic test has (hopefully) booted kexec and dumped a core, let the server boot back up, log in, and check the dump location. By default this is /var/crash on that server, but another location can be configured in /etc/kdump.conf, so check there to see where the dump would be written. Look for a directory in that location corresponding to the date and time of the test panic; if it exists and contains a vmcore, then kexec did its job.
  • If kexec/kdump is confirmed to be working, yet fence_kdump still does not succeed, then there is most likely a communication issue between the host that issues the fence_kdump command and the host that is panicking.

    • Check any firewalls between these hosts to see if something could be blocking the communication. The default port is 7410 over UDP.
    • Determine whether there are any special settings in the panicking node's /etc/kdump.conf that could interfere with that host's ability to bring up a network connection in kexec.
    • Try to diagnose if any special steps or drivers or similar are needed to enable networking for this host in kexec. This may require additional investigation with the help of Red Hat Support.
      + While the host is panicking during tests, try to connect to its listen port (7410 by default) using nc or other utilities to test if there is connectivity.
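If a host-based firewall turns out to be the blocker, the default fence_kdump port can be opened as follows (a sketch assuming firewalld on RHEL 7; on RHEL 6, add an equivalent iptables rule instead):

```shell
# Permit fence_kdump_send messages on the default port, 7410/UDP,
# then reload the firewall so the permanent rule takes effect.
firewall-cmd --permanent --add-port=7410/udp
firewall-cmd --reload
```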

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.