fence_kdump fails with "timeout after X seconds" in a RHEL 6 or 7 High Availability cluster with kexec-tools versions older than 2.0.14
Environment
- Red Hat Enterprise Linux 6 or 7 (with the High Availability Add-on)
- kexec-tools package versions prior to 2.0.14-17.el7
- fence_kdump
Issue
- I need to capture a vmcore from a cluster node, but fence_kdump times out every time that node crashes.
- fence_kdump is failing with "timeout after 60 seconds" and the node gets fenced before the core is done.
- If I test fence_kdump by panicking a node, fence_kdump fails with a time out error. If I take the node out of the cluster and panic it, it dumps a core successfully.
Resolution
A fix is included in the errata listed below. With the fix, the cluster node list is fetched from the pacemaker configuration when fence_kdump_nodes is not specified in kdump.conf.
Red Hat Enterprise Linux 7
Upgrade to [`kexec-tools-2.0.14-17.el7`](/errata/RHBA-2017:2300) or newer.
The fix was also backported to kexec-tools-2.0.7-50.el7_3.2 and kexec-tools-2.0.7-38.el7_2.2.
Workaround for prior versions of kexec-tools
The fence_kdump_nodes option should be specified in /etc/kdump.conf as shown below. It takes a space-separated list of the IP addresses or hostnames of the other cluster nodes, excluding the IP address or hostname of the local node.
Example of fence_kdump_nodes configuration in /etc/kdump.conf for a three node cluster consisting of nodes node1, node2 and node3:
On node1:
fence_kdump_nodes node2 node3
On node2:
fence_kdump_nodes node1 node3
On node3:
fence_kdump_nodes node1 node2
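As a rough sketch, the per-node line can be derived from one shared node list by dropping the local node. The helper name fence_kdump_nodes_line is hypothetical and is not part of kexec-tools; the node names are the example names used above.

```shell
# Hypothetical helper (not shipped with kexec-tools): print the
# fence_kdump_nodes line for one node.
# Usage: fence_kdump_nodes_line LOCAL_NODE ALL_NODES...
fence_kdump_nodes_line() {
    self=$1
    shift
    others=""
    for node in "$@"; do
        # kdump.conf must list only the other cluster nodes.
        if [ "$node" != "$self" ]; then
            others="${others:+$others }$node"
        fi
    done
    printf 'fence_kdump_nodes %s\n' "$others"
}

# Example for node1 in the three node cluster described above.
fence_kdump_nodes_line node1 node1 node2 node3
```

The first argument must be written exactly as the local node appears in the list, since the comparison is a plain string match. Review the printed line before adding it to /etc/kdump.conf.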
After changing /etc/kdump.conf, the kdump image must be regenerated. This can be done by restarting the kdump service with the command below, which will take some time.
systemctl restart kdump
You can verify that a new kdump image was generated by checking the modification time of the kdump image file in the /boot directory.
# ls -l /boot/*kdump*
If the kdump image was not recreated after restarting the kdump service, you can add the option force_rebuild 1 to /etc/kdump.conf and restart the kdump service again. After verifying that a new kdump image was generated, comment out the force_rebuild 1 option in /etc/kdump.conf to prevent the image from being regenerated on every kdump service start.
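The timestamp check can also be scripted. The sketch below compares the modification times of the two files; the helper name kdump_image_is_newer is hypothetical, and the image path in the usage comment is an assumption that should be confirmed against the ls output on your system.

```shell
# Hypothetical helper: succeed only if the kdump image exists and is
# newer than the configuration file.
# Usage: kdump_image_is_newer CONF_FILE IMAGE_FILE
kdump_image_is_newer() {
    conf=$1
    image=$2
    [ -e "$image" ] && [ "$image" -nt "$conf" ]
}

# Example usage (image file name assumed, verify it with ls first):
#   if kdump_image_is_newer /etc/kdump.conf "/boot/initramfs-$(uname -r)kdump.img"; then
#       echo "kdump image was rebuilt after the last kdump.conf change"
#   else
#       echo "kdump image is older than kdump.conf, rebuild may be needed"
#   fi
```

A failing check only means the image is older than the configuration file or is missing. It does not confirm that the image contents are correct.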
Related issues
- Solution 4498151 - fence_kdump times out when fence_kdump_nodes is not specified with kexec-tools version 2.0.15 or later
- Solution 4499751 - fence_kdump times out when cluster node names do not match hostnames
Root Cause
- This knowledge solution is intended as a guide to diagnosing this problem with fence_kdump for the purposes of resolving it. The cause of such an issue can vary, so please follow the Diagnostic Steps below to identify the cause of the problem, then follow the suggestions given, search for additional content in the Red Hat Customer Portal relating to those findings, or consult with Red Hat Support for additional guidance.
- Bug 1444688 has been opened for RHEL 7, in which the cluster node list is not obtained from the pacemaker configuration.
Diagnostic Steps
- Determine if kdump is configured on the server in question. If not, then configure it.
  - Look for crashkernel on the kernel command line with cat /proc/cmdline. This should be present for kexec to work properly.
  - Check that the kdump service is enabled to start on boot, and has successfully started.
  - Consult the setup guide for kdump to confirm all necessary aspects have been configured.
- Capture the kdump initrd from the server that is panicking, and analyze its contents to see if it includes fence_kdump_send (and possibly go further to dig into its scripts and see if fence_kdump_send is being called to communicate with the correct nodes).
- Stop the cluster on one node, then test panicking that node to see whether kexec boots and dumps a core on the node that was stopped. The goal is to determine if kdump works when the cluster is not interfering by fencing it. If kdump doesn't work to begin with, then fence_kdump should be expected to continue to fail, and more work is needed to get kdump operational.
  - Prior to panicking the node, it is useful to connect to the console of that server so the output can be watched. If there is a problem with kexec loading or dumping a core, then it should be reflected here. Consult with other content in the Red Hat Customer Portal, or with Red Hat Support, if you are unsure about how to connect to the console.
  - Test kdump by panicking the node with SysRq+C.
  - If the server shows a kernel panic message and backtrace on the screen but nothing else happens for several minutes, then kexec did not work and the server should be rebooted by power-cycling it.
  - After the system has finished its panic test and hopefully booted kexec properly and dumped a core, let the server boot back up, log into it, then go and check the dump location. By default this is /var/crash on that server, but it can be configured for other locations in /etc/kdump.conf, so check there to see where the dump would be. The goal is to see if there is a directory in this location corresponding to the date/time of the panic that was tested. If there is, and there is a vmcore in that directory, then we know that kexec did its job.
- If kexec/kdump is confirmed to be working, yet fence_kdump still does not succeed, then there is most likely a communication issue between the host that issues the fence_kdump command and the host that is panicking.
  - Check any firewalls between these hosts to see if something could be blocking the communications. The default port is 7410 over UDP.
  - Determine if there are any special settings in the panicking node's /etc/kdump.conf that could interfere with the ability of that host to bring up a network connection in kexec.
  - Try to diagnose if any special steps or drivers or similar are needed to enable networking for this host in kexec. This may require additional investigation with the help of Red Hat Support.
    + While the host is panicking during tests, try to connect to its listen port (7410 by default) using nc or other utilities to test if there is connectivity.
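The first checks in the steps above can be sketched as small shell helpers. The function names are hypothetical. The sketch assumes lsinitrd (from dracut) and systemctl (RHEL 7) are available, and the kdump image path in the comments is an assumption to verify on your system.

```shell
# Succeed if a kernel command line string contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the kdump service is enabled at boot and currently active.
# Assumes a systemd-based release such as RHEL 7.
kdump_service_ok() {
    systemctl is-enabled --quiet kdump && systemctl is-active --quiet kdump
}

# Succeed if the listing of the given kdump initrd mentions fence_kdump_send.
# Assumes lsinitrd from dracut is installed.
initrd_has_fence_kdump_send() {
    lsinitrd "$1" | grep -q fence_kdump_send
}

# Safe to run anywhere: only reads /proc/cmdline if it is present.
if [ -r /proc/cmdline ] && ! has_crashkernel "$(cat /proc/cmdline)"; then
    echo "crashkernel= is missing from the kernel command line"
fi

# Example usage on a cluster node (image file name assumed):
#   kdump_service_ok || echo "kdump service is not enabled and active"
#   initrd_has_fence_kdump_send "/boot/initramfs-$(uname -r)kdump.img" \
#       || echo "fence_kdump_send not found in the kdump initrd"
#
# SysRq+C can be triggered from a shell as shown below. This crashes the
# node immediately, so only run it during a planned test:
#   echo 1 > /proc/sys/kernel/sysrq
#   echo c > /proc/sysrq-trigger
```

These helpers only cover the local checks. Connectivity between the panicking node and the fencing node still has to be tested separately, as described in the last step.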