fence_kdump times out when cluster node names do not match hostnames
Environment
- Red Hat Enterprise Linux 6, 7, 8 (with the High Availability Add-on)
- fence_kdump
Issue
`fence_kdump` fails with a `Timer expired` message if the output of `crm_node -n` does not match the output of `hostname` or `hostname -s` for a node. `fence_kdump` can fail when a dedicated heartbeat IP address is used for each cluster node.
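To check whether a node is affected, the comparison described above can be sketched in shell. The helper name `node_name_matches_host` is hypothetical and is not part of kexec-tools or Pacemaker; it only compares the three strings it is given.

```shell
# Hypothetical helper (not shipped with any package): succeeds when the
# cluster node name equals the full or the short hostname.
node_name_matches_host() {
    cluster_name=$1
    full_host=$2
    short_host=$3
    [ "$cluster_name" = "$full_host" ] || [ "$cluster_name" = "$short_host" ]
}

# On a cluster node it could be called like this:
#   node_name_matches_host "$(crm_node -n)" "$(hostname)" "$(hostname -s)" \
#       || echo "cluster node name differs from hostname"
```

If the helper reports a mismatch on a node, that node is a candidate for the issue described here.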
Resolution
Red Hat Enterprise Linux 6
There are no plans to fix this issue in RHEL 6.
Red Hat Enterprise Linux 7
- The issue (bz1760811) has been resolved with errata RHBA-2020:3885 with the following package(s):
kexec-tools-2.0.15-51.el7, kexec-tools-anaconda-addon-2.0.15-51.el7, kexec-tools-eppic-2.0.15-51.el7 or later.
Red Hat Enterprise Linux 8
- The issue (bz1761602) has been resolved with errata RHBA-2020:4462 with the following package(s):
kexec-tools-2.0.20-34.el8 or later.
Workaround
Configure `fence_kdump_nodes` as described in the comments of `/etc/kdump.conf`:
# fence_kdump_nodes <node(s)>
# - List of cluster node(s) except localhost, separated by spaces,
# to send fence_kdump notifications to.
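As an illustration only, on a three-node cluster the entry on the first node might look like the fragment below. The names `node2-hb` and `node3-hb` are hypothetical; substitute the names or addresses of the other cluster nodes in your environment, and never list the local node.

```
# /etc/kdump.conf on node1 (hypothetical node names)
fence_kdump_nodes node2-hb node3-hb
```

Each node needs its own list that omits itself. After editing the file, restart the kdump service (for example, `systemctl restart kdump` on RHEL 7 and 8, or `service kdump restart` on RHEL 6) so that the kdump initrd is rebuilt with the new setting.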
Related issues
- Solution 2388711 - fence_kdump fails with "timeout after X seconds" in a RHEL 6 or 7 High Availability cluster with kexec-tools versions older than 2.0.14
- Solution 4498151 - fence_kdump times out when fence_kdump_nodes is not specified with kexec-tools version 2.0.15 or later
Root Cause
If fence_kdump_nodes is not configured explicitly in /etc/kdump.conf and a Pacemaker cluster is running when the kdump initrd is created, a dracut script automatically generates a fence_kdump_nodes list. For each node in the output of pcs cluster cib, the dracut script checks whether the node name matches the output of hostname or hostname -s. If so, that node is excluded from the fence_kdump_nodes list.
However, if the local node is known to Pacemaker by a name other than its hostname -- for example, if the hostname is node1 and the node name is node1-hb -- then the local node is not excluded from fence_kdump_nodes. When the local node is included in fence_kdump_nodes and the local node is executing the crash kernel, fence_kdump_send fails to send notifications to all the cluster nodes. fence_kdump can then fail with a Timer expired message.
kexec-tools-2.0.15-21.el7:
/usr/lib/dracut/modules.d/99kdumpbase/module-setup.sh:
    # retrieves fence_kdump nodes from Pacemaker cluster configuration
    get_pcs_fence_kdump_nodes() {
        local nodes

        # get cluster nodes from cluster cib, get interface and ip address
        nodelist=`pcs cluster cib | xmllint --xpath "/cib/status/node_state/@uname" -`

        # nodelist is formed as 'uname="node1" uname="node2" ... uname="nodeX"'
        # we need to convert each to node1, node2 ... nodeX in each iteration
        for node in ${nodelist}; do
            # convert $node from 'uname="nodeX"' to 'nodeX'
            eval $node
            nodename=$uname
            # Skip its own node name
            if [ "$nodename" = `hostname` -o "$nodename" = `hostname -s` ]; then
                continue
            fi
            nodes="$nodes $nodename"
        done

        echo $nodes
    }
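The effect of the hostname-only comparison can be reproduced without a cluster. The following is a minimal sketch, not the code from the errata packages: `filter_fence_kdump_nodes` is a hypothetical function that takes the local names to skip as its first argument and the node names as the remaining arguments, mimicking the loop above.

```shell
# Hypothetical sketch of the filtering loop. $1 is a space-separated list
# of names that identify the local node; the remaining arguments are the
# node names that would come from the cluster CIB.
filter_fence_kdump_nodes() {
    local_names=$1
    shift
    nodes=""
    for nodename in "$@"; do
        skip=0
        for name in $local_names; do
            if [ "$nodename" = "$name" ]; then
                skip=1
                break
            fi
        done
        if [ "$skip" -eq 0 ]; then
            nodes="${nodes:+$nodes }$nodename"
        fi
    done
    echo "$nodes"
}

# Only the hostname is treated as local: the local node node1-hb stays in the list.
filter_fence_kdump_nodes "node1" node1-hb node2-hb          # prints: node1-hb node2-hb

# The cluster node name is also treated as local: node1-hb is skipped.
filter_fence_kdump_nodes "node1 node1-hb" node1-hb node2-hb  # prints: node2-hb
```

The first call shows the failure mode described in this section; the second shows the list that a correct `fence_kdump_nodes` setting should contain on node1.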