How do I configure fence_kdump in a Red Hat Pacemaker cluster?

Solution Verified

Environment

  • Red Hat Enterprise Linux 6, 7, 8 or 9 (with the High Availability Add-on)
  • Pacemaker

Issue

  • How do I configure a fence_kdump stonith device in a Red Hat Pacemaker cluster?

Resolution

  1. Ensure that the kdump service is configured and running on all cluster nodes. The Kdump Helper web application can assist in configuring kdump.

    RHEL 6

     # service kdump status
     Kdump is operational
    

    RHEL 7, 8 or 9

     # systemctl is-active kdump
     active
    
  2. Ensure that the fence_kdump fence agent is installed on all cluster nodes.

    RHEL 6

     # yum install fence-agents
    

    RHEL 7, 8 or 9

     # yum install fence-agents-kdump
    
  3. Create a fence_kdump STONITH device in the cluster. Run the following command from one cluster node only.

     # pcs stonith create kdump fence_kdump pcmk_reboot_action="off" pcmk_host_list="node-1 node-2"
    

    Note: In some older versions, additional parameters may be needed if the STONITH device fails to start. For more information, see solution A fence_kdump STONITH device fails to start and fencing fails in RHEL 6 or 7 pacemaker clusters.

  4. Configure STONITH levels so that fence_kdump is the primary fencing device and the existing stonith device becomes secondary. More information on STONITH levels can be found in solution How to configure/manage STONITH 'levels' in RHEL cluster with pacemaker?

     # pcs stonith level add 1 node-1 kdump
     # pcs stonith level add 1 node-2 kdump
     # pcs stonith level add 2 node-1 fence-node-1
     # pcs stonith level add 2 node-2 fence-node-2
    

    Below is an example of how the resulting configuration will look. Note that fence_xvm is used as an example; your other STONITH device(s) may differ.

     # pcs config
     ...
     Stonith Devices:
      Resource: fence-node-1 (class=stonith type=fence_xvm)
       Operations: monitor interval=30s (fence-node-1-monitor-interval-30s)
      Resource: fence-node-2 (class=stonith type=fence_xvm)
       Attributes: delay=10
       Operations: monitor interval=30s (fence-node-2-monitor-interval-30s)
      Resource: kdump (class=stonith type=fence_kdump)
       Attributes: pcmk_reboot_action=off pcmk_host_list="node-1 node-2"
       Operations: monitor interval=60s (kdump-monitor-interval-60s)
     Fencing Levels:
      Node: node-1
       Level 1 - kdump
       Level 2 - fence-node-1
      Node: node-2
       Level 1 - kdump
       Level 2 - fence-node-2
     ...
    
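    Conceptually, Pacemaker attempts fencing levels in ascending order: every device in a level must succeed for fencing to be confirmed at that level, otherwise the next level is tried. A minimal Python sketch of that fallback logic follows (the device functions are hypothetical stand-ins, not the real fence agents):

    ```python
    # Sketch of level-based fencing fallback (illustrative, not Pacemaker's code).
    def fence_node(levels):
        """levels: list of (level_number, [device_fn, ...]) sorted by level.

        Each device_fn returns True when that device successfully fenced the node."""
        for level, devices in levels:
            if all(device() for device in devices):
                return level          # fencing confirmed at this level
        return None                   # every level exhausted: fencing failed

    # Example for node-1: fence_kdump times out (no dump in progress), so
    # Pacemaker falls through to the power-fencing device at level 2.
    kdump_timed_out = lambda: False   # hypothetical level-1 outcome
    power_fence_ok = lambda: True     # hypothetical level-2 outcome

    result = fence_node([(1, [kdump_timed_out]), (2, [power_fence_ok])])
    print(result)  # → 2
    ```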
  5. On all nodes (including remote nodes), allow port 7410/udp through the firewall. This port must be open on every node so that it can receive the notification sent by a node that has booted into the kdump kernel to collect a vmcore. Once the surviving node running fence_kdump receives that message, it sends its own reply back to the node in the kdump kernel.

    RHEL 6

     # iptables -I INPUT -p udp --dport 7410 -j ACCEPT
     # service iptables save; service iptables restart
    

    RHEL 7, 8 or 9

     # firewall-cmd --add-port=7410/udp
     # firewall-cmd --add-port=7410/udp --permanent
    
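    The message flow this port carries can be sketched in a few lines of Python: the surviving node's fence_kdump listens on 7410/udp, while fence_kdump_send in the kdump kernel sends datagrams to it. The payload and loopback addresses below are illustrative only, not the real fence_kdump wire protocol:

    ```python
    # Illustrative sketch of the fence_kdump notification over 7410/udp;
    # real messages are produced by fence_kdump_send, not this script.
    import socket
    import threading

    PORT = 7410                      # fence_kdump's default listening port
    received = []
    bound = threading.Event()

    def surviving_node():
        # fence_kdump blocks waiting for a message from the crashed node,
        # giving up only after its configured timeout.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.bind(("127.0.0.1", PORT))
            bound.set()
            sock.settimeout(5)
            try:
                data, _addr = sock.recvfrom(1024)
                received.append(data)
            except socket.timeout:
                pass

    listener = threading.Thread(target=surviving_node)
    listener.start()
    bound.wait(5)

    # Crashed node: fence_kdump_send re-sends its message periodically
    # (every 10 seconds by default) while the vmcore is being written.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"dump in progress", ("127.0.0.1", PORT))

    listener.join()
    print("received valid message" if received else "timeout waiting for message")
    ```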
  6. Rebuild an initramfs image file for kdump on all cluster nodes.

    RHEL 6

     # touch /etc/kdump.conf
     # service kdump restart
    

    RHEL 7, 8 or 9

     # touch /etc/kdump.conf
     # systemctl restart kdump
    
    • Note: The touch /etc/kdump.conf command updates the file's modification time. On restart, the kdump service rebuilds the initramfs image when it detects that the configuration file is newer than the existing image.
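    The effect of touching the configuration file can be illustrated with a short, self-contained Python sketch using temporary files (conceptual only; this is not the actual check the kdump service performs):

    ```python
    # Conceptual sketch: a rebuild is needed when the config file is newer
    # than the existing kdump initramfs image (the "touch" effect).
    import os
    import tempfile
    import time

    with tempfile.TemporaryDirectory() as workdir:
        conf = os.path.join(workdir, "kdump.conf")
        image = os.path.join(workdir, "initramfs-kdump.img")

        open(image, "w").close()      # pretend the initramfs image already exists
        time.sleep(0.1)
        open(conf, "w").close()       # "touch": config is now newer than the image

        needs_rebuild = os.path.getmtime(conf) > os.path.getmtime(image)

    print(needs_rebuild)  # → True
    ```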
  7. Ensure that the initramfs image file contains the fence_kdump_send binary and the hosts file. Note that for Red Hat Enterprise Linux 7, a known issue has been reported where the hosts file is not included in the initramfs file.

     # lsinitrd /boot/initramfs-$(uname -r)kdump.img | egrep "fence|hosts"
    
  8. Test the configuration by crashing one of the nodes using the command below.

     # echo c > /proc/sysrq-trigger
    

    On the crashed node, you may see the kdump progress on a graphical or serial console. On the cluster node where the fencing was triggered, you should see in the logs that fence_kdump is waiting for a message from that node.

     fencing node-1
     fence_kdump[XXXX]: waiting for message from '1.1.1.1'
    

    Once the dump on the crashed node is started, it will send messages every 10 seconds. The fencing node should then report that the message is received and confirm the fencing as successful.

     fence_kdump[XXXX]: received valid message from '1.1.1.1'
     fence node-1 success
    

Note: Using fence_kdump can increase failover time (up to the configured timeouts) if the failing node is unable to generate a kernel dump, or if the kernel dump process is unable to communicate with the surviving cluster nodes.

Note: If you encounter a timeout while waiting to receive the message in RHEL 6 or 7 and your kexec-tools package is older than version 2.0.14-17.el7, see the solution fence_kdump fails with "timeout after X seconds" in a RHEL 6 or 7 High Availability cluster for further information.

NOTE: If the cluster node is an AWS instance using the Xen hypervisor, kdump is not supported: see XEN/AWS kdump support for RHEL.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.