Why did a RHEL High Availability cluster node reboot - and how can I prevent it from happening again?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 7 and newer with the High Availability or Resilient Storage Add-Ons

Issue

  • Why did a RHEL cluster node reboot?
  • Why did my node get fenced?
  • What is causing a node to stop responding in the cluster and get powered off by the cluster?
  • How should I troubleshoot node-fencing scenarios?

Resolution

Follow these steps to address the most common causes of fencing or "force-rebooting" of cluster nodes. See the Root Cause and Diagnostic Steps sections below for guidance on determining the specific cause of fencing.

INCREASE CLUSTER COMMUNICATION TIMEOUT - totem token

Reason: By default the cluster is deliberately sensitive, to ensure quick recovery of resources. Increasing the cluster's token timeout makes it less sensitive to short periods of unresponsiveness and allows more time for a node to check in before it is considered failed and rebooted.

Summary:

  • Edit totem.token in /etc/corosync/corosync.conf. Example:
totem {
     [...]
     token: 10000
}     

Then resync the configuration and reload corosync with:

# pcs cluster sync
# pcs cluster reload corosync
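As a sketch of the edit itself, the token value can be changed with sed before the sync and reload above. The paths and the 10000 ms value are illustrative, and the fallback file content below is fabricated for demonstration:

```shell
# Sketch: raise totem.token to 10 seconds on a scratch copy of the
# configuration (copy /etc/corosync/corosync.conf if present, otherwise
# build a minimal fabricated sample so the commands can be demonstrated).
cp /etc/corosync/corosync.conf /tmp/corosync.conf.new 2>/dev/null || \
cat > /tmp/corosync.conf.new <<'EOF'
totem {
    version: 2
    cluster_name: mycluster
    token: 1000
}
EOF

# Change the token value only inside the totem { } block
sed -i '/^totem {/,/^}/ s/token:.*/token: 10000/' /tmp/corosync.conf.new
grep 'token:' /tmp/corosync.conf.new
```

After copying the edited file into place on one node, pcs cluster sync distributes it to the other nodes and pcs cluster reload corosync applies it without downtime.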

Additional resources:


INCREASE REDUNDANCY OF NETWORK COMMUNICATIONS

Reason: A cluster is at risk of fencing nodes or becoming non-operational if a network disruption occurs on the only interface(s) available to the cluster. Increasing redundancy allows the cluster to keep communicating even with the failure of one network or one interface.

Summary:

  • If additional network interfaces on redundant networks are not already available - they must be set up for all servers in the cluster

  • Consider configuring any cluster interfaces to use bonding or teaming for additional redundancy. See additional resources below.

  • RHEL 8: Add the available link(s) to the cluster configuration. See the additional resources below for documentation. For example:

    # pcs cluster link add node1=10.0.5.11 node2=10.0.5.12 node3=10.0.5.31 options linknumber=5
    
  • RHEL 7: Downtime is required - the cluster must be stopped and reconfigured to enable more links.

    # pcs cluster stop --all
    
    • Edit /etc/corosync/corosync.conf, adding ring addresses for the additional interfaces to each node, and configuring the totem interfaces.
    • Start the cluster:
    # pcs cluster start --all
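For illustration, a RHEL 7 two-ring layout in /etc/corosync/corosync.conf might look like the following sketch. The node names, addresses, transport, and rrp_mode choice are placeholders to adapt to the environment:

```
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    rrp_mode: passive
}

nodelist {
    node {
        ring0_addr: node1.example.com
        ring1_addr: 10.0.5.11
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        ring1_addr: 10.0.5.12
        nodeid: 2
    }
}
```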
    

Additional resources:


INCREASE RESOURCE MONITORING TIMEOUTS

Reason: The cluster regularly monitors the highly available resources active on each node so that it can respond to failures. If a monitor operation does not return within its allowed timeout, recovery action is taken, which may eventually lead to fencing if the resource cannot be stopped on the node where it was running. Increasing this timeout gives monitoring operations more time to complete so that resource recovery or node fencing is not needed.

Summary:

  • Configure a global default operation timeout for all resource operations that do not have a timeout already specified in the configuration. For example:
# pcs resource op defaults timeout=240s
  • Alternatively - especially when individual resource operations already have timeouts configured - configure each resource to have a higher timeout on monitor operations. For example:
# Syntax:
# pcs resource update <resource> op <operation> timeout=<value>
# Example:
# pcs resource update ipaddr-corpLAN-appserver121 op monitor timeout=240s

Additional resources:


Root Cause

Follow these steps to check the most common sources that can help determine what may have happened to cause a node to be fenced or reboot. In some situations more advanced diagnostics may be required, but these checks should always be considered first.

CHECK IF THE CLUSTER CAN REBOOT NODES

For a cluster to reboot a node, the cluster must be configured with a fence mechanism that is able to reboot a node. Check the cluster's STONITH configuration to determine if the node is using a power-management STONITH method or an alternative STONITH method.

# pcs config
# pcs stonith config

Examples of power-management STONITH methods:

  • sbd
  • fence_ipmilan
  • fence_imm
  • fence_vmware_soap
  • fence_vmware_rest
  • fence_rhevm
  • fence_hpblade
  • fence_ilo*
  • fence_apc
  • fence_apc_snmp
  • fence_cisco_ucs
  • fence_cisco_mds
  • fence_emerson
  • fence_drac5
  • fence_ibmblade
  • fence_lpar
  • fence_xvm
  • fence_bladecenter
  • fence_scsi, when a fence_scsi script is linked in /etc/watchdog.d and the watchdog service is running

Examples of alternative STONITH methods that do not use power-management:

  • fence_scsi
  • fence_mpath
  • fence_brocade
  • fence_ifmib

If the cluster is not configured with any method that can power-off the node that rebooted, then the cluster was not responsible for the reboot. Investigate other possible causes.


CHECK FOR CLUES THAT RESOURCE FAILURES LED TO FENCING

Look for any recent resource failures:

# pcs cluster status

If there are resource stop-operation failures just prior to the time of the node reboot/fencing, this is very likely what triggered the fencing. A failed stop operation is expected to trigger fencing, so investigate the cause of the stop failure in /var/log/messages.
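Failed stop operations can be spotted quickly by searching the logs. This sketch greps a fabricated sample log line; the exact message wording varies between pacemaker versions, so adjust the pattern as needed:

```shell
# Fabricated one-line sample log for demonstration; on a real node,
# point the grep at /var/log/messages instead.
cat > /tmp/messages.sample <<'EOF'
May  7 18:36:50 node2 crmd[12801]:  error: Result of stop operation for myapp on node2: Timed Out
EOF

# Look for failed stop operations shortly before the fence time
grep -E 'stop operation.*(Timed Out|error|failed)' /tmp/messages.sample
```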


CHECK FOR CLUES THAT THE NETWORK WAS DISRUPTED

Check /var/log/messages on the node that rebooted, focusing on the period just before the reboot. A system boot usually starts by logging the Linux kernel version, similar to:

Oct 31 14:00:47 rhel7-node3 kernel: Linux version 3.10.0-693.11.6.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
Oct 31 14:00:47 rhel7-node3 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.11.6.el7.x86_64 root=/dev/mapper/rhel_dhcp182--144-root ro crashkernel=auto rd.lvm.lv=rhel_dhcp182-144/root rd.lvm.lv=rhel_dhcp182-144/swap console=tty0 console=ttyS0,115200 LANG=en_US.UTF-8

Look just before that. Does the node log any messages from corosync such as "A processor failed", like these:

May  7 18:37:11 node2 corosync[12718]:   [TOTEM ] A processor failed, forming new configuration.
May  7 18:37:23 node2 corosync[12718]:   [QUORUM] Members[1]: 1
May  7 18:37:23 node2 corosync[12718]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.

The first line means the node lost contact with one or more members, and the second line shows which members are still in contact - this node (node ID 1) being the only one left. If something like this is seen on the fenced node just before it reboots, it very likely means there was a network disruption. Investigate the network.

Compare these messages to any similar messages on other nodes to see what the membership looked like from their perspective. If multiple nodes around the cluster remained in contact but all lost contact with one specific node, the problem is likely on that node. If all nodes lost contact with each other, the problem may be on the network. If there are only two nodes and they both lose contact with each other, it is difficult to tell from the logs alone where the problem lies.
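To compare membership views, the corosync membership messages can be pulled out of each node's log. This sketch writes the sample lines above to a scratch file; in practice, run the grep against each node's /var/log/messages:

```shell
# Sample corosync lines from this article, written to a scratch file
# for demonstration.
cat > /tmp/node2.messages <<'EOF'
May  7 18:37:11 node2 corosync[12718]:   [TOTEM ] A processor failed, forming new configuration.
May  7 18:37:23 node2 corosync[12718]:   [QUORUM] Members[1]: 1
May  7 18:37:23 node2 corosync[12718]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
EOF

# Extract membership-change and quorum events for side-by-side comparison
grep -E '\[TOTEM \]|\[QUORUM\]' /tmp/node2.messages
```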

Additional resources:


CHECK FOR CLUES THAT THE NODE WAS STARVED FOR SYSTEM RESOURCES

Check /var/log/messages on the rebooted node just before it rebooted, similar to the previous section. Look for any messages from corosync indicating the node was blocked from scheduling processes, similar to:

Dec 15 00:10:39 node42 corosync[33376]:   [TOTEM ] Process pause detected for 14709 ms, flushing membership messages.
Dec 15 00:10:39 node42 corosync[33376]:   [MAIN  ] Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.

And for any message from pacemaker that indicates it wasn't able to write on /dev/shm, similar to:

Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: couldn't create file for mmap
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: qb_rb_open:/dev/shm/qb-6648-6547-10-EajrgZ/qb-request-attrd: No space left on device (28)
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: shm connection FAILED: No space left on device (28)
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: Error in connection setup (/dev/shm/qb-6648-6547-10-EajrgZ/qb): No space left on device (28)

If messages like these are seen, the node was under resource pressure heavy enough to prevent the cluster processes from running properly, and so it was fenced. Investigate the resource usage on the system and try to decrease the workload, or increase the cluster communication timeout (see the section in Resolution above).

It is also worth reviewing any resource-utilization data on the system, such as sar (/var/log/sa/sa<date>) or pcp (/var/log/pcp).
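The "not scheduled" message itself carries the numbers needed to judge severity. This sketch pulls the pause duration and corosync's reporting threshold out of the sample line above; on a real node, grep the line from /var/log/messages first:

```shell
# Parse the pause duration and threshold out of the corosync message
# (line copied from the sample above).
msg='Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.'
pause=$(echo "$msg" | sed -n 's/.*not scheduled for \([0-9.]*\) ms.*/\1/p')
threshold=$(echo "$msg" | sed -n 's/.*threshold is \([0-9.]*\) ms.*/\1/p')
echo "pause=${pause}ms threshold=${threshold}ms"
```

A pause well above the threshold, as here, means the scheduling stall was long enough that the rest of the cluster may have considered this node dead.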

Additional resources:


CHECK FOR CLUES THAT THE KERNEL PANICKED

Look in /var/crash for a folder that is timestamped around when the node rebooted. If one is found, it means the system's kernel crashed. Review the data contained in the directory for more clues.

The absence of a folder in /var/crash/ does not rule out the possibility of a kernel panic. Consider:

  • Is kdump configured and running? Check systemctl status kdump.service to see if it is running
  • Is kdump configured to dump to another location besides /var/crash/? Check /etc/kdump.conf
  • Did the cluster fence the node before it had time to dump any crash-evidence to the dump location? Consider setting up fence_kdump (see Diagnostic Steps below)
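Those checks can be scripted. The sketch below parses a fabricated kdump.conf sample to report where a dump would land; on a real node, read /etc/kdump.conf directly alongside systemctl status kdump.service:

```shell
# Fabricated kdump.conf content for demonstration; on a real node read
# /etc/kdump.conf instead.
cat > /tmp/kdump.conf.sample <<'EOF'
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
EOF

# Report the configured dump target - if "path" points somewhere other
# than /var/crash, look there for crash directories instead.
awk '/^path[ \t]/ {print "kdump dump path: " $2}' /tmp/kdump.conf.sample
```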

CHECK FOR HPE AUTOMATIC SERVER RECOVERY (ON HPE SERVERS)

If the rebooted system is an HPE ProLiant or blade server, it may have Automatic Server Recovery (ASR) enabled, which can reboot a system that is not performing as expected. If it is an HPE server, check whether this feature is enabled using hpasmcli, if it is installed:

# hpasmcli -s "show asr"

Or check the HPE server configuration otherwise.

Additional resources:


Diagnostic Steps

When the cause of a node reboot is still unknown, set up advanced diagnostics to capture more data in the event of a recurrence.

CONFIGURE PCP RESOURCE MONITORING

If pcp is not already configured to capture system-resource statistics, set it up; it is Red Hat's recommended resource-monitoring tool for RHEL systems.

In RHEL High Availability environments that are in need of diagnostic data for fence situations, Red Hat recommends a capture interval of 10 seconds and capturing per-process statistics.

NOTE: Be aware this may generate a large volume of data in /var/log/pcp - typically 300M-400M per day with the recommended settings below. Monitor disk-space usage and if it becomes a problem add disk space, reconfigure pcp capture settings, or stop the services if there is an imminent risk of filling up filesystems.

Summary:

  • Install pcp-zeroconf
# yum install pcp-zeroconf

This package enables pmcd and sets up data capture at a 10-second interval.

Upon any further incidents, sosreport should collect the pcp data files for further review. See section below for inspecting pcp data.

Additional resources:


REVIEW PCP DATA AFTER A FENCING INCIDENT

If PCP is capturing data on the host, review it for signs of heavy resource usage. Such heavy usage can starve the cluster processes for the CPU time they need and can lead to fencing.

Summary:

  • Find the pmlogger archive that covers the time period leading up to the reboot. These files are in /var/log/pcp/pmlogger/<hostname>/ on the system, or at the same relative path (var/log/pcp/pmlogger/<hostname>/) within a sosreport archive. pmdumplog can print the time period captured by a pmlogger archive (the time period is also indicated in the filename):
# pmdumplog -L /var/log/pcp/pmlogger/node1.example.com/20191031.16.22.0 
Log Label (Log Format Version 2)
Performance metrics from host node1.example.com
    commencing Thu Oct 31 16:22:49.286425 2019
    ending     Thu Oct 31 16:26:59.426811 2019
Archive timezone: EDT+4
PID for pmlogger: 1897
  • Display a listing of which metrics were being recorded in an archive:
# pminfo -a /var/log/pcp/pmlogger/node1.example.com/20191031.16.30.0
xfs.log.writes                  
xfs.log.blocks                                                       
xfs.log.noiclogs    
[...]
  • Play back any useful statistic from the archive over the period leading up to the reboot, at whatever interval is helpful for the investigation. A 2-second interval over the few minutes just before the reboot is a good starting point; adjust if there is a need to go back further and look at more data. For example, to display the kernel.all.load metric at a 2-second interval over a 4-minute window:
# pmval -t 2sec -f 3 kernel.all.load -S @"2019-10-31 16:30" -T @"2019-10-31 16:34" -a /var/log/pcp/pmlogger/cs-rh7-1.gsslab.rdu2.redhat.com/20191031.16.30                                                                                                                 
                                                                                                                                                                                                                                                                                            
metric:    kernel.all.load                                                                                                                                                                                                                                                                  
archive:   /var/log/pcp/pmlogger/cs-rh7-1.gsslab.rdu2.redhat.com/20191031.16.30                                                                                                                                                                                                             
host:      cs-rh7-1.gsslab.rdu2.redhat.com                                                                                                                                                                                                                                                  
start:     Thu Oct 31 16:30:46 2019                                                                                                                                                                                                                                                         
end:       Thu Oct 31 16:34:00 2019                                                                                                                                                                                                                                                         
semantics: instantaneous value                                                                                                                                                                                                                                                              
units:     none                                       
samples:   97                                         
interval:  2.00 sec                                   
16:30:46.645  No values available                     
                                                                       
                 1 minute      5 minute     15 minute 
16:30:48.645        1.430         0.770         0.470 
16:30:50.645        1.430         0.770         0.470 
16:30:52.645        1.430         0.770         0.470 
16:30:54.645        1.430         0.770         0.470 
  • Investigations looking for high resource consumption can start with this base set of statistics to look for obvious culprits:
vfs.files.count
vfs.files.free
vfs.files.max
disk.all.read
disk.all.write
disk.all.blkread
disk.all.blkwrite
disk.all.total
mem.util.used                                                                                                                                                                                                                                                                               
mem.util.free                                                                                                                                                                                                                                                                               
mem.util.shared                                                                                                                                                                                                                                                                             
mem.util.bufmem                                                                                                                                                                                                                                                                             
mem.util.cached                                                                                                      
mem.util.swapTotal
mem.util.swapFree
mem.util.dirty
mem.util.writeback
mem.util.hugepagesTotalBytes
mem.util.hugepagesFreeBytes
network.interface.total.bytes
network.interface.total.packets
network.interface.total.errors
network.interface.total.drops
network.interface.total.mcasts
kernel.all.runnable
kernel.all.running                                                                                                                                                                                                                                                                          
kernel.all.blocked                                                                                                                                                                                                                                                                          
kernel.all.cpu.user                                                                                                                                                                                                                                                                         
kernel.all.cpu.nice                                                                                                                                                                                                                                                                         
kernel.all.cpu.sys                                                                                                                                                                                                                                                                          
kernel.all.cpu.idle                                                                                                                                                                                                                                                                         
kernel.all.cpu.intr  
  • pcp atop can also be useful for seeing process-statistics at a given time.
# pcp -t 2sec -S @"2019-10-31 16:30" -T @"2019-10-31 16:34" -a /var/log/pcp/pmlogger/node1.example.com/20191031.16.30 atop
  • With any metric being investigated, it may be helpful to play back data from older archives captured when no problem was occurring, to understand the "baseline" performance of the system. Compare that baseline to the time of the event to determine whether any metric seems unusual. Just remember: a spike in a metric does not necessarily mean it caused the fence event; further investigation may be needed into whether that level of load is capable of causing problems in the cluster. Attempting to reproduce that load while watching for problems can be a useful strategy.

INSTALL spausedd

Reason: spausedd is a lightweight daemon that detects when the system's CPU resources are overcommitted in a way that could prevent processes from being scheduled. Such overcommit scenarios are a common cause of a node becoming unresponsive and getting fenced, and they often leave no clues in the logs before the node is powered off. spausedd logs a message when these conditions are detected, which can assist future investigations into fencing.

Summary:

  • Install spausedd on all nodes
# yum install spausedd
  • Start and enable spausedd on all nodes
# systemctl enable spausedd
# systemctl start spausedd

Additional resources:

CONFIGURE KDUMP

Reason: If the kernel "panics" (crashes), the kdump service must be running in order for the evidence of that crash to be saved to disk. Without it running, a server that crashes because of a kernel panic may leave no clues behind.

Summary:

Make sure kdump is configured to capture kernel-crash cores and logs. Check whether it is running with:

# systemctl status kdump.service

It can also be useful to simulate a kernel panic on a test system to verify that a core is dumped.

If kdump is not running, set it up.

Additional resources:


CONFIGURE fence_kdump TO ALLOW CAPTURING OF KERNEL CRASH DATA

Reason: Even with kdump configured, fencing by the cluster may interrupt a crashed-system's attempt to dump crash data to disk. Configuring fence_kdump can allow that crash-data to be saved without interruption.
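As an illustrative sketch only (option values and node names are placeholders; consult the fence_kdump documentation for the exact configuration on a given release), the kdump side of the setup is a pair of directives in /etc/kdump.conf telling the crashing node to notify its peers:

```
# /etc/kdump.conf additions (illustrative)
fence_kdump_args -p 7410 -f auto -c 0 -i 10
fence_kdump_nodes node1.example.com node2.example.com
```

On the cluster side, a fence_kdump stonith device is typically configured as a first fencing level, with the power-management method as the second level, so a node that is dumping a vmcore is not powered off mid-dump.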

Additional resources:


CONFIGURE SERIAL CONSOLE REDIRECTION/MONITORING

Reason: Many server vendors, virtualization platforms, and cloud vendors offer remote-console viewing capabilities. This can be useful to monitor a cluster node that is rebooting for unknown reasons, as the kernel may print some useful clue to the console before it crashes or gets fenced. The console must be configured to redirect to an appropriate serial output device in order to capture or remotely view this output.

Summary:

  • Consult with the hardware, virtualization, or cloud vendor for instructions on remote logging or remote capture of a server's serial console
  • Configure the RHEL systems to redirect their console output to the serial port (see link below)
  • Monitor the serial console remotely or set up a capture of the output if possible through the platform, and wait for an event. Check the output at the top to see if there is evidence of a crash or any other clues before the system reboots.
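On RHEL, console redirection is typically done through the kernel command line. The fragment below is a sketch; the serial device and baud rate depend on the platform:

```
# /etc/default/grub (illustrative - ttyS0 and the baud rate vary by platform)
GRUB_CMDLINE_LINUX="... console=tty0 console=ttyS0,115200"

# Then regenerate the GRUB configuration and reboot:
# grub2-mkconfig -o /boot/grub2/grub.cfg
```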

Additional resources:



This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.