Why did a RHEL High Availability cluster node reboot - and how can I prevent it from happening again?
Environment
- Red Hat Enterprise Linux (RHEL) 7 and newer with the High Availability or Resilient Storage Add-Ons
Issue
- Why did a RHEL cluster node reboot?
- Why did my node get fenced?
- What is causing a node to stop responding in the cluster and get powered off by the cluster?
- How should I troubleshoot node-fencing scenarios?
Resolution
Follow these steps to address the most common causes of fencing or "force-rebooting" of cluster nodes. See the Root Cause and Diagnostic Steps sections below for guidance on determining the specific cause of fencing.
INCREASE CLUSTER COMMUNICATION TIMEOUT - totem token
Reason: The cluster is very sensitive by default, to ensure quick recovery of resources. Increasing the cluster's token timeout makes it less sensitive to short periods of unresponsiveness and allows more time for a node to check in before it needs to be rebooted.
Summary:
- Edit totem.token in /etc/corosync/corosync.conf. Example:
totem {
[...]
token: 10000
}
Then resync the configuration and reload corosync with:
# pcs cluster sync
# pcs cluster reload corosync
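After the reload, it can be worth confirming the token value the cluster will actually use. The sketch below is a minimal, hedged example: it parses a sample corosync.conf-style fragment written to a temporary file, so the sample values are illustrative only - on a real node, point CONF at /etc/corosync/corosync.conf instead.

```shell
# Sketch: extract the totem token value from a corosync.conf-style file.
# A throwaway sample fragment is used here for illustration; on a real
# node, set CONF=/etc/corosync/corosync.conf instead.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
totem {
    version: 2
    cluster_name: mycluster
    token: 10000
}
EOF
# Print the configured token timeout in milliseconds
awk -F': *' '$1 ~ /token$/ {print $2}' "$CONF"
```

For the sample above this prints 10000. On a running corosync 2.x or later node, the value currently in effect can also be queried with corosync-cmapctl -g runtime.config.totem.token (key names can vary between corosync versions).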
Additional references:
- How to change totem token timeout value in a RHEL 5, 6, 7, or 8 High Availability cluster?
- RHEL 7 - HIGH AVAILABILITY ADD-ON REFERENCE: 4.2. Configuring Timeout Values for a Cluster
INCREASE REDUNDANCY OF NETWORK COMMUNICATIONS
Reason: A cluster is at risk of fencing nodes or becoming non-operational if a network disruption occurs on the only interface(s) available to the cluster. Increasing redundancy allows the cluster to keep communicating even with the failure of one network or one interface.
Summary:
- If additional network interfaces on redundant networks are not already available, they must be set up for all servers in the cluster.
- Consider configuring any cluster interfaces to use bonding or teaming for additional redundancy. See the additional resources below.
- RHEL 8: Add the available link(s) to the cluster configuration. See the additional resources below for documentation. For example:
# pcs cluster link add node1=10.0.5.11 node2=10.0.5.12 node3=10.0.5.31 options linknumber=5
- RHEL 7: Downtime is required - the cluster must be stopped and reconfigured to enable more links:
# pcs cluster stop --all
Edit /etc/corosync/corosync.conf, adding ring addresses to each node for the additional interfaces and configuring the totem interfaces (see the Redundant Ring Protocol documentation in the additional resources below), then start the cluster:
# pcs cluster start --all
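For orientation, this is a minimal sketch of what the RHEL 7 two-ring (RRP) pieces of /etc/corosync/corosync.conf might look like after editing. The node names, addresses, and rrp_mode shown here are illustrative assumptions, not values from this article - consult the RRP documentation in the additional resources for the authoritative syntax:

```
nodelist {
    node {
        ring0_addr: node1.example.com
        ring1_addr: 192.168.100.1
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        ring1_addr: 192.168.100.2
        nodeid: 2
    }
}

totem {
    version: 2
    # rrp_mode must be set for the second ring to be used; passive is typical
    rrp_mode: passive
    transport: udpu
}
```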
Additional resources:
- RHEL 8 - CONFIGURING AND MANAGING HIGH AVAILABILITY CLUSTERS: 4.3. Creating a high availability cluster with multiple links
- RHEL 8 - CONFIGURING AND MANAGING HIGH AVAILABILITY CLUSTERS: 18.6. Adding and modifying links in an existing cluster
- RHEL 7 - HIGH AVAILABILITY ADD-ON REFERENCE: 4.3. Configuring Redundant Ring Protocol (RRP)
- How can I configure a redundant heartbeat network for a RHEL 6 or 7 High Availability cluster?
- RHEL 8 - CONFIGURING AND MANAGING NETWORKING: 7.19. Configuring network bonding using nmcli
- RHEL 8 - CONFIGURING AND MANAGING NETWORKING: 7.21. Configuring network teaming using nmcli
- RHEL 7 - NETWORKING GUIDE: CHAPTER 7. CONFIGURE NETWORK BONDING
- RHEL 7 - NETWORKING GUIDE: CHAPTER 8. CONFIGURE NETWORK TEAMING
INCREASE RESOURCE MONITORING TIMEOUTS
Reason: The cluster regularly monitors the highly-available resources active on the nodes, so that failures can be responded to. If a monitor operation does not return within the allowed timeout, recovery action is taken, which may eventually lead to fencing if the resource cannot be stopped where it was running. Increasing this timeout can give these monitoring operations more time to complete so that resource recovery or node fencing is not needed.
Summary:
- Configure a global default operation timeout for all resource operations that do not have a timeout already specified in the configuration. For example:
# pcs resource op defaults timeout=240s
- Alternatively - especially when individual resource operations already have timeouts configured - configure each resource to have a higher timeout on monitor operations. For example:
Syntax:
# pcs resource update <resource> op <operation> timeout=<value>
Example:
# pcs resource update ipaddr-corpLAN-appserver121 op monitor timeout=240s
Additional resources:
- RHEL 8 - CONFIGURING AND MANAGING HIGH AVAILABILITY CLUSTERS: 20.1. Configuring resource monitoring operations
- RHEL 8 - CONFIGURING AND MANAGING HIGH AVAILABILITY CLUSTERS: 20.2. Configuring global resource operation defaults
- RHEL 7 - HIGH AVAILABILITY ADD-ON REFERENCE: 6.6.1. Configuring Resource Operations
- RHEL 7 - HIGH AVAILABILITY ADD-ON REFERENCE: 6.6.2. Configuring Global Resource Operation Defaults
Root Cause
Follow these steps to check the most common sources of evidence that can help determine what caused a node to be fenced or reboot. In some situations more advanced diagnostics may be required, but these checks should always be considered first.
CHECK IF THE CLUSTER CAN REBOOT NODES
For a cluster to reboot a node, the cluster must be configured with a fence mechanism that is able to reboot a node. Check the cluster's STONITH configuration to determine if the node is using a power-management STONITH method or an alternative STONITH method.
# pcs config
# pcs stonith config
Examples of power-management STONITH methods:
sbd, fence_ipmilan, fence_imm, fence_vmware_soap, fence_vmware_rest, fence_rhevm, fence_hpblade, fence_ilo*, fence_apc, fence_apc_snmp, fence_cisco_ucs, fence_cisco_mds, fence_emerson, fence_drac5, fence_ibmblade, fence_lpar, fence_xvm, fence_bladecenter
- A fence_scsi script is linked in /etc/watchdog.d and the watchdog service is running.
Examples of alternative STONITH methods that do not use power-management:
fence_scsi, fence_mpath, fence_brocade, fence_ifmib
If the cluster is not configured with any method that can power-off the node that rebooted, then the cluster was not responsible for the reboot. Investigate other possible causes.
CHECK FOR CLUES THAT RESOURCE FAILURES LED TO FENCING
Look for any recent resource failures:
# pcs status
If there are resource stop-operation failures just prior to the time of the node reboot/fencing, this is very likely what triggered fencing of a node. Resource stop-failures are expected to trigger fencing, so investigate the cause in /var/log/messages.
CHECK FOR CLUES THAT THE NETWORK WAS DISRUPTED
Check /var/log/messages on the node that rebooted, at the time just before it rebooted. A system boot usually starts by logging the Linux version, similar to:
Oct 31 14:00:47 rhel7-node3 kernel: Linux version 3.10.0-693.11.6.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
Oct 31 14:00:47 rhel7-node3 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.11.6.el7.x86_64 root=/dev/mapper/rhel_dhcp182--144-root ro crashkernel=auto rd.lvm.lv=rhel_dhcp182-144/root rd.lvm.lv=rhel_dhcp182-144/swap console=tty0 console=ttyS0,115200 LANG=en_US.UTF-8
Look just before that. Does the node log any corosync messages about a processor failing, like these:
May 7 18:37:11 node2 corosync[12718]: [TOTEM ] A processor failed, forming new configuration.
May 7 18:37:23 node2 corosync[12718]: [QUORUM] Members[1]: 1
May 7 18:37:23 node2 corosync[12718]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
The first line means the node lost contact with one or more members, and the second line shows which members are still in contact - this node (node ID 1) being the only one left. If something like this is seen on the fenced node just before it reboots, it very likely means there was a network disruption. Investigate the network.
You can compare this message to any similar messages on other nodes to see what the membership looked like from those other nodes. If multiple nodes around the cluster were still in contact but they all lost contact with one specific node, then the problem is likely on that node. If all nodes lost contact with each other, then it may be a problem on the network. If there are only two nodes and they both lose contact with each other, it is difficult to tell from the logs alone where the problem may have been.
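To compare views quickly, the membership-related messages can be pulled out of each node's log. A minimal sketch, run here against a sample log fragment for illustration - in practice, point LOG at each node's /var/log/messages (or a copy from a sosreport):

```shell
# Sketch: extract corosync TOTEM/QUORUM membership messages so the
# views from several nodes can be compared side by side. A sample log
# fragment is used here; set LOG to a real log in practice.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
May  7 18:37:11 node2 corosync[12718]:  [TOTEM ] A processor failed, forming new configuration.
May  7 18:37:23 node2 corosync[12718]:  [QUORUM] Members[1]: 1
May  7 18:37:23 node2 corosync[12718]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May  7 18:37:23 node2 kernel: an unrelated message
EOF
grep -E '\[(TOTEM|QUORUM) ?\]' "$LOG"
```

For the sample above, this prints only the three corosync membership lines, filtering out the unrelated kernel message.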
CHECK FOR CLUES THAT THE NODE WAS STARVED FOR SYSTEM RESOURCES
Check /var/log/messages on the rebooted-node just before it rebooted, similar to the previous section. Look for any messages from corosync that indicate the node was blocked from scheduling processes, similar to:
Dec 15 00:10:39 node42 corosync[33376]: [TOTEM ] Process pause detected for 14709 ms, flushing membership messages.
Dec 15 00:10:39 node42 corosync[33376]: [MAIN ] Corosync main process was not scheduled for 14709.0010 ms (threshold is 8000.0000 ms). Consider token timeout increase.
And for any message from pacemaker that indicates it wasn't able to write on /dev/shm, similar to:
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: couldn't create file for mmap
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: qb_rb_open:/dev/shm/qb-6648-6547-10-EajrgZ/qb-request-attrd: No space left on device (28)
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: shm connection FAILED: No space left on device (28)
Oct 22 21:12:18 fastvm-node1 pacemaker-attrd[6648]: error: Error in connection setup (/dev/shm/qb-6648-6547-10-EajrgZ/qb): No space left on device (28)
If messages like these are seen, then the node was under some sort of heavy resource usage that prevented the cluster processes from working properly, and so it was fenced. Investigate the resource usage on the system and try to decrease the workload, or increase the cluster communication timeout (see section in Resolution above).
It is also worth reviewing any resource-utilization data on the system, such as sar (/var/log/sa/sa<date>) or pcp (/var/log/pcp).
CHECK FOR CLUES THAT THE KERNEL PANICKED
Look in /var/crash for a folder that is timestamped around when the node rebooted. If one is found, it means the system's kernel crashed. Review the data contained in the directory for more clues.
The absence of a folder in /var/crash/ does not rule out the possibility of a kernel panic. Consider:
- Is kdump configured and running? Check systemctl status kdump.service to see if it is running.
- Is kdump configured to dump to another location besides /var/crash/? Check /etc/kdump.conf.
- Did the cluster fence the node before it had time to dump any crash evidence to the dump location? Consider setting up fence_kdump (see Diagnostic Steps below).
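The second check above can be scripted. This is a minimal sketch that parses a sample kdump.conf fragment written to a temporary file for illustration - on a real node, set KDUMP_CONF=/etc/kdump.conf instead:

```shell
# Sketch: determine where kdump would write its vmcore, so crash
# evidence can be looked for in the right place. A sample kdump.conf
# fragment is used here for illustration; on a real node, set
# KDUMP_CONF=/etc/kdump.conf instead.
KDUMP_CONF=$(mktemp)
cat > "$KDUMP_CONF" <<'EOF'
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
EOF
awk '$1 == "path" {print $2}' "$KDUMP_CONF"
```

If no path directive is present, kdump defaults to /var/crash. Note that kdump can also be configured to dump to a remote target (such as nfs or ssh) in the same file, in which case the crash data will not be on the local disk at all.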
CHECK FOR HPE AUTOMATIC SERVER RECOVERY (ON HPE SERVERS)
If the rebooted system is an HPE ProLiant or blade server, it may have Automatic Server Recovery (ASR) enabled, which can reboot a system that is not performing as expected. Check whether this feature is enabled using hpasmcli if it is installed:
# hpasmcli -s "show asr"
Or check the HPE server configuration otherwise.
Diagnostic Steps
When the cause of a node reboot is still unknown, set up advanced diagnostics to capture more data in the event of a recurrence.
CONFIGURE PCP RESOURCE MONITORING
If pcp is not already configured to capture system-resource statistics, set it up - as this is Red Hat's recommended resource-monitoring tool for RHEL systems.
In RHEL High Availability environments that are in need of diagnostic data for fence situations, Red Hat recommends a capture interval of 10 seconds and capturing per-process statistics.
NOTE: Be aware this may generate a large volume of data in /var/log/pcp - typically 300M-400M per day with the recommended settings below. Monitor disk-space usage and if it becomes a problem add disk space, reconfigure pcp capture settings, or stop the services if there is an imminent risk of filling up filesystems.
Summary:
- Install pcp-zeroconf:
# yum install pcp-zeroconf
This package enables pmcd and sets up data capture at a 10-second interval.
Upon any further incidents, sosreport should collect the pcp data files for further review. See section below for inspecting pcp data.
Additional resources:
- PCP Quick Reference Guide (pcp.io)
- Performance Co-Pilot User's and Administrator's Guide (pcp.io)
- How can I change the default logging interval used by Performance Co-Pilot (PCP)?
- How can I customize the Performance Co-Pilot logging configuration
- How to install and gather performance logging using Performance Co-Pilot (pcp) in RHEL 7.5+ or RHEL 8?
REVIEW PCP DATA AFTER A FENCING INCIDENT
If PCP is capturing data on the host, review it for signs of heavy resource usage. Such heavy usage can starve the cluster processes for the CPU time they need and can lead to fencing.
Summary:
- Find the pmlogger archive that covers the time period leading up to the reboot. These files are in /var/log/pcp/pmlogger/<hostname>/ on the system, or at the same path (var/log/pcp/pmlogger/<hostname>/) within a sosreport archive. pmdumplog can print the time period captured by a pmlogger archive (and the time period is indicated in the filename as well):
# pmdumplog -L /var/log/pcp/pmlogger/node1.example.com/20191031.16.22.0
Log Label (Log Format Version 2)
Performance metrics from host node1.example.com
commencing Thu Oct 31 16:22:49.286425 2019
ending Thu Oct 31 16:26:59.426811 2019
Archive timezone: EDT+4
PID for pmlogger: 1897
- Display a listing of which metrics were being recorded in an archive:
# pminfo -a /var/log/pcp/pmlogger/node1.example.com/20191031.16.30.0
xfs.log.writes
xfs.log.blocks
xfs.log.noiclogs
[...]
- Play back any useful statistic from the archive at the time period leading up to the reboot, at whatever interval is helpful for the investigation. Perhaps start with 2 seconds in the few minutes just before the reboot, and adjust if there is a need to go back further and look at more data. For example, to display the kernel.all.load statistic at a 2-second interval for a 4-minute window:
# pmval -t 2sec -f 3 kernel.all.load -S @"2019-10-31 16:30" -T @"2019-10-31 16:34" -a /var/log/pcp/pmlogger/cs-rh7-1.gsslab.rdu2.redhat.com/20191031.16.30
metric: kernel.all.load
archive: /var/log/pcp/pmlogger/cs-rh7-1.gsslab.rdu2.redhat.com/20191031.16.30
host: cs-rh7-1.gsslab.rdu2.redhat.com
start: Thu Oct 31 16:30:46 2019
end: Thu Oct 31 16:34:00 2019
semantics: instantaneous value
units: none
samples: 97
interval: 2.00 sec
16:30:46.645 No values available
1 minute 5 minute 15 minute
16:30:48.645 1.430 0.770 0.470
16:30:50.645 1.430 0.770 0.470
16:30:52.645 1.430 0.770 0.470
16:30:54.645 1.430 0.770 0.470
- Investigations looking for high resource consumption can start with this base set of statistics to look for obvious culprits:
vfs.files.count
vfs.files.free
vfs.files.max
disk.all.read
disk.all.write
disk.all.blkread
disk.all.blkwrite
disk.all.total
mem.util.used
mem.util.free
mem.util.shared
mem.util.bufmem
mem.util.cached
mem.util.swapTotal
mem.util.swapFree
mem.util.dirty
mem.util.writeback
mem.util.hugepagesTotalBytes
mem.util.hugepagesFreeBytes
network.interface.total.bytes
network.interface.total.packets
network.interface.total.errors
network.interface.total.drops
network.interface.total.mcasts
kernel.all.runnable
kernel.all.running
kernel.all.blocked
kernel.all.cpu.user
kernel.all.cpu.nice
kernel.all.cpu.sys
kernel.all.cpu.idle
kernel.all.cpu.intr
- pcp atop can also be useful for seeing process statistics at a given time:
# pcp -t 2sec -S @"2019-10-31 16:30" -T @"2019-10-31 16:34" -a /var/log/pcp/pmlogger/node1.example.com/20191031.16.30 atop
- With any metric being investigated, it may be helpful to play back data from older archives when no problem was occurring - to understand the "baseline" performance of the system. Compare that to the time of the event to determine if any metric seems unusual. Just remember - a spike in a metric doesn't necessarily mean it was the cause of the fence event; further investigation may be needed into whether that level is capable of causing problems in the cluster. Attempting to reproduce that load while watching for problems can be a useful strategy.
INSTALL spausedd
Reason: spausedd is a lightweight daemon that can detect when the system's CPU resources are overcommitted in a way that could prevent processes from being scheduled. Such overcommit scenarios are a common cause of a node becoming unresponsive and getting fenced, and often it leaves no clues in the logs before the node gets powered off. spausedd logs a message when these conditions are detected, which can assist future investigations into fencing.
Summary:
- Install spausedd on all nodes:
# yum install spausedd
- Start and enable spausedd on all nodes:
# systemctl enable spausedd
# systemctl start spausedd
CONFIGURE KDUMP
Reason: If the kernel "panics" (crashes), the kdump service must be running in order for the evidence of that crash to be saved to disk. Without it running, a server that crashes because of a kernel panic may leave no clues behind.
Summary:
Make sure kdump is configured to capture kernel-crash cores and logs. Check if it's running with:
# systemctl status kdump.service
It can also be useful to simulate a kernel-panic to see if a core is dumped.
If kdump is not running, set it up.
Additional resources:
- How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux
- Red Hat Customer Portal Labs - KDump Helper
CONFIGURE fence_kdump TO ALLOW CAPTURING OF KERNEL CRASH DATA
Reason: Even with kdump configured, fencing by the cluster may interrupt a crashed-system's attempt to dump crash data to disk. Configuring fence_kdump can allow that crash-data to be saved without interruption.
CONFIGURE SERIAL CONSOLE REDIRECTION/MONITORING
Reason: Many server vendors, virtualization platforms, and cloud vendors offer remote-console viewing capabilities. This can be useful to monitor a cluster node that is rebooting for unknown reasons, as the kernel may print some useful clue to the console before it crashes or gets fenced. The console must be configured to redirect to an appropriate serial output device in order to capture or remotely view this output.
Summary:
- Consult with the hardware, virtualization, or cloud vendor for instructions on remote logging or remote capture of a server's serial console
- Configure the RHEL systems to redirect their console output to the serial port
- Monitor the serial console remotely, or set up a capture of the output if the platform supports it, and wait for an event. Review the captured output to see if there is evidence of a crash or any other clues before the system reboots.
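As one hedged example of the RHEL side of this setup, console redirection is typically enabled through kernel command-line parameters. The device name (ttyS0) and baud rate (115200) below are assumptions - match them to what the platform's virtual serial port actually provides:

```
# /etc/default/grub - append the serial console to the kernel command line:
GRUB_CMDLINE_LINUX="... console=tty0 console=ttyS0,115200"

# Then regenerate the GRUB configuration (the grub.cfg path differs on
# UEFI systems) and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```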
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.