sbd watchdog timeout causes node to reboot during crash kernel execution
Environment
- Red Hat Enterprise Linux Server 6, 7 (with the High Availability Add-on)
- Pacemaker
- sbd
Issue
- sbd watchdog timeout causes node to reboot during crash kernel execution.
- vmcore is not collected because SBD_WATCHDOG_TIMEOUT expires. sbd reboots a node while kdump is running, even with fence_kdump configured.
Resolution
Add the watchdog module for your system to the extra_modules line of /etc/kdump.conf.
# grep ^extra_modules /etc/kdump.conf
extra_modules i6300esb
If you don't know the name of your watchdog module, stop cluster services (so that sbd relinquishes control of the watchdog) and run wdctl to identify it. In the example below, the module name is "i6300esb".
# pcs cluster stop --all
# wdctl | grep Identity
Identity: i6300ESB timer [version 0]
# lsmod | grep -i i6300ESB
i6300esb 13566 0
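The resolution above can be sketched as a small helper that detects the loaded watchdog module and appends it to the extra_modules line. This is an illustrative sketch, not part of the article: the function names are hypothetical, and KDUMP_CONF is parameterized so it can be tried against a copy rather than the live /etc/kdump.conf.

```shell
#!/bin/bash
# Hypothetical helper: add the system's watchdog module to kdump's
# extra_modules line. KDUMP_CONF defaults to the real file but can be
# overridden to experiment on a scratch copy.
KDUMP_CONF="${KDUMP_CONF:-/etc/kdump.conf}"

# Best-effort guess at the loaded watchdog driver from lsmod output.
# (Run with cluster services stopped so sbd has released /dev/watchdog,
# then confirm with wdctl as shown above.)
wd_module() {
    lsmod | awk 'tolower($1) ~ /esb|wdt|watchdog/ {print $1; exit}'
}

add_extra_module() {
    local mod="$1"
    if grep -q '^extra_modules' "$KDUMP_CONF"; then
        # Append to the existing line unless the module is already listed.
        grep -q "^extra_modules.*\b$mod\b" "$KDUMP_CONF" || \
            sed -i "s/^extra_modules.*/& $mod/" "$KDUMP_CONF"
    else
        echo "extra_modules $mod" >> "$KDUMP_CONF"
    fi
}
```

After editing the real /etc/kdump.conf, restart the kdump service so the crash kernel initramfs is rebuilt with the module included.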
Root Cause
sbd provides a method of self-fencing that relies on a hardware watchdog timer. For a detailed conceptual discussion, refer to Exploring RHEL High Availability's Components - sbd and fence_sbd.
Simply put, sbd writes to /dev/watchdog frequently so that the watchdog timer does not expire. If sbd stops updating the timer for any reason, the timer will count down and eventually expire. The system will then reboot.
The timeout value is found in the SBD_WATCHDOG_TIMEOUT value in /etc/sysconfig/sbd.
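As a quick check, the effective timeout can be read with a short snippet like the following, falling back to the 5-second default when the variable is unset. This is a sketch; the SBD_SYSCONFIG variable is introduced here only so the path can be overridden for illustration.

```shell
# Read SBD_WATCHDOG_TIMEOUT from the sbd sysconfig file, falling back
# to the default of 5 seconds when the variable is not configured.
SBD_SYSCONFIG="${SBD_SYSCONFIG:-/etc/sysconfig/sbd}"

sbd_timeout() {
    local t
    t=$(awk -F= '$1 == "SBD_WATCHDOG_TIMEOUT" {print $2}' "$SBD_SYSCONFIG" 2>/dev/null)
    echo "${t:-5}"
}
```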
A panic causes the crash kernel to execute and try to collect a vmcore. The crash kernel operates in a minimal environment by default. The watchdog kernel module is not loaded, and /dev/watchdog is not created. As a result, there is no interface between the OS and the watchdog device during crash kernel execution.
/etc/kdump.conf provides an option to load additional modules into the crash kernel. The option is called extra_modules. Adding the name of the watchdog module to the extra_modules line allows the creation of /dev/watchdog as an interface to the watchdog device. With the module loaded, the timer can be updated so that it does not expire and force a reboot in the middle of vmcore generation.
Diagnostic Steps
1. Verify that sbd is configured.

   # pcs config | egrep 'watchdog|auto_tie_breaker'
   have-watchdog: true
   auto_tie_breaker: 1
   # pcs status | grep sbd
   sbd: active/enabled

2. Obtain the SBD_WATCHDOG_TIMEOUT value. If this value is not configured, it defaults to 5 seconds.

   # awk '$1 ~ /^SBD_WATCHDOG_TIMEOUT=/' /etc/sysconfig/sbd

3. Trigger a panic, which will cause the crash kernel to execute.

   # echo c > /proc/sysrq-trigger

4. Time the crash kernel execution, and observe that the node reboots approximately <SBD_WATCHDOG_TIMEOUT> seconds after triggering the panic. If you are having difficulty with this step, you can configure a kdump_pre script like the following to print a message to the console once per second. This example assumes an SBD_WATCHDOG_TIMEOUT of 5. (Note that this may be a couple of seconds different from the SBD_WATCHDOG_TIMEOUT, since kdump_pre does not execute immediately after you trigger the panic.)

   #!/bin/sh
   echo '### RUNNING KDUMP_PRE ###'
   for i in {1..6}; do
       echo $i
       /usr/bin/sleep 1
   done

5. If you used the kdump_pre script in Step 4, you should see output in the console similar to the following, followed by a reboot.

   ...
   ### RUNNING KDUMP_PRE ###
   1
   2
   3
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.