A Guide to Unexpected System Restarts

Updated 3 Dec 2020

Introduction

Red Hat provides a Kernel Oops Analyzer tool to help you diagnose a kernel crash issue. When you input text or a file containing one or more kernel oops messages, the tool walks you through diagnosing the kernel crash issue. Try using the tool before you perform the manual steps below. It may find a solution for your kernel crash issue in seconds. You can leave feedback on the tool at Kernel Oops App Info.

While a Red Hat Enterprise Linux system will not reboot unless specifically configured to do so, there are still several instances in which an unexpected reboot can occur. At a basic level, these occurrences fall into three categories:

A deliberate action on the part of a user (fence event, shutdown commands, etc.)
A software fault on the server (kernel panic, NMI, etc.)
A hardware fault/power failure in the server (power supply failure, disk or memory corruption, etc.)

In this article we discuss how to identify these occurrences and steps to alter or prevent future occurrences.

Understanding the Environment

There are some important questions to ask when an unexpected reboot has recently occurred that will help narrow down likely causes. Taking our lead from the categories above:

Identifying deliberate actions/configurations that would cause a restart:
- Is the server in question a cluster node with an attached fence device?
- Was the software on the server performing any tasks which would change its typical resource use?
- Is the server configured with health monitoring software, such as HP ASR?
- Is there a Baseboard Management Controller connected to the system? HP iLO, Dell DRAC, etc.
Potential software faults will most typically leave traces in /var/log/messages, investigated in the next section.
Potential hardware faults are difficult to diagnose from an operating system level, but it remains important to note power failures, maintenance events, or other environmental occurrences around the time of the restart.

Investigating /var/log/messages

Many of the most common restart causes will leave traces in /var/log/messages. All full system restarts will begin by listing the kernel command line, so searching the message log for the phrase "Command line" is a good first step when beginning an investigation.

For example:

Sep 29 04:18:15 <hostname> kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
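
The search described above can be sketched as a small shell helper. This is only an illustration: the function name find_boot_markers is hypothetical, and it assumes the log is a plain-text file at /var/log/messages unless another path is passed.

```shell
# Hypothetical helper: print each boot marker with its line number.
# Assumes a plain-text syslog file; rotated logs may need searching as well.
find_boot_markers() {
    local logfile="${1:-/var/log/messages}"
    # -n prefixes each match with its line number in the file
    grep -n 'Command line' "$logfile"
}
```

Each reported line number marks the start of a boot, so the lines immediately above it are the last messages logged before that restart.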

Starting from this point and working backwards, look for messages similar to the following. Note that these are examples of trouble indicators; the actual errors found may vary by application and release version:

User-initiated Shutdown
- shutdown: shutting down for system reboot
- init: Switching to runlevel: 6
- exiting on signal 15
- Got SIGTERM, quitting.
Veritas Cluster Fence Event
- GAB WARNING V-15-1-20138 Port h isolated due to client process failure
RHEL High-Availability Cluster Fence Event
- fenced[xxxx]: fencing node "node1.example.com"
- [TOTEM ] A processor failed, forming new configuration.
- [TOTEM] The token was lost in the OPERATIONAL state.
Hardware Fault
- CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
- Kernel panic - not syncing: Machine check
- Kernel panic - not syncing: Uncorrected machine check
Thermal Event/Cooling Failure
- kernel: CPUX: Temperature above threshold, cpu clock throttled
- kernel: CPUX: Core power limit notification (total events = 1)
Power Button Pressed
- received event "button/power PWRF 00000000 00000000"
Non-Maskable Interrupt Received
- kernel: Uhhuh. NMI received for unknown reason XX.
- kernel: NMI received for unknown reason 00
- kernel: Dazed and confused, but trying to continue
- kernel: Do you have a strange power saving mode enabled?
Kernel Soft Lockup
- kernel: BUG: soft lockup - CPU#7 stuck for 10s!
Task Blocked for Too Long
- kernel: INFO: task <process>:60 blocked for more than 120 seconds.

These messages may not necessarily be the root cause of the reboot, but are important clues worth investigating further.
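
As a rough sketch, several of the indicators listed above can be searched for in one pass. The function name scan_restart_clues is hypothetical, and the patterns are taken from the example messages in this article, so they may need adjusting for a given application or release.

```shell
# Hypothetical helper: print log lines (with line numbers) that match
# some of the trouble indicators listed in this article.
scan_restart_clues() {
    local logfile="${1:-/var/log/messages}"
    grep -n -E \
        -e 'shutting down for system reboot' \
        -e 'Switching to runlevel: 6' \
        -e 'exiting on signal 15' \
        -e 'fencing node' \
        -e 'A processor failed, forming new configuration' \
        -e '[Mm]achine [Cc]heck' \
        -e 'Kernel panic - not syncing' \
        -e 'Temperature above threshold' \
        -e 'button/power' \
        -e 'NMI received for unknown reason' \
        -e 'BUG: soft lockup' \
        -e 'blocked for more than [0-9]+ seconds' \
        "$logfile"
}
```

Comparing the reported line numbers with the line number of the nearest following "Command line" entry shows which clues were logged shortly before a given restart.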

Where to Go Next

Should it become apparent that the system suffered a hang, lockup, or loss of service that caused an external application to reboot it, then an investigation of server load and performance leading up to the event is in order. By default, the System Activity Reporter facility provided by the sysstat package is the recorder of such data. Analyzing any SAR files collected is detailed further in our Knowledge Base. See How to analyze and interpret sar data.
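
A minimal sketch of pulling load data for the day of the restart follows. It assumes the sysstat default of daily files named /var/log/sa/saDD (some configurations use a different directory or naming scheme) and GNU date; both function names are hypothetical.

```shell
# Hypothetical helper: map a date to the daily SAR file for that day of month.
# Assumes the default sysstat location /var/log/sa and GNU date (-d option).
sar_file_for_date() {
    printf '/var/log/sa/sa%s\n' "$(date -d "$1" +%d)"
}

# Hypothetical helper: show load and CPU data recorded on the given date.
show_load_for_date() {
    local safile
    safile="$(sar_file_for_date "$1")"
    sar -q -f "$safile"   # run queue length and load averages
    sar -u -f "$safile"   # CPU utilization
}
```

For example, show_load_for_date 2020-09-29 would read /var/log/sa/sa29, provided that file has not yet been rotated away.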

Should none of the above messages show up in the logs, then the reboot cause can be narrowed down to an event that does not print messages to the logs. There are a limited number of operations that perform in this manner. The most prevalent of these follow.

Kernel Panic

A Red Hat Enterprise Linux system can be configured to reboot after experiencing a kernel panic. The kernel parameter by which this is set represents the number of seconds after a panic has been experienced before a reboot command will be issued, and is exposed in the /proc filesystem:

# cat /proc/sys/kernel/panic

If this value is set to 0, this functionality is disabled. Should an unexpected restart occur when this feature is enabled, there is a strong likelihood that the system is experiencing kernel panics. In these cases, we strongly recommend configuring netdump (Red Hat Enterprise Linux 4 or below) or kdump (Red Hat Enterprise Linux 5 or above) on the affected system to gather information regarding the panic cause.
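
The check can be wrapped in a small helper that interprets the value. This is a sketch: the function name describe_panic_timeout is hypothetical, and the handling of negative values (immediate reboot) follows the upstream kernel sysctl documentation rather than this article.

```shell
# Hypothetical helper: interpret the kernel.panic setting.
# The optional path argument exists only so the logic can be exercised
# against a test file instead of /proc.
describe_panic_timeout() {
    local panic_file="${1:-/proc/sys/kernel/panic}"
    local seconds
    seconds="$(cat "$panic_file")"
    if [ "$seconds" -eq 0 ]; then
        echo "reboot on panic: disabled"
    elif [ "$seconds" -gt 0 ]; then
        echo "reboot on panic: after ${seconds}s"
    else
        # Negative values: assumed per kernel documentation
        echo "reboot on panic: immediate"
    fi
}
```

Any result other than "disabled" means a kernel panic is a plausible explanation for a restart that left no messages in the logs.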

NOTE: On a Red Hat Enterprise Linux 6 system, you can often speed up analysis of a kernel panic through use of a small file called the kernel log. See RHEL6: Speeding up kernel crash / hang analysis with the kernel log for more information.

SysRq

The SysRq facility contains functionality that can force an instantaneous system reboot. While shutdown commands are generally logged to the system's messages file, SysRq commands are not always captured in the same way. There are two ways a SysRq can be issued to cause a reboot. If the "Magic" SysRq key sequence has been enabled, then the key sequence Alt+PrintScreen+b will trigger a system reboot on the spot. This can be enabled and disabled with the kernel parameter kernel.sysrq, again exposed through the /proc filesystem:

# cat /proc/sys/kernel/sysrq

If this command returns 0, then triggering a SysRq command with the above key sequence is disabled. A 1 indicates that this functionality is enabled.
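
The file may also contain a value greater than 1, which the upstream kernel documentation describes as a bitmask of permitted SysRq functions, with bit value 128 covering reboot and poweroff. The sketch below interprets the setting under that assumption; the function name sysrq_reboot_allowed is hypothetical.

```shell
# Hypothetical helper: report whether the keyboard SysRq reboot ('b')
# is permitted by the current kernel.sysrq value.
# The optional path argument exists only for testing against a sample file.
sysrq_reboot_allowed() {
    local value
    value="$(cat "${1:-/proc/sys/kernel/sysrq}")"
    # 1 enables all functions; otherwise bit 128 permits reboot/poweroff
    # (bit meaning assumed from the kernel SysRq documentation)
    if [ "$value" -eq 1 ] || [ $(( value & 128 )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}
```

This only describes the keyboard path; it says nothing about writes to /proc/sysrq-trigger.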

Alternatively, the file /proc/sysrq-trigger can be used to issue a SysRq command whether or not the "Magic" key sequence is enabled. The command

# echo 'b' > /proc/sysrq-trigger

will instantly trigger a system reboot. Many different clustering software suites use this file and functionality as a fencing solution. The cluster management software monitors the cluster nodes for errors or hangs and, upon detecting that a node has become unresponsive, issues the above command on that node, resulting in a restart. The Red Hat High Availability clustering software does not use this functionality, but if non-Red Hat clustering software is present on the system, we recommend investigating which fencing solution that cluster software employs.

IPMI and Baseboard Management Controllers

Many pieces of software monitor a system for perceived performance problems and, if any are detected, send an IPMI signal to a BMC on the system board to restart the poorly performing server. Different implementations of IPMI exist on different hardware platforms, including HP iLO and Dell DRAC. A frequent culprit of this type of unexpected reboot is the Automated System Recovery (ASR) functionality provided by the hp-health package on HP hardware with iLO cards. If this package is installed, you can check for ASR events with the following commands:

# hpasmcli -s "show asr"
# hpasmcli -s "show iml"

Additionally, some clustering software, including Red Hat's own, can use IPMI signals to fence unresponsive nodes. If the server in question has such hardware installed, investigating the related hardware logs and/or cluster logs can shed further light on reboot occurrences.

Failing Hardware

Should no evidence of the above be present, then the remaining piece of the equation to investigate is hardware. There have been previous cases where a bad motherboard, faulty CPU, or a failing Power Supply Unit has caused power to be lost to the machine causing a hard shutdown. This behavior is entirely dependent on the hardware within the system, and performing full hardware diagnostics against the machine is generally the only method to rule this out as a possibility.
If the server has a BMC, hardware information may be available through the ipmitool command, as in the examples below. See man ipmitool for more options.

# ipmitool sensor
CPU Temp     | 22.000     | degrees C  | ok    | na        | na        | na        | 40.000    | 45.000    | 50.000
AVG Power    | 500.000    | Watts      | ok    | na        | na        | na        | na        | na        | na
Fan 1        | 4925.000   | RPM        | ok    | na        | 850.000   | na        | na        | na        | na
......

# ipmitool sel list
1 | Pre-Init Time-stamp   | Power Supply VRM Status | Presence detected | Asserted
2 | Pre-Init Time-stamp   | Power Unit Sys pwr monitor | Power off/down | Deasserted
......
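
On systems with many sensors, the sensor listing can be narrowed to entries that need attention. The filter below is a sketch that assumes the pipe-separated layout shown above, with the status in the fourth column and ipmitool's usual status abbreviations (ok, na, ns for healthy or absent readings); the function name sensors_needing_attention is hypothetical.

```shell
# Hypothetical filter: read `ipmitool sensor` output on stdin and print
# sensors whose status column is something other than ok, na, or ns.
sensors_needing_attention() {
    awk -F'|' 'NF >= 4 {
        status = $4
        gsub(/[[:space:]]/, "", status)   # trim padding around the status
        if (status != "ok" && status != "na" && status != "ns") print
    }'
}

# Typical use (requires a BMC and the ipmitool package):
#   ipmitool sensor | sensors_needing_attention
```

An empty result does not rule out a hardware fault; it only means no sensor is currently reporting a threshold violation.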

Presence of third party software

We have been informed that software within certain third-party packages can cause unexpected reboots. The packages that have been detected are listed below with their third-party knowledge base articles.

Disclaimer: The following information has been provided by Red Hat, but is outside the scope of the posted Service Level Agreements and support coverage as it involves software not provided by Red Hat. For further information on these issues, you should reach out to your Intel support team and have them take a look. The information is provided as-is, and any configuration settings or installed applications made from the information in this article could make the Operating System unsupported by Red Hat Global Support Services. The intent of this article is to provide information to accomplish the system's needs. Use the information in this article at your own risk.

https://support.hpe.com/hpesc/public/docDisplay?docId=a00106372en_us&docLocale=en_US