How to collect system information to provide to Red Hat Support for analysis when a system hangs

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux [All versions]

Issue

  • How can I collect system information to provide to Red Hat Support for analysis when the system hangs or becomes unresponsive?
  • What can be done if a RHEL system hangs?
  • How do I start root cause analysis (RCA) for system hang issues?

Resolution

Note: This document only covers common situations with unresponsive systems. Please consult Red Hat Support for specific cases: Red Hat Technical Support – contact numbers and availability

For root cause analysis of why a system became unresponsive, various pieces of information are necessary, including a system core dump ("vmcore").

Preparation steps
  1. Pre-configure kdump/netdump/diskdump
  2. Enable sysrq
  3. Enable nmi_watchdog
  4. Test the configuration successfully.

These steps are discussed in more detail below.

Steps to collect information when the problem happens
1. Check the status of the system

Check:

  • Whether you can log in to the system via ssh or telnet,
  • Whether you can log in to the system via the console,
  • Whether the system responds to ping,
  • Whether the system responds to the keyboard or mouse in any way.

If there is a response to any of these, it means that the system can still respond to some interrupts.
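
The network checks above can be run from a second machine. The following is a minimal sketch, not a definitive procedure: the function names probe and check_host are hypothetical, and it assumes the Linux ping options -c and -W and key-based ssh access to the affected host.

```shell
#!/bin/sh
# Run one check command quietly and report whether it succeeded.
# Usage: probe LABEL COMMAND [ARGS...]
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: responds"
    else
        echo "$label: no response"
    fi
}

# Check a host from another machine on the network.
# Assumption: ssh uses key-based login, so BatchMode never prompts.
# Usage: check_host HOSTNAME
check_host() {
    host=$1
    probe "ping" ping -c 3 -W 2 "$host"
    probe "ssh" ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

Calling check_host with the name of the hung system prints one line per check. Console and keyboard responses still have to be checked by hand.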

2. Get information about the system state through sysrq
  • Note: If you plan to follow the next step, "3. Crash the system to obtain a vmcore", please skip this step, since the system state will be fully captured in the vmcore.
  • Run the following key combinations:
Alt + SysRq + m
Alt + SysRq + w
Alt + SysRq + t
Alt + SysRq + p

Note: Please run this set of key combinations three times, about 3 minutes apart. Capture the information that is printed on the screen directly or, if the system is configured for netconsole, from the server that logs its console messages.
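
If a shell on the system is still usable (for example over ssh), the same SysRq commands can be sent by writing the letters to /proc/sysrq-trigger as root. The sketch below assumes that situation; the function name collect_sysrq and its parameters are hypothetical, and the defaults mirror the three rounds and 3-minute interval described above.

```shell
#!/bin/sh
# Send the m, w, t and p SysRq commands several times, pausing between rounds.
# Usage: collect_sysrq [TRIGGER_FILE] [ROUNDS] [INTERVAL_SECONDS]
collect_sysrq() {
    trigger=${1:-/proc/sysrq-trigger}
    rounds=${2:-3}
    interval=${3:-180}
    round=1
    while [ "$round" -le "$rounds" ]; do
        for key in m w t p; do
            echo "$key" > "$trigger" || return 1
            echo "round $round: sent $key"
        done
        # Wait between rounds, but not after the last one.
        if [ "$round" -lt "$rounds" ]; then
            sleep "$interval"
        fi
        round=$((round + 1))
    done
}
```

Calling collect_sysrq with no arguments uses the defaults. The SysRq output itself goes to the kernel log, so it still has to be captured from the console, from dmesg, or from the netconsole server.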

3. Crash the system to obtain a vmcore
  • Run the following key combination:
    Alt + SysRq + c
    
A vmcore file should be collected via the pre-configured kdump/netdump/diskdump service.

4. Get sosreport/sysreport

Once the vmcore file has been written, the system should reboot. After the system has rebooted, generate a sosreport/sysreport.

  • Red Hat Enterprise Linux 4.6 and later:
    # sosreport
    
  • Red Hat Enterprise Linux 4.5 and earlier:
    # sysreport
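
When the exact minor release is not known, one option is to check which of the two commands is installed. This is a minimal sketch; the function name pick_report_tool is hypothetical.

```shell
#!/bin/sh
# Print the name of the report tool available on this system:
# sosreport if it is installed, otherwise sysreport.
pick_report_tool() {
    for tool in sosreport sysreport; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "neither sosreport nor sysreport found" >&2
    return 1
}
```

The printed name is the command to run as root to generate the report.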
    
**If there is no response in any of the situations described above, the system may not be responding to any interrupts. After a while, a kernel panic should occur and a vmcore should be captured, which can then be provided to Red Hat Support.**

Note:

  • If the keyboard or mouse is connected through a KVM switch, the system may not respond to keyboard or mouse input. If possible, connect a PS/2 keyboard directly to the system instead.
  • If you are unsure of the situation the system is in, please contact Red Hat Support first.
  • Capturing a vmcore may take a long time during which the system cannot perform its regular function. You may need to make a business decision whether to wait for the vmcore to be captured (thus increasing the chances a root cause can be identified for the issue and a reoccurrence can be prevented) or whether to restore the system to its regular functions on short notice.
  • Some systems use warning lights or beeps to inform the operator of a hardware error. If the system reports a hardware error, please also report this information to Red Hat Support.

Comments

To collect this information, some settings need to be pre-configured.

1. Configure kdump/netdump/diskdump

Please refer to the following articles.

2. Enable sysrq

The Magic SysRq key is a key combination that the kernel responds to regardless of whatever else it is doing, unless it is completely locked up. For security reasons, Red Hat Enterprise Linux disables the SysRq key by default.

  • To enable sysrq, run:
    # echo 1 > /proc/sys/kernel/sysrq
    
  • To enable it permanently, set the kernel.sysrq value in /etc/sysctl.conf to 1. That will cause it to be enabled on reboot.
    kernel.sysrq = 1
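
The permanent setting can also be applied with a small helper instead of editing the file by hand. The sketch below is one way to do it, assuming GNU sed (for the -i option); the function name set_sysctl_conf is hypothetical. It replaces an existing entry for the key or appends a new one.

```shell
#!/bin/sh
# Set KEY = VALUE in a sysctl-style file, replacing an existing entry
# for KEY or appending a new one.
# Usage: set_sysctl_conf KEY VALUE [FILE]
set_sysctl_conf() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    # Escape the dots in the key so they match literally in the patterns.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i "s/^[[:space:]]*${pattern}[[:space:]]*=.*/$key = $value/" "$file"
    else
        echo "$key = $value" >> "$file"
    fi
}
```

Running set_sysctl_conf kernel.sysrq 1 as root updates /etc/sysctl.conf, and sysctl -p then loads the file without a reboot.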
    
For detailed usage of the Magic SysRq keys, please refer to: *[What is the SysRq facility and how do I use it?](/knowledge/node/2023)*

3-1. Enable nmi_watchdog

The Non-Maskable Interrupt (NMI) Watchdog in Red Hat Enterprise Linux is a mechanism used to detect system lockups. It has been available since Red Hat Enterprise Linux 3 Update 3. By default, nmi_watchdog is enabled on 64-bit systems. For how to enable nmi_watchdog, please refer to: What is NMI and what can I use it for?
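
One common way to see whether NMIs are being delivered is to compare the NMI line of /proc/interrupts at two points in time. The sketch below assumes the usual /proc/interrupts layout (a label followed by one counter per CPU); the function names nmi_count and nmi_is_ticking are hypothetical, and how quickly the counter grows depends on the kernel version and watchdog mode.

```shell
#!/bin/sh
# Sum the per-CPU NMI counters from a file in /proc/interrupts format.
# Usage: nmi_count [INTERRUPTS_FILE]
nmi_count() {
    awk '$1 == "NMI:" {
             total = 0
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) total += $i
             print total
         }' "${1:-/proc/interrupts}"
}

# Report whether the NMI counter increased between two samples.
# Usage: nmi_is_ticking BEFORE AFTER
nmi_is_ticking() {
    if [ "$2" -gt "$1" ]; then
        echo "NMI count is increasing"
    else
        echo "NMI count is not increasing"
    fi
}
```

A typical use is to store the result of nmi_count, wait a short while, call it again, and pass both values to nmi_is_ticking.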

3-2. Enable NMI switch

Sometimes the SysRq key and the NMI watchdog do not work. In that case, use the NMI switch. For details about the NMI switch, please refer to: How can I configure my system to crash when NMI switch is pushed?

4. Enable netconsole

Netconsole allows dmesg output to be transmitted via the network through the use of syslog. It implements kernel-level network logging via UDP port 514. Please see the following article for reference: How do I configure netconsole?

5. Install sysstat package

This package provides the sar and iostat commands, which monitor disk, network, and other I/O activity. It is not installed by default. If the sysstat package is not installed, install it with the yum or up2date command.

  • Red Hat Enterprise Linux 5 and 6
    # yum install sysstat
    
  • Red Hat Enterprise Linux 3 and 4
    # up2date -i sysstat
    
To collect more detailed information about system activity, change the sampling interval from the default of 10 minutes to 1 minute by modifying /etc/cron.d/sysstat.
  • /etc/cron.d/sysstat: before edit
    */10 * * * * root /usr/lib/sa/sa1 1 1
    
  • /etc/cron.d/sysstat: after edit
    */1 * * * * root /usr/lib/sa/sa1 1 1
    
The change takes effect immediately without restarting any service, and it has little effect on system performance.
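
The cron edit shown above can also be scripted. This is a minimal sketch assuming GNU sed (for the -i option) and a cron line that starts with */10 as in the example; the function name set_sa1_every_minute is hypothetical.

```shell
#!/bin/sh
# Change the sa1 sampling schedule in a sysstat cron file from every
# 10 minutes to every minute. Lines that do not run sa1 are left alone.
# Usage: set_sa1_every_minute [CRON_FILE]
set_sa1_every_minute() {
    cron_file=${1:-/etc/cron.d/sysstat}
    # Only touch lines that mention sa1 and start with */10.
    sed -i '/sa1/ s|^\*/10 |*/1 |' "$cron_file"
}
```

No backup copy is written next to the cron file on purpose, because cron may also run extra files placed in /etc/cron.d. Copy the original elsewhere first if a backup is wanted.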
6. Test kdump/netdump/diskdump configuration
  • Run the following command to trigger kernel panic:
    # echo c > /proc/sysrq-trigger
    
Normally the system dumps its memory after the kernel panic. After the dump completes and the system restarts, the vmcore file can be found in the configured location; its size, as shown by the *ls* command, is almost the same as the size of system memory.

Two purposes are considered for this step:

  • To make sure that the dump configuration works and a vmcore file is captured.
  • To find out how long it takes to capture a vmcore file. This helps the system administrator determine whether that much downtime is acceptable.
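
After the test crash, the check of the vmcore location and size can be scripted. The sketch below assumes GNU stat and uses /var/crash as the default dump directory, which is an assumption and should be replaced by the path set in the dump configuration. The function names list_vmcores and mem_total_kb are hypothetical.

```shell
#!/bin/sh
# List every file named vmcore below DUMP_DIR as "SIZE_IN_BYTES PATH".
# Assumption: /var/crash is the dump location; pass another path if not.
# Usage: list_vmcores [DUMP_DIR]
list_vmcores() {
    dump_dir=${1:-/var/crash}
    find "$dump_dir" -type f -name vmcore -exec stat -c '%s %n' {} \;
}

# Print the total physical memory in kB from a meminfo-format file.
# Usage: mem_total_kb [MEMINFO_FILE]
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

Comparing the vmcore size (in bytes) with the value from mem_total_kb (in kB) gives a quick sanity check, and the modification time of the vmcore file compared with the time of the test crash gives a rough idea of how long the dump took.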

Note:

  • The command echo c > /proc/sysrq-trigger manually triggers a kernel panic, which crashes the system and stops all applications. Run it only during a maintenance window.
  • If the system being configured for kdump/netdump/diskdump is a node of a Red Hat Cluster, modify cluster.conf to increase the post_fail_delay parameter. The delay should be long enough for the system to dump its memory before it is fenced by other nodes.
  • If iptables rules are in place in the production environment, make sure the required ports are not blocked.
    • kdump can transfer the vmcore file over the network, for example via ssh or nfs. All ports used by ssh or nfs must be accepted by the iptables rules.
    • The netdump server and client both use UDP port 6666 by default. The server port can be set in /etc/sysconfig/netdump-server (for details, see man 8 netdump-server). The client port can be specified in /etc/sysconfig/netdump.
    • netconsole needs UDP port 514 for syslog.

7. sosreport/sysreport

Refer to the following articles.

8. Capturing a vmcore from a hypervisor

Refer to the following articles.

References
  • kdump: /usr/share/doc/kexec-tools-<version>/kexec-kdump-howto.txt
  • diskdump: /usr/share/doc/diskdumputils-<version>/README
  • sysrq: /usr/share/doc/kernel-doc-<version>/Documentation/sysrq.txt
  • nmi_watchdog: /usr/share/doc/kernel-doc-<version>/Documentation/nmi_watchdog.txt
