How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat OpenShift Container Platform

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
  • Red Hat Enterprise Linux CoreOS (RHCOS)

Issue

  • How can kdump be configured on Red Hat OpenShift Container Platform (RHOCP) cluster nodes to investigate node crashes?
  • What is the procedure to configure kexec/kdump on Red Hat CoreOS systems?
  • How can unexpected reboots be investigated and resolved effectively?
  • What steps are required to generate a kernel memory core dump (vmcore) on a system?
  • Root Cause Analysis (RCA) of kernel panic / server crash is required
  • The system entered a hung state or became unresponsive. How can this issue be troubleshot effectively?
  • How much time is required to capture a vmcore?
  • How much disk space is required to generate a vmcore?

Resolution

Review the kdump documentation for the specific Red Hat OpenShift Container Platform (RHOCP) version in use to ensure the service is configured appropriately based on the deployment requirements.

  • For Red Hat Enterprise Linux CoreOS 4 and Red Hat OpenShift Container Platform 4, refer to the Troubleshooting operating system issues documentation below for each minor version:

For certain hardware and workloads, the default memory reserved for the crash kernel might not be enough; set it according to the requirements of the environment.

NOTES:

  • Starting with RHOCP 4.13, which is based on RHEL 9 kernels, crashkernel=auto is no longer supported. To reserve an appropriate amount of memory for the kdump kernel automatically, use the value crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M. For more information, refer to: crashkernel=auto parameter is deprecated in RHEL9.
  • The classic syntax crashkernel=size[@offset] is also available, though the range syntax crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M is recommended. If the classic syntax is used, set auto_reset_crashkernel no to prevent the value from being overwritten with the default when kexec-tools is updated: See the KCS
  • When making a change to the main kdump configuration file (/etc/kdump.conf), restart the service with the systemctl restart kdump command. For clusters using the OVN network plugin, a node reboot is needed.
  • If the node is going to be rebooted later, this command can be skipped.
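As a rough illustration of the range syntax, the sketch below computes the reservation that crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M implies for a given amount of RAM. The function name and MiB units are illustrative; this is not the kernel's actual parser.

```shell
#!/bin/sh
# Sketch: evaluate a crashkernel range string for a given RAM size.
# Ranges are "start-end:size" with start inclusive and end exclusive;
# an empty end ("64G-") means "and above". Sizes here are in MiB.
crashkernel_for_ram() {
    ram_mib=$1
    ranges="1024-4096:192,4096-65536:256,65536-:512"
    for r in $(echo "$ranges" | tr ',' ' '); do
        span=${r%%:*}; size=${r##*:}
        lo=${span%%-*}; hi=${span##*-}
        if [ "$ram_mib" -ge "$lo" ] && { [ -z "$hi" ] || [ "$ram_mib" -lt "$hi" ]; }; then
            echo "${size}M"
            return 0
        fi
    done
    echo "0M"   # below the first range: nothing is reserved
}

crashkernel_for_ram 2048    # 2 GiB RAM   -> 192M
crashkernel_for_ram 16384   # 16 GiB RAM  -> 256M
crashkernel_for_ram 131072  # 128 GiB RAM -> 512M
```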

To configure kdump more extensively, or in non-standard environments, please refer to the Extended KDUMP Configurations section.

Contents

  1. Prerequisites
  2. Installing KDUMP
  3. Extended KDUMP Configurations
  4. Sizing Local Dump Targets
  5. Testing KDUMP
  6. Vmcore Capture Time
  7. Controlling which events trigger a Kernel Panic

Prerequisites

  • For dumping cores to a network target, access to a server over NFS or SSH is required.
  • Whether dumping locally or to a network target, a volume, device or directory with enough free disk space is needed to hold the core file. See the Sizing Local Dump Targets section for more information.

Installing KDUMP

Verify the kexec-tools package is installed on the node.

# rpm -q kexec-tools

RHCOS ships kexec-tools package by default. If it is not installed, please open a support case with Red Hat.

Extended KDUMP Configurations

If the system or environment requires an extended or non-standard kdump configuration, please refer to the below links:

Note: Though KVM and RHEV guests are not required to use the aforementioned methods, they provide an additional option for capturing a vmcore when a virtual guest is unresponsive.

Sizing Local Dump Targets

The size of the vmcore file, and therefore the amount of disk space necessary to store it, will mainly depend on the following:

  • How much of the system’s RAM was in use at the moment of the kernel panic
  • What type of data is stored in RAM.
  • The compression type and dump level specified in the core_collector parameter of the /etc/kdump.conf file

In more recent RHCOS versions, and with the default compression level discarding pages not related to kernel memory, the average size of a vmcore is relatively small (when compared to total system RAM). Please refer to the latest user statistics in order to estimate the amount of free space to reserve for the dump target.

That being said, the only reliable way to guarantee that a full vmcore is generated is for the dump target to have free space at least equal in size to the physical RAM.

To determine the actual size of a vmcore, and to verify that the desired kdump configuration works, it is recommended to manually crash the system.
Note: Testing requires down time for the intended systems.
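To size a local dump target for the worst case described above, reserve free space equal to physical RAM. A minimal sketch reading /proc/meminfo (the rounding to whole GiB is an illustrative choice):

```shell
#!/bin/sh
# Sketch: compute the worst-case dump-target size (equal to physical RAM),
# rounded up to whole GiB. Compression usually shrinks the vmcore
# considerably, but only this much free space guarantees a full dump.
ram_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
ram_gib=$(( (ram_kib + 1048575) / 1048576 ))
echo "worst-case dump target size: ${ram_gib} GiB"
```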

Testing KDUMP

Regular methods of kdump testing for all RHCOS versions:

  • After configuring kdump, please schedule down time for the relevant systems in order to manually test a system crash and to verify that a full vmcore is generated in the configured dump target.
    Warning: These testing procedures will panic the kernel, killing all services on the machine.

  • It is recommended to first test the kdump configuration by issuing a kernel panic via the SysRq-Trigger.
    The SysRq-Facility is a special key combination that, when enabled, allows the user to force a system’s kernel to respond to a specific command. This feature is mostly for troubleshooting kernel-related problems, or to force a response from a system while it is in a non-responsive state (hang).

  • After confirming a full vmcore is generated from a SysRq panic, it is recommended to continue testing by issuing a Non-Maskable Interrupt (NMI). This can be triggered by pushing an NMI button.
    An NMI is an interrupt that is unable to be ignored by standard operating system mechanisms. It is generally used only for critical hardware errors. This feature can be used to signal an operating system when other standard input mechanisms (keyboard, ssh, network, etc.) have ceased to function.

    • Triggering a panic via the NMI button is a more reliable method of obtaining a vmcore when the system hangs than using the SysRq-Facility trigger, as in some cases the NMI can force the system to respond even when standard keyboard input is no longer accepted.

The preferred testing procedure is described below:

  1. Test the kdump configuration by using the SysRq-Facility to trigger a kernel panic. If kdump works correctly, the system is rebooted and a full vmcore is saved.
  2. If a full vmcore is saved, configure the NMI-related sysctl parameters.
  3. Reboot the system once to make sure the configuration is persistent.
  4. For testing the NMI button, push the button to trigger a kernel panic. If the NMI button works correctly, the system is rebooted and a full vmcore is saved.

Configuring and manually crashing a system:

Once it is verified that kdump is functional with the current configuration, the changes can be propagated to all the nodes via machine-config.

Make sure the node is cordoned; this ensures no new workloads are scheduled on it.

# oc adm cordon <node-name>

SSH to the node and configure the SysRq-Facility to permit all triggers:

# sysctl -w kernel.sysrq=1
   OR
# echo 1 > /proc/sys/kernel/sysrq

A panic can be triggered by issuing the echo c > /proc/sysrq-trigger command.
It can also be triggered using the <ALT> + <SysRq> + C key combination on the console.
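Before triggering the test crash, it can help to confirm that the SysRq setting took effect. A small sketch (the helper name is illustrative; the procedure above sets kernel.sysrq=1, which enables all triggers including "c"):

```shell
#!/bin/sh
# Sketch: report whether the SysRq facility is fully enabled before a
# manual test crash.
check_sysrq() {
    if [ "$1" -eq 1 ]; then
        echo "all SysRq functions enabled"
    else
        echo "kernel.sysrq=$1 - set it to 1 before the test crash"
    fi
}
check_sysrq "$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo 0)"
```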

Confirm a full vmcore is generated, and move on to configure the NMI related parameters.
If only an incomplete vmcore was saved, please refer to the Sizing Local Dump Targets and Diagnostic Steps sections.
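A quick way to tell whether the last capture completed is the file name: kdump writes the dump as vmcore-incomplete and renames it to vmcore only when it finishes. A hedged sketch (the function name is illustrative; the default path matches path /var/crash in kdump.conf):

```shell
#!/bin/sh
# Sketch: check the newest crash directory under the dump path for a
# complete vmcore.
check_vmcore() {
    dump_path=${1:-/var/crash}
    latest=$(ls -1dt "$dump_path"/*/ 2>/dev/null | head -n1)
    if [ -z "$latest" ]; then
        echo "no crash directory found under $dump_path"
        return 1
    fi
    if [ -e "$latest/vmcore" ]; then
        echo "complete vmcore: $latest/vmcore"
    elif [ -e "$latest/vmcore-incomplete" ]; then
        echo "incomplete vmcore: $latest/vmcore-incomplete"
        return 1
    else
        echo "no vmcore in $latest"
        return 1
    fi
}

check_vmcore /var/crash
```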

To configure the kernel to panic (so kdump generates a vmcore) when the NMI button is pushed, enter the following commands:

 # vim /etc/sysctl.conf
…
    kernel.unknown_nmi_panic = 1
    kernel.panic_on_io_nmi = 1
    kernel.panic_on_unrecovered_nmi = 1

Afterwards, reboot the system once and make sure the NMI configuration persisted.
Then, generate an NMI from the respective platform and verify that a full vmcore has been generated in the dump path.

Vmcore Capture Time

Dumping time depends on the options used in the configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?

Controlling which events trigger a Kernel Panic

There are several parameters that control the circumstances under which kdump is activated. Most of these can be enabled via sysctl tunable parameters; the most commonly used are listed below.
When configuring a sysctl tunable via a sysctl.conf file, make it persistent and enforce it by issuing the sysctl -p <file path> command via sudo or as the root user (if a file path is not specified, the default is /etc/sysctl.conf).

Note: While it is possible to enable multiple such tunables simultaneously, so as to make sure a vmcore is generated in as many scenarios as possible, please verify beforehand that each tunable is suitable for the expected workload and environment.

  • The kernel sysctl parameters below cannot be changed to values other than the following if protectKernelDefaults is set to true in kubelet.conf. From Red Hat OpenShift Container Platform (RHOCP) 4.13 onwards, kubelet.conf has protectKernelDefaults set to true by default. To configure the parameters listed below, set protectKernelDefaults to false in kubelet.conf.
kernel.keys.root_maxbytes=25000000
kernel.keys.root_maxkeys=1000000
kernel.panic=10
kernel.panic_on_oops=1
vm.overcommit_memory=1
vm.panic_on_oom=0
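If those parameters must be tuned anyway, protectKernelDefaults can be relaxed through a KubeletConfig resource. A hedged sketch (the resource name is an arbitrary example, and the selector assumes the stock worker pool label):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: relax-protect-kernel-defaults   # arbitrary example name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    protectKernelDefaults: false
```

Note that applying a KubeletConfig rolls out a new rendered MachineConfig and reboots the nodes in the selected pool.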
System hangs due to an NMI
  • Occurs when a Non-Maskable Interrupt is issued, usually due to a hardware fault.

    • To configure the kernel to panic when an NMI occurs, add the following to the sysctl.conf file:
     # vim /etc/sysctl.conf
    …
        kernel.unknown_nmi_panic = 1
        kernel.panic_on_io_nmi = 1
        kernel.panic_on_unrecovered_nmi = 1
    
Out of Memory (OOM) Kill event
  • Occurs when a memory request (Page Fault or kernel memory allocation) is made while not enough memory is available, thus the system terminates an active task (usually a non-prioritized process utilizing a lot of memory).

    • To configure the kernel to panic when an OOM-Kill event occurs, add the following to the sysctl.conf file:
     # vim /etc/sysctl.conf
    …
          vm.panic_on_oom = 1
    
CPU Soft Lockup event
  • Occurs when a task uses the CPU for longer than the allowed threshold (twice the tunable kernel.watchdog_thresh; with the default of 10 seconds, a soft lockup is reported after 20 seconds).

    • To configure the kernel to panic when a CPU Soft Lockup occurs, add the following to the sysctl.conf file:
     # vim /etc/sysctl.conf
    …
          kernel.softlockup_panic = 1
    
Hung / Blocked Task event
  • Occurs when a process is stuck in Uninterruptible-Sleep (D-state) for longer than the allowed threshold (the tunable kernel.hung_task_timeout_secs; default is 120 seconds).

    • To configure the kernel to panic when a task becomes hung, add the following to the sysctl.conf file:
     # vim /etc/sysctl.conf
    …
          kernel.hung_task_panic = 1
    


Diagnostic Steps

For issues with configuring kdump or generating a full vmcore, refer to the links below that cover common problems.

If the issues persist or unexpected behavior occurs, submit a new Technical Support case along with the data listed below:

  • Make sure the kdump service is enabled and active.
# systemctl status kdump
● kdump.service - Crash recovery kernel arming
     Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; preset: disabled)
     Active: active (exited) since Wed 2025-05-14 16:21:37 UTC; 51s ago
    Process: 1149 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
   Main PID: 1149 (code=exited, status=0/SUCCESS)
        CPU: 28.269s
  • Make sure the memory reserved for the crash kernel is set:
# cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/vmlinuz-5.14.0-284.104.1.el9_2.x86_64 ostree=/ostree/boot.1/rhcos/28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/0 ignition.platform.id=openstack console=ttyS0,115200n8 console=tty0 root=UUID=8800aedb-a542-4e6d-b484-12ff1f2a62df rw rootflags=prjquota boot=UUID=48df506b-3723-4ca0-9f1b-656194c02723 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
  • Make sure the crashkernel parameter is also present in the rpm-ostree kargs. If it is not, kdump will not be functional after the next reboot.
# rpm-ostree kargs
$ignition_firstboot ostree=/ostree/boot.1/rhcos/28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/0 ignition.platform.id=openstack console=ttyS0,115200n8 console=tty0 root=UUID=8800aedb-a542-4e6d-b484-12ff1f2a62df rw rootflags=prjquota boot=UUID=48df506b-3723-4ca0-9f1b-656194c02723 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
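The two command lines above can be compared mechanically. A sketch (the helper name is illustrative):

```shell
#!/bin/sh
# Sketch: extract the crashkernel token from a kernel command line so the
# running value (/proc/cmdline) can be compared with the deployed kargs
# (rpm-ostree kargs); a mismatch means kdump breaks on the next reboot.
crashkernel_of() {
    echo "$1" | tr ' ' '\n' | grep '^crashkernel=' || echo "missing"
}

running=$(crashkernel_of "$(cat /proc/cmdline)")
echo "running kernel: $running"
# On the node, also compare: crashkernel_of "$(rpm-ostree kargs)"
```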
  • Check whether the memory reserved for the crash kernel is sufficient:
# kdumpctl showmem
kdump: Reserved 256MB memory for crash kernel

# kdumpctl estimate
Reserved crashkernel:    256M
Recommended crashkernel: 256M

Kernel image size:   36M
Kernel modules size: 10M
Initramfs size:      65M
Runtime reservation: 64M
Large modules:
    xfs: 2048000
    kvm: 1138688
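The Reserved and Recommended lines from kdumpctl estimate can be compared automatically. A sketch (check_reservation is an illustrative helper; on a node, pipe the real command output into it):

```shell
#!/bin/sh
# Sketch: parse `kdumpctl estimate`-style output and flag an undersized
# crash-kernel reservation. The printf sample mirrors the output above;
# on a node use: kdumpctl estimate | check_reservation
check_reservation() {
    awk '
        /^Reserved crashkernel:/    { reserved = $3 + 0 }
        /^Recommended crashkernel:/ { recommended = $3 + 0 }
        END {
            if (reserved >= recommended)
                print "reservation OK (" reserved "M >= " recommended "M)"
            else
                print "reservation too small: have " reserved "M, want " recommended "M"
        }'
}

printf 'Reserved crashkernel:    256M\nRecommended crashkernel: 256M\n' | check_reservation
# prints: reservation OK (256M >= 256M)
```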
  • Check journalctl logs for any failures:
# journalctl -u kdump -f
May 14 16:21:31 worker-2 dracut[1665]: Stored kernel commandline:
May 14 16:21:31 worker-2 dracut[1665]: No dracut internal kernel commandline stored in the initramfs
May 14 16:21:31 worker-2 dracut[1665]: *** Install squash loader ***
May 14 16:21:32 worker-2 dracut[1665]: *** Squashing the files inside the initramfs ***
May 14 16:21:37 worker-2 dracut[1665]: *** Squashing the files inside the initramfs done ***
May 14 16:21:37 worker-2 dracut[1665]: *** Creating image file '/var/lib/kdump/initramfs-5.14.0-284.104.1.el9_2.x86_64kdump.img' ***
May 14 16:21:37 worker-2 dracut[1665]: *** Creating initramfs image file '/var/lib/kdump/initramfs-5.14.0-284.104.1.el9_2.x86_64kdump.img' done ***
May 14 16:21:37 worker-2 kdumpctl[1157]: kdump: kexec: loaded kdump kernel
May 14 16:21:37 worker-2 kdumpctl[1157]: kdump: Starting kdump: [OK]
May 14 16:21:37 worker-2 systemd[1]: Finished Crash recovery kernel arming.
  • Make sure dump target is configured:
# cat /etc/kdump.conf | egrep -v "^$|#"
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
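The same check can be scripted by filtering comments and pulling the two key directives. A sketch (the sample file mirrors the default configuration shown above; on a node, point it at /etc/kdump.conf):

```shell
#!/bin/sh
# Sketch: extract the dump path and core_collector line from a kdump.conf,
# ignoring comments and blank lines.
dump_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | awk '$1 == "path" || $1 == "core_collector"'
}

cat > sample-kdump.conf <<'EOF'
# default RHCOS configuration
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
EOF
dump_settings sample-kdump.conf
```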
  • Make sure kdump initramfs has the necessary modules and configuration files needed to capture vmcore:
# lsinitrd /var/lib/kdump/initramfs-$(uname -r)kdump.img | egrep -i "<module-name>|<config-file-name>"
  • Capture serial console logs at the time of crash.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.