How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat OpenShift Container Platform
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- Red Hat CoreOS (RHCOS)
Issue
- How can kdump be configured on Red Hat OpenShift Container Platform (RHOCP) cluster nodes to investigate node crashes?
- What is the procedure to configure kexec/kdump on Red Hat CoreOS systems?
- How can unexpected reboots be investigated and troubleshot effectively?
- What steps are required to generate a kernel memory core dump (vmcore) on a system?
- Root Cause Analysis (RCA) of kernel panic / server crash is required
- The system entered a hung state or became unresponsive. How can this issue be effectively troubleshot?
- How much time is required to capture a vmcore?
- How much disk space is required to generate a vmcore?
Resolution
Review the kdump documentation for the specific Red Hat OpenShift Container Platform (RHOCP) version in use to ensure the service is configured appropriately based on the deployment requirements.
- For Red Hat CoreOS 4 and Red Hat OpenShift Container Platform 4, refer to the Troubleshooting operating system issues documentation for each minor version:
For certain hardware and workloads, the memory reserved for the crash kernel might not be enough; adjust it as required.
NOTES:
- Starting with RHOCP 4.13+, which is based on RHEL 9 kernels, crashkernel=auto is not supported. To automatically reserve an appropriate amount of memory for the kdump kernel, use the crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M value. For more information, refer to: crashkernel=auto parameter is deprecated in RHEL9.
- The classic syntax crashkernel=size[@offset] is also available, though it is recommended to use crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M. If the classic syntax is set, also set auto_reset_crashkernel no to avoid overwriting the value when updating kexec-tools; otherwise the value will be overwritten with the default. See the KCS.
- When making a change to the main kdump configuration file (/etc/kdump.conf), restarting the service is required via the systemctl restart kdump command. For clusters using the OVN network, a reboot is needed.
- If the node is going to be rebooted later, this command can be skipped.
To configure kdump more extensively, or in non-standard environments, please refer to the Extended KDUMP Configurations section.
Contents
- Prerequisites
- Installing KDUMP
- Extended KDUMP Configurations
- Sizing Local Dump Targets
- Testing KDUMP
- Vmcore Capture Time
- Controlling which events trigger a Kernel Panic
Prerequisites
- For dumping cores to a network target, access to a server over NFS or SSH is required.
- Whether dumping locally or to a network target, a volume, device or directory with enough free disk space is needed to hold the core file. See the Sizing Local Dump Targets section for more information.
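As a hedged illustration of the network targets mentioned above, an /etc/kdump.conf for an NFS dump target might look like the fragment below. The server name and export path are placeholders, not values from this article:

```
# Example NFS dump target (hostname and export are placeholders)
nfs server.example.com:/export/crashdumps
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
```

For an SSH target, the nfs line would instead be an ssh user@host line plus an sshkey line pointing at the private key, and core_collector needs makedumpfile's -F (flattened) flag.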
Installing KDUMP
Verify the kexec-tools package is installed on the node.
# rpm -q kexec-tools
RHCOS ships the kexec-tools package by default. If it is not installed, please open a support case with Red Hat.
Extended KDUMP Configurations
If the system or environment requires an extended or non-standard kdump configuration, please refer to the below links:
- For RHEL and AWS EC2 Nitro, refer to Trigger a Kernel Panic on AWS EC2 Nitro Instances by NMI method
- To send an NMI to Azure guests, refer to How to send NMI to an Azure VM
- For system hangs on VMware guests, refer to How to capture a vmcore of hung Red Hat Enterprise Linux VMware guest system using vmss2core tool?
- To send an NMI to VMware guests, refer to How to use VMWare ESX command line to force NMI panic on RHEL guest O/S
- For KVM and RHEV, refer to How to capture vmcore dump from a KVM guest?
Note: Though KVM and RHEV guests are not required to use the aforementioned method, it is an additional option for capturing a vmcore when the virtual guest is unresponsive.
Sizing Local Dump Targets
The size of the vmcore file, and therefore the amount of disk space necessary to store it, will mainly depend on the following:
- How much of the system’s RAM was in use at the moment of the kernel panic
- What type of data is stored in the RAM
- The type of compression and the dump level stated in the “core_collector” parameter of the /etc/kdump.conf file
In more recent RHCOS versions, and with the default compression level discarding pages not related to kernel memory, the average size of a vmcore is relatively small (when compared to total system RAM). Please refer to the latest user statistics in order to estimate the amount of free space to reserve for the dump target.
That being said, the only reliable way to guarantee that a full vmcore is generated is for the dump target to have free space at least equal in size to the physical RAM.
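The worst-case sizing rule above can be checked with a short sketch. DUMP_PATH below is an assumption matching the default path /var/crash; replace it with the path value from /etc/kdump.conf:

```shell
# Conservative sizing check: the dump target should have at least as much
# free space as physical RAM to guarantee a full vmcore fits.
# DUMP_PATH is an example value, not taken from this article.
DUMP_PATH=/var/crash
[ -d "$DUMP_PATH" ] || DUMP_PATH=/   # fall back for illustration only

ram_bytes=$(awk '/^MemTotal:/ {print $2 * 1024}' /proc/meminfo)
free_bytes=$(df --output=avail -B1 "$DUMP_PATH" | tail -n 1 | tr -d ' ')

if [ "$free_bytes" -ge "$ram_bytes" ]; then
    echo "OK: dump target can hold a worst-case full vmcore"
else
    echo "WARNING: free space ($free_bytes B) is less than RAM ($ram_bytes B)"
fi
```

In practice the compressed vmcore is usually much smaller, so this check is deliberately pessimistic.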
To determine the actual size of a vmcore, and to verify that the desired kdump configuration works, it is recommended to manually crash the system.
Note: Testing requires down time for the intended systems.
Testing KDUMP
Regular methods of kdump testing for all RHCOS versions:
- After configuring kdump, please schedule down time for the relevant systems in order to manually test a system crash and to verify that a full vmcore is generated in the configured dump target.
Warning: These testing procedures will panic the kernel, killing all services on the machine.
- It is recommended to first test the kdump configuration by issuing a kernel panic via the SysRq-Trigger.
The SysRq-Facility is a special key combination that, when enabled, allows the user to force a system’s kernel to respond to a specific command. This feature is mostly used for troubleshooting kernel-related problems, or to force a response from a system while it is in a non-responsive state (hang).
- After confirming a full vmcore is generated from a SysRq panic, it is recommended to continue testing by issuing a Non-Maskable Interrupt (NMI). This can be triggered by pushing an NMI button.
An NMI is an interrupt that is unable to be ignored by standard operating system mechanisms. It is generally used only for critical hardware errors. This feature can be used to signal an operating system when other standard input mechanisms (keyboard, ssh, network, etc.) have ceased to function.
- Triggering a panic via the NMI button is a more trustworthy method of obtaining a vmcore when the system hangs than using the SysRq-Facility trigger, as in some cases the NMI is able to force the system to respond even when standard keyboard input will not be accepted.
The preferred testing procedure is described below:
- Test the kdump configuration by using the SysRq-Facility to trigger a kernel panic. If kdump works correctly, the system is rebooted and a full vmcore is saved.
- If a full vmcore is saved, configure the NMI-related sysctl parameters.
- Reboot the system once to make sure the configuration is persistent.
- For testing the NMI button, push the button to trigger a kernel panic. If the NMI button works correctly, the system is rebooted and a full vmcore is saved.
Configuring and manually crashing a system:
Once it is verified that kdump is functional with the current configuration, the changes can be propagated to all the nodes via machine-config.
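As a sketch of the machine-config propagation step, a MachineConfig that reserves crash-kernel memory across a pool could look like the fragment below; the object name and role label are illustrative, not mandated by this article:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker  # target pool; use "master" for control plane
  name: 99-worker-kdump                             # example name
spec:
  kernelArguments:
    - crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
```

Applying such an object with oc apply -f triggers a rolling reboot of the pool, so schedule the change accordingly.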
Make sure the node is cordoned; this ensures no new workloads are scheduled on it.
# oc adm cordon <node-name>
SSH to the node and configure the SysRq-Facility to permit all triggers:
# sysctl -w kernel.sysrq=1
OR
# echo 1 > /proc/sys/kernel/sysrq
A panic can be triggered by issuing the echo c > /proc/sysrq-trigger command.
It can also be triggered using the <ALT> + <SysRq> + C key combination on the console.
- For more information about the SysRq-Facility, please refer to What is the SysRq-Facility and how do I use it?
Confirm a full vmcore is generated, and move on to configure the NMI related parameters.
If only an incomplete vmcore was saved, please refer to the Sizing Local Dump Targets and Diagnostic Steps sections.
To configure the kernel to panic and generate a vmcore when the NMI button is pushed, enter the following commands:
# vim /etc/sysctl.conf
…
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
Afterwards, reboot the system once and make sure the NMI configuration persisted.
Then, generate an NMI from the respective platform and verify that a full vmcore has been generated in the dump path.
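The persistence check can be sketched as below; the snippet only reads the tunables back and changes nothing, and since tunable availability varies by kernel and platform, a missing entry is reported rather than treated as an error:

```shell
# Read back the NMI-related tunables after the reboot.
# Each should print 1 if the sysctl.conf change persisted.
checked=0
for t in unknown_nmi_panic panic_on_io_nmi panic_on_unrecovered_nmi; do
    f="/proc/sys/kernel/$t"
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$t" "$(cat "$f")"
    else
        printf '%s: not exposed by this kernel\n' "$t"
    fi
    checked=$((checked + 1))
done
```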
- For more information on Non-Maskable Interrupts, please refer to An Introduction to Non-Maskable Interrupts (NMIs) and What is an NMI and what can I use it for? at our Knowledge Base.
Note: NMI functionality depends on the system's hardware or virtualization platform. If uncertain how to perform this action, contact the appropriate hardware or platform vendor.
Vmcore Capture Time
Dumping time depends on the options that are used for its configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?
Controlling which events trigger a Kernel Panic
There are several parameters that control under which circumstances kdump is activated. Most of these can be enabled via sysctl tunable parameters; the most commonly used are listed below.
When configuring a sysctl tunable via a sysctl.conf file, make sure to enforce the rule and make it persistent by issuing the sysctl -p <file path> command via sudo or the root user (if a file path is not specified, the default is /etc/sysctl.conf).
Note: While it is possible to enable multiple such tunables simultaneously, so as to make sure a vmcore is generated in as many scenarios as possible, please verify beforehand that each tunable is suitable for the expected workload and environment.
- The below kernel sysctl parameters cannot be changed to a different value than shown below if protectKernelDefaults is set to true in kubelet.conf. From Red Hat OpenShift Container Platform (RHOCP) 4.13 onwards, kubelet.conf has protectKernelDefaults set to true by default. To configure the parameters listed below, please set protectKernelDefaults to false in kubelet.conf.
kernel.keys.root_maxbytes=25000000
kernel.keys.root_maxkeys=1000000
kernel.panic=10
kernel.panic_on_oops=1
vm.overcommit_memory=1
vm.panic_on_oom=0
System hangs due to an NMI
- Occurs when a Non-Maskable Interrupt is issued, usually due to a hardware fault.
- To configure the kernel to panic when an NMI occurs, add the following to the sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
- For more information on configuring the system to panic when an NMI is issued, please refer to How can I configure my system to crash when NMI switch is pushed?
Out of Memory (OOM) Kill event
- Occurs when a memory request (Page Fault or kernel memory allocation) is made while not enough memory is available, so the system terminates an active task (usually a non-prioritized process using a lot of memory).
- To configure the kernel to panic when an OOM-Kill event occurs, add the following to the sysctl.conf file:
# vim /etc/sysctl.conf
…
vm.panic_on_oom = 1
- For more information on configuring the system to panic at OOM-Kill, and other relevant tunables, refer to What are the sysctl tunables for the OOM Killer configuration, available for RHEL6 and later?
CPU Soft Lockup event
- Occurs when a task is using the CPU for more than the allowed threshold (the tunable kernel.watchdog_thresh, default is 20 seconds).
- To configure the kernel to panic when a CPU Soft Lockup occurs, add the following to the sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.softlockup_panic = 1
- For more information on CPU Soft Lockups, refer to the What is a CPU soft lockup? article.
- Note: This setting is not recommended for virtual machines, as virtual machines are more prone to soft lockups, especially when the hypervisor is overcommitted. For more information, refer to Virtual machine reports a "BUG: soft lockup" (or multiple at the same time).
Hung / Blocked Task event
- Occurs when a process is stuck in Uninterruptible-Sleep (D-state) for more time than the allowed threshold (the tunable kernel.hung_task_timeout_secs, default is 120 seconds).
- To configure the kernel to panic when a task becomes hung, add the following to the sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.hung_task_panic = 1
- For more information regarding the Hung Task Check mechanism and its relevant tunables, refer to the How do I use hung task check in RHEL? solution.
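The individual tunables above can also be collected into a single drop-in file; the path below is an example, and as noted earlier in this section, each tunable should be verified as suitable for the workload before it is enabled:

```
# /etc/sysctl.d/99-panic-triggers.conf (example path)
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
vm.panic_on_oom = 1
kernel.softlockup_panic = 1
kernel.hung_task_panic = 1
```

Load the file as root with sysctl -p /etc/sysctl.d/99-panic-triggers.conf, then reboot once to confirm the settings persist.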
To configure kdump more extensively, or in non-standard environments, please refer to the Extended KDUMP Configurations section.
Diagnostic Steps
For issues with configuring kdump or generating a full vmcore, refer to the links below that cover common problems.
- Kdump fails to create dumpfile via SSH or NFS on OVN based clusters
- Kdump fails to capture a vmcore on KVM and VMware systems
- How to configure kdump over remote target path (NFS) on Red Hat CoreOS
- Why kdump fails to generate vmcore on CoreOS with error from coreos-propagate-multipath-conf?
- Why does kdump fail to dump vmcore on OCP node with cipher_null-ecb type disk encryption?
- While testing kdump on RHCOS node, node falls into emergency shell on reboot
If the issues persist or unexpected behavior occurs, submit a new Technical Support case along with the data listed below:
- Make sure the kdump service is enabled and active.
# systemctl status kdump
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; preset: disabled)
Active: active (exited) since Wed 2025-05-14 16:21:37 UTC; 51s ago
Process: 1149 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
Main PID: 1149 (code=exited, status=0/SUCCESS)
CPU: 28.269s
- Make sure the reserved memory for the crash kernel is set:
# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/vmlinuz-5.14.0-284.104.1.el9_2.x86_64 ostree=/ostree/boot.1/rhcos/28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/0 ignition.platform.id=openstack console=ttyS0,115200n8 console=tty0 root=UUID=8800aedb-a542-4e6d-b484-12ff1f2a62df rw rootflags=prjquota boot=UUID=48df506b-3723-4ca0-9f1b-656194c02723 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
- Make sure the crashkernel parameter is also present in the rpm-ostree db. If not, kdump will not be functional on the next reboot.
# rpm-ostree kargs
$ignition_firstboot ostree=/ostree/boot.1/rhcos/28af98682992630498bb0eb62633902be1a9fbd025565b98a6cb9270233aafe8/0 ignition.platform.id=openstack console=ttyS0,115200n8 console=tty0 root=UUID=8800aedb-a542-4e6d-b484-12ff1f2a62df rw rootflags=prjquota boot=UUID=48df506b-3723-4ca0-9f1b-656194c02723 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
- Check if the reserved memory for the crash kernel is enough:
# kdumpctl showmem
kdump: Reserved 256MB memory for crash kernel
# kdumpctl estimate
Reserved crashkernel: 256M
Recommended crashkernel: 256M
Kernel image size: 36M
Kernel modules size: 10M
Initramfs size: 65M
Runtime reservation: 64M
Large modules:
xfs: 2048000
kvm: 1138688
- Check journalctl logs for any failures:
# journalctl -u kdump -f
May 14 16:21:31 worker-2 dracut[1665]: Stored kernel commandline:
May 14 16:21:31 worker-2 dracut[1665]: No dracut internal kernel commandline stored in the initramfs
May 14 16:21:31 worker-2 dracut[1665]: *** Install squash loader ***
May 14 16:21:32 worker-2 dracut[1665]: *** Squashing the files inside the initramfs ***
May 14 16:21:37 worker-2 dracut[1665]: *** Squashing the files inside the initramfs done ***
May 14 16:21:37 worker-2 dracut[1665]: *** Creating image file '/var/lib/kdump/initramfs-5.14.0-284.104.1.el9_2.x86_64kdump.img' ***
May 14 16:21:37 worker-2 dracut[1665]: *** Creating initramfs image file '/var/lib/kdump/initramfs-5.14.0-284.104.1.el9_2.x86_64kdump.img' done ***
May 14 16:21:37 worker-2 kdumpctl[1157]: kdump: kexec: loaded kdump kernel
May 14 16:21:37 worker-2 kdumpctl[1157]: kdump: Starting kdump: [OK]
May 14 16:21:37 worker-2 systemd[1]: Finished Crash recovery kernel arming.
- Make sure dump target is configured:
# cat /etc/kdump.conf | egrep -v "^$|#"
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
- Make sure the kdump initramfs has the necessary modules and configuration files needed to capture the vmcore:
# lsinitrd /var/lib/kdump/initramfs-$(uname -r)kdump.img | egrep -i "<module-name>|<config-file-name>"
- Capture serial console logs at the time of crash.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.