Kdump fails to capture a vmcore on OpenShift 4.16+ nodes
Environment
- Red Hat Enterprise Linux CoreOS (RHCOS)
- Red Hat OpenShift Container Platform (RHOCP) 4.16 and above
Issue
- On OpenShift Container Platform (OCP) 4.16 or newer, when kdump is enabled per the instructions in the troubleshooting documentation, the kdump service starts properly, but no vmcore file is written when the node crashes.
- After the node pivots to the kdump kernel, a kernel panic like the following appears on the console:
/usr/bin/sh: error while loading shared libraries: libtinfo.so.6: cannot open shared object file: No such file or directory
[ 16.073358] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 16.074564] CPU: 0 PID: 1 Comm: init Not tainted 5.14.0-427.42.1.el9_4.x86_64 #1
[ 16.075809] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9 04/01/2014
[ 16.076978] Call Trace:
[ 16.077431] <TASK>
[ 16.077776] dump_stack_lvl+0x34/0x48
[ 16.078426] panic+0x107/0x2f7
[ 16.078934] do_exit.cold+0x15/0x15
[ 16.079524] do_group_exit+0x2d/0x90
[ 16.080131] __x64_sys_exit_group+0x14/0x20
[ 16.080813] do_syscall_64+0x5c/0x90
[ 16.081498] ? srso_return_thunk+0x5/0x5f
[ 16.082212] ? do_syscall_64+0x69/0x90
[ 16.082831] ? srso_return_thunk+0x5/0x5f
[ 16.083476] ? syscall_exit_to_user_mode+0x19/0x40
[ 16.084242] ? srso_return_thunk+0x5/0x5f
[ 16.084840] ? do_syscall_64+0x69/0x90
[ 16.085428] ? srso_return_thunk+0x5/0x5f
[ 16.086064] ? exit_to_user_mode_loop+0xbc/0x130
[ 16.086866] ? srso_return_thunk+0x5/0x5f
[ 16.087562] ? exit_to_user_mode_prepare+0xb6/0x100
[ 16.088353] ? srso_return_thunk+0x5/0x5f
[ 16.089047] ? syscall_exit_to_user_mode+0x19/0x40
[ 16.089971] ? srso_return_thunk+0x5/0x5f
[ 16.090619] ? do_syscall_64+0x69/0x90
[ 16.091233] ? srso_return_thunk+0x5/0x5f
[ 16.091851] ? syscall_exit_to_user_mode+0x19/0x40
[ 16.092640] ? srso_return_thunk+0x5/0x5f
[ 16.093307] ? do_syscall_64+0x69/0x90
[ 16.093901] ? do_syscall_64+0x69/0x90
[ 16.094513] entry_SYSCALL_64_after_hwframe+0x77/0xe1
Resolution
- Increase the 'crashkernel' memory reservation from 256 MB to 512 MB, then try to reproduce the issue.
- Apply the change with the following steps:
1. # rpm-ostree kargs --delete="crashkernel=256M" --append="crashkernel=512M"
2. # rpm-ostree kargs --> Lists the kernel arguments; confirm that crashkernel=512M is present.
3. # systemctl reboot --> Reboots the node so that the change takes effect.
- After the node has rebooted, test the configuration:
1. # echo 1 > /proc/sys/kernel/sysrq --> This will enable the SysRq functions, including 'sysrq-trigger'.
2. # echo c > /proc/sysrq-trigger --> This will panic and reboot the system.
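Before forcing the panic, it can help to confirm that the larger reservation is really in place and that a crash kernel is loaded. The following is a minimal sketch, not part of kexec-tools: the function names crashkernel_value, size_to_mib and kdump_ready are hypothetical helpers, and the sketch assumes the standard /proc/cmdline, /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size interfaces are readable on the node. Only the plain size form (for example 512M) is handled.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; run on the node after the reboot.

# Print the value of the crashkernel= argument found in a kernel command line.
# If the argument appears more than once, the last occurrence is printed.
crashkernel_value() (
    set -f                      # no globbing while word-splitting the command line
    value=""
    for arg in $1; do
        case "$arg" in
            crashkernel=*) value="${arg#crashkernel=}" ;;
        esac
    done
    [ -n "$value" ] && printf '%s\n' "$value"
)

# Convert a plain size such as 512M or 1G to MiB.
# Range syntax (for example 1G-4G:192M) and @offset forms are rejected.
size_to_mib() (
    case "$1" in
        *[Gg]) num="${1%[Gg]}"; factor=1024 ;;
        *[Mm]) num="${1%[Mm]}"; factor=1 ;;
        *) return 1 ;;
    esac
    case "$num" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo $(( num * factor ))
)

# Report whether a crash kernel is loaded and how much memory is reserved.
# $1: directory holding kexec_crash_loaded and kexec_crash_size (default /sys/kernel)
# $2: minimum reservation in MiB (default 512, the value used in the steps above)
kdump_ready() (
    dir="${1:-/sys/kernel}"
    min_mib="${2:-512}"
    loaded=$(cat "$dir/kexec_crash_loaded" 2>/dev/null) || return 2
    bytes=$(cat "$dir/kexec_crash_size" 2>/dev/null) || return 2
    mib=$(( bytes / 1024 / 1024 ))
    if [ "$loaded" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    if [ "$mib" -lt "$min_mib" ]; then
        echo "reserved ${mib} MiB is below ${min_mib} MiB"
        return 1
    fi
    echo "ready: ${mib} MiB reserved, crash kernel loaded"
)

# Example use on a node:
#   size_to_mib "$(crashkernel_value "$(cat /proc/cmdline)")"
#   kdump_ready
```

If kdump_ready reports that the crash kernel is not loaded, or that less than 512 MiB is reserved, the forced panic is unlikely to produce a vmcore. The reserved size reported by the kernel may differ slightly from the requested value on some platforms, so treat the threshold as a guide.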
Root Cause
- The Ignition modules in the CoreOS kdump initramfs in OCP 4.16+ are larger than in previous releases, so the image no longer fits in the 256 MB reservation. Red Hat engineering is working on shrinking the generated image so that it fits in a smaller memory reservation.
That work is being tracked in the following issue:
https://issues.redhat.com/browse/OCPBUGS-44368
Diagnostic Steps
- Capture the node console output from the moment the panic is triggered until the kernel panic in the kdump environment appears.
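Once the console output has been saved to a file, it can be checked for the signature shown in the Issue section. This is a minimal sketch under stated assumptions: the function name kdump_console_signature is hypothetical, the log file path is whatever file the console output was saved to, and the match strings are taken from the example panic above. The failing library can vary, so only the generic loader error text is matched.

```shell
#!/bin/sh
# Hypothetical helper: scan a saved console log for the panic signature.
# $1: path to the saved console log
kdump_console_signature() (
    log="$1"
    [ -r "$log" ] || { echo "cannot read log"; return 2; }
    if grep -q 'Kernel panic - not syncing: Attempted to kill init' "$log"; then
        if grep -q 'error while loading shared libraries' "$log"; then
            echo "match: init died after a shared library load error"
        else
            echo "partial match: init died, no shared library error seen"
        fi
        return 0
    fi
    echo "signature not found"
    return 1
)

# Example use:
#   kdump_console_signature /path/to/saved-console.log
```

A full match suggests the node is hitting the issue described here. A partial match or no match means the console output should be reviewed manually.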
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.