RHEL host reboots on its own when all storage paths fail and 'kernel.hung_task_panic = 1'
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9
- kernel.hung_task_panic = 1 in /etc/sysctl.conf, or /proc/sys/kernel/hung_task_panic contains the value 1
- Utilizing device-mapper-multipath for managing redundant storage paths
  - Device is configured to queue I/O through no_path_retry and/or queue_if_no_path
Issue
- We're doing storage failure testing and when we take down all paths on one host, that host reboots on its own
- A cluster node is losing all paths to a GFS2 storage device and then it stops responding and gets fenced
- Why does a host reboot with kernel.hung_task_panic when storage devices become inaccessible?
Resolution
Disable kernel.hung_task_panic (set it to 0) unless it is needed to diagnose a specific problem.
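As a quick way to confirm the current state on a host, the following minimal sketch reads the /proc file named in the Environment section. The function name hung_task_panic_enabled is hypothetical, and the sketch assumes the file contains a single integer, as it does on kernels that provide hung task detection.

```python
from pathlib import Path

# Path taken from the Environment section of this article.
HUNG_TASK_PANIC = Path("/proc/sys/kernel/hung_task_panic")


def hung_task_panic_enabled(path: Path = HUNG_TASK_PANIC) -> bool:
    """Return True if the sysctl file holds a non-zero integer."""
    return int(path.read_text().strip()) != 0


if __name__ == "__main__":
    if HUNG_TASK_PANIC.exists():
        state = "enabled" if hung_task_panic_enabled() else "disabled"
        print(f"kernel.hung_task_panic is {state}")
    else:
        # Assumption: a missing file means the kernel lacks hung task detection.
        print("kernel.hung_task_panic is not available on this kernel")
```

If the setting is enabled, it can be turned off at runtime with sysctl -w kernel.hung_task_panic=0 and made persistent by setting kernel.hung_task_panic = 0 in /etc/sysctl.conf.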
To keep processes from blocking indefinitely on I/O when all paths have failed (which in turn can trigger kernel.hung_task_panic), set a finite limit on queueing for the affected multipath devices by using a reasonable no_path_retry value.
Root Cause
When kernel.hung_task_panic is set to 1, any task that stays blocked in uninterruptible sleep (D state) in the kernel for longer than kernel.hung_task_timeout_secs triggers a kernel panic.
When all paths to a multipath device fail and that device is configured with indefinite queueing (no_path_retry queue or features "1 queue_if_no_path") or a high retry count (no_path_retry 200, for example), processes waiting on I/O to the affected device can stay blocked for longer than kernel.hung_task_timeout_secs, which causes the panic.
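The relationship between the queueing setting and the hung-task timeout (the kernel.hung_task_timeout_secs sysctl) can be sketched as simple arithmetic. This is a rough estimate rather than an exact model: it assumes that multipath keeps queueing for about no_path_retry multiplied by polling_interval seconds after the last path fails, and it uses 5 seconds for polling_interval and 120 seconds for the timeout as assumed defaults that should be checked against the actual host. The function names are hypothetical.

```python
def queue_window_secs(no_path_retry, polling_interval=5):
    """Approximate seconds that I/O stays queued after the last path fails.

    Returns None for indefinite queueing ("queue") and 0 for "fail".
    """
    if no_path_retry == "queue":
        return None
    if no_path_retry == "fail":
        return 0
    return int(no_path_retry) * polling_interval


def may_outlast_hung_task_timeout(no_path_retry, polling_interval=5,
                                  hung_task_timeout_secs=120):
    """Return True if queued I/O could block a task past the timeout."""
    if hung_task_timeout_secs == 0:
        # A timeout of 0 disables hung task detection entirely.
        return False
    window = queue_window_secs(no_path_retry, polling_interval)
    return window is None or window >= hung_task_timeout_secs


if __name__ == "__main__":
    for value in ("queue", 200, 12, "fail"):
        risky = may_outlast_hung_task_timeout(value)
        verdict = "may exceed" if risky else "stays under"
        print(f"no_path_retry {value}: {verdict} the hung-task timeout")
```

Under these assumptions, no_path_retry 200 gives a queueing window of roughly 1000 seconds, far longer than a 120-second timeout, while a value such as 12 gives roughly 60 seconds and lets the I/O fail before the detector fires.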