RHEL host reboots on its own when all storage paths fail and 'kernel.hung_task_panic = 1' is set

Solution Unverified - Updated 5 Aug 2024

Environment

Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9
kernel.hung_task_panic = 1 in /etc/sysctl.conf, or /proc/sys/kernel/hung_task_panic contains the value 1
device-mapper-multipath is used to manage redundant storage paths
- The multipath device is configured to queue I/O through no_path_retry and/or queue_if_no_path

Issue

We're doing storage failure testing, and when we take down all paths on one host, that host reboots on its own.
A cluster node loses all paths to a GFS2 storage device, then stops responding and gets fenced.
Why does a host with kernel.hung_task_panic enabled reboot when storage devices become inaccessible?

Resolution

It is recommended to keep kernel.hung_task_panic disabled (set to 0) and to enable it only while diagnosing a specific problem that requires it.
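A minimal sketch of disabling the setting, assuming a root shell and that the persistent value lives in /etc/sysctl.conf (on some systems it may instead sit in a file under /etc/sysctl.d/). The helper name set_hung_task_panic_off is hypothetical and exists only for this illustration.

```shell
# Rewrite a sysctl configuration file so that kernel.hung_task_panic is 0.
# $1 is the file to edit; the default of /etc/sysctl.conf is an assumption.
set_hung_task_panic_off() {
    conf="${1:-/etc/sysctl.conf}"
    if grep -Eq '^[[:space:]]*kernel\.hung_task_panic[[:space:]]*=' "$conf"; then
        # An active setting exists: replace it in place, keeping a .bak copy.
        sed -i.bak -E \
            's/^[[:space:]]*kernel\.hung_task_panic[[:space:]]*=.*/kernel.hung_task_panic = 0/' \
            "$conf"
    else
        # No active setting: append one.
        echo 'kernel.hung_task_panic = 0' >> "$conf"
    fi
}

# Typical use, as root:
#   sysctl -w kernel.hung_task_panic=0        # change the running kernel
#   set_hung_task_panic_off /etc/sysctl.conf  # make the change persistent
#   sysctl -p                                 # reload /etc/sysctl.conf
```

The runtime change with sysctl -w takes effect immediately, but it is lost at the next boot unless the configuration file is updated as well.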

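To prevent processes from blocking indefinitely while waiting for I/O when all paths have failed (which in turn can trigger the hung task panic), set a finite limit on queueing for the affected multipath devices using a reasonable no_path_retry value. Once the retries are exhausted, queueing is disabled and outstanding I/O is failed back to the caller instead of blocking. For the limit to help, the queueing period should end well before kernel.hung_task_timeout_secs elapses.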
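As a rough way to judge whether a no_path_retry value is reasonable, the queueing period can be estimated and compared with the hung task timeout. The sketch below assumes multipathd performs one retry per polling_interval (5 seconds by default), which is an approximation, and that the hung task timeout is the kernel default of 120 seconds. Both function names are hypothetical.

```shell
# Approximate number of seconds that I/O stays queued after the last path fails.
# Assumption: one retry per polling_interval, so window = retries * interval.
queue_window_secs() {
    retries=$1
    polling_interval=${2:-5}       # multipath.conf default
    echo $(( retries * polling_interval ))
}

# Exit status 0 if the queueing window ends before the hung task timeout.
queue_ends_before_hung_timeout() {
    retries=$1
    polling_interval=${2:-5}
    hung_timeout=${3:-120}         # kernel default for hung_task_timeout_secs
    [ "$(queue_window_secs "$retries" "$polling_interval")" -lt "$hung_timeout" ]
}

# Illustrative multipath.conf fragment (example values, not a recommendation):
#   defaults {
#       polling_interval 5
#       no_path_retry    12
#   }
```

For example, queue_window_secs 12 prints 60, which is below a 120 second timeout, while queue_window_secs 200 prints 1000. The live timeout can be read from /proc/sys/kernel/hung_task_timeout_secs and passed as the third argument.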

Root Cause

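When kernel.hung_task_panic is set to 1, any task that remains blocked in uninterruptible sleep (D state) for longer than kernel.hung_task_timeout_secs (120 seconds by default) triggers a kernel panic. The host then reboots if it is configured to reboot after a panic, for example through kdump or kernel.panic.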
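The relevant values and any currently blocked tasks can be inspected from a shell. This is a sketch under the assumption that the kernel exposes the hung task settings under /proc/sys/kernel (the timeout appears there as hung_task_timeout_secs) and that a procps-style ps is installed. The function names are hypothetical.

```shell
# Print the hung task settings found in a proc-style directory.
# $1 defaults to /proc/sys/kernel.
show_hung_task_settings() {
    dir="${1:-/proc/sys/kernel}"
    for name in hung_task_panic hung_task_timeout_secs; do
        if [ -r "$dir/$name" ]; then
            echo "kernel.$name = $(cat "$dir/$name")"
        else
            echo "kernel.$name = (not available)"
        fi
    done
}

# Keep the header row plus rows whose STAT column starts with D
# (uninterruptible sleep, the state the hung task detector watches).
filter_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List tasks that are currently in uninterruptible sleep.
list_blocked_tasks() {
    ps -eo pid,stat,wchan:32,comm | filter_uninterruptible
}
```

Running show_hung_task_settings followed by list_blocked_tasks during a path failure test shows which processes are at risk of exceeding the timeout.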

When all paths to a multipath device fail and that device is configured with indefinite queueing (no_path_retry queue or features "1 queue_if_no_path") or with a high queueing value (no_path_retry 200, for example), processes waiting on I/O from the affected device may block for longer than kernel.hung_task_timeout_secs, causing a panic.
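To find maps that currently have queueing enabled, the output of multipath -ll can be scanned for queue_if_no_path in the features string. The filter below is a sketch that assumes the usual layout, in which a header line containing the dm-N name is followed by a line beginning with size=. The layout can differ between releases, and the function name is hypothetical.

```shell
# Read `multipath -ll` style output on stdin and print the name of each map
# whose features string currently contains queue_if_no_path.
maps_with_queueing() {
    awk '
        / dm-[0-9]+/ { name = $1 }                     # map header line
        /^size=/ && /queue_if_no_path/ { print name }  # features line
    '
}

# Typical use, as root:
#   multipath -ll | maps_with_queueing
```

The flag can appear for finite no_path_retry values as well as for indefinite queueing, so a listed map still needs its no_path_retry setting checked in the multipath configuration.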

SBR

Storage

Product(s)

Red Hat Enterprise Linux

Category

Troubleshoot
