Multiple nodes in the cluster panicked after hung task warnings

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 5 (Update 5 or later)
  • Red Hat Enterprise Linux (RHEL) 6, 7, 8, 9
  • High Availability Add On
  • sysctl parameter kernel.hung_task_panic = 1 (/proc/sys/kernel/hung_task_panic contains value 1)
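To confirm whether a node matches this environment, the current hung-task settings can be read from /proc. A minimal check (kernels built without CONFIG_DETECT_HUNG_TASK will not expose these files):

```shell
# Show the current hung-task settings if the kernel exposes them;
# a missing file means the hung-task detector is not built into this kernel.
for f in hung_task_panic hung_task_timeout_secs; do
    p=/proc/sys/kernel/$f
    if [ -r "$p" ]; then
        printf '%s = %s\n' "$f" "$(cat "$p")"
    else
        echo "$f: not available on this kernel"
    fi
done
```

A value of `hung_task_panic = 1` together with the default 120-second `hung_task_timeout_secs` matches the log messages shown below.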

Issue

  • Multiple servers in a GFS2 cluster crashed
  • Several nodes panicked after fencing had been failing for several minutes.
  • Several nodes kernel panicked after reporting hung task warnings such as:
Jan  5 23:30:45 node1 kernel: INFO: task gfs2_quotad:25161 blocked for more than 120 seconds.
Jan  5 23:30:45 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  5 23:30:45 node1 kernel: gfs2_quotad   D ffffffff80156347     0 25161   1291         25169 25160 (L-TLB)
Jan  5 23:30:45 node1 kernel:  ffff81105c2cdcc0 0000000000000046 0000000000000000 ffff810838eb3000
Jan  5 23:30:45 node1 kernel:  0000000000000018 000000000000000a ffff81102eefb7e0 ffff81103f88c040
Jan  5 23:30:45 node1 kernel:  00040df462aee4fd 0000000000007eb8 ffff81102eefb9c8 00000028889bb97c
Jan  5 23:30:45 node1 kernel: Call Trace:
Jan  5 23:30:45 node1 kernel:  [<ffffffff889ba00f>] :dlm:dlm_lock+0x117/0x129
[...]

Resolution

Unset the sysctl parameter kernel.hung_task_panic (set it to 0), and/or correct whatever is causing processes to remain blocked for longer than kernel.hung_task_timeout_secs.

If post_fail_delay, a fencedevice delay, or any other condition in the cluster can cause fencing to take longer than kernel.hung_task_timeout_secs, adjust those timings so that a panic is not triggered every time fencing is needed.
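A sketch of the change, assuming root access and the traditional /etc/sysctl.conf used across the RHEL releases listed above:

```shell
# Disable panic-on-hung-task at runtime (takes effect immediately; root required)
sysctl -w kernel.hung_task_panic=0

# Persist the change across reboots by appending to /etc/sysctl.conf
# (on RHEL 7 and later, a drop-in file under /etc/sysctl.d/ also works)
echo 'kernel.hung_task_panic = 0' >> /etc/sysctl.conf
sysctl -p
```

The hung-task warnings themselves will still be logged; only the panic on detection is disabled.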

Root Cause

When the sysctl parameter kernel.hung_task_panic is set, the system panics if any process remains blocked in the kernel for longer than kernel.hung_task_timeout_secs. This is generally used only as a diagnostic procedure to capture a core when a process becomes blocked, but in a cluster it can be especially problematic.

If fencing fails in a cluster, operations such as GFS2 access or service management will block. If the fencing failure persists, the blocked processes can eventually trigger the hung_task_panic, causing nodes to panic. As such, it is best to avoid setting this parameter on cluster nodes.

Diagnostic Steps

  • Look in /var/log/messages for signs such as failed fencing or cluster membership events that may cause processes to block. If found, identify and resolve their cause.
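As a sketch of the searches involved, the following uses a hypothetical log excerpt; on a real node, run the same grep patterns against /var/log/messages itself:

```shell
# Hypothetical /var/log/messages excerpt for demonstration only
log=$(mktemp)
cat > "$log" <<'EOF'
Jan  5 23:25:12 node1 fenced[2301]: fence node2 failed
Jan  5 23:30:45 node1 kernel: INFO: task gfs2_quotad:25161 blocked for more than 120 seconds.
EOF

# Fencing failures that can leave GFS2/DLM operations blocked
grep -i 'fence.*failed' "$log"

# Hung-task warnings that precede the panic when hung_task_panic=1
grep 'blocked for more than' "$log"

rm -f "$log"
```

If both patterns appear together, with the fencing failure preceding the hung-task warnings, the sequence matches the root cause described above.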

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.