The hung_task timer fired on a task, but there are much older blocked tasks - why didn't they trip hung_task timer?

Solution Unverified - Updated 5 Aug 2024

Environment

Red Hat Enterprise Linux

Issue

One process tripped hung_task since it was in D-state (uninterruptible sleep - 'UN') for longer than hung_task_timeout_secs. However, there were other processes blocked in the D-state for much longer time. Why didn't the oldest task trip the hung task timer?
System panics with Kernel panic - not syncing: hung_task: blocked tasks but the task listed is not the oldest task.
How does the hung_task timer decide which task to check?

Resolution

Due to the frequency (every hung_task_timeout_secs seconds) of hung_task checks, often the task triggering a hung_task backtrace or panic will not be the oldest blocked task in the system. This is due to how the Linux kernel hung_task timer logic works.

Root Cause

The hung_task timer code only runs every hung_task_timeout_secs. Each time the check runs, it enumerates tasks in a specific order, based on the init task and other tasks linked from there. Depending on when the task was started, it may get checked earlier or later than other tasks.

Diagnostic Steps

The hung_task timer fired against khugepaged, pid 143

crash> log | tail -32
INFO: task khugepaged:143 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.8.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged    D 0000000000000006     0   143      2 0x00000000
 ffff8802f8af1c88 0000000000000046 0000000000000000 ffff8802f8af1be0
 ffff8800321958c0 0000000000000000 0000157d87c45d80 00000000000158c0
 0000000000000000 0000000101641712 ffff8802f8aefaf8 ffff8802f8af1fd8
Call Trace:
 [<ffffffff810873dc>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152cb05>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152cc96>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff810874f0>] ? process_timeout+0x0/0x10
 [<ffffffff81298a94>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8152c194>] ? down_read+0x24/0x30
 [<ffffffff811867cf>] khugepaged+0x18f/0x1310
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81186640>] ? khugepaged+0x0/0x1310
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 139, comm: khungtaskd Not tainted 2.6.32-504.8.1.el6.x86_64 #1
Call Trace:
 [<ffffffff815292d6>] ? panic+0xa7/0x16f
 [<ffffffff810e9f56>] ? watchdog+0x246/0x250
 [<ffffffff810e9d10>] ? watchdog+0x0/0x250
 [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

This task was clearly blocked longer than the hung_task_timeout_secs interval

crash> ps -m 143
[0 00:02:14.385] [UN]  PID: 143    TASK: ffff8802f8aef540  CPU: 6   COMMAND: "khugepaged"
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $47 = 120

However, there were tasks much older. Why didn't these trigger the hung_task timer sooner?

crash> ps -m | grep UN | tail 
[0 00:03:44.501] [UN]  PID: 27552  TASK: ffff880037b10aa0  CPU: 7   COMMAND: "RMI TCP Connect"
[0 00:03:47.100] [UN]  PID: 9472   TASK: ffff88024115a040  CPU: 1   COMMAND: "scheduler_Worke"
[0 00:03:47.755] [UN]  PID: 7043   TASK: ffff8802ee075500  CPU: 2   COMMAND: "flush-253:9"
[0 00:03:48.388] [UN]  PID: 1290   TASK: ffff8802f86b5540  CPU: 5   COMMAND: "jbd2/dm-2-8"
[0 00:03:48.745] [UN]  PID: 8135   TASK: ffff88027409caa0  CPU: 9   COMMAND: "Replication Thr"
[0 00:03:52.593] [UN]  PID: 7895   TASK: ffff8802bf2eeaa0  CPU: 2   COMMAND: "(VM Periodic Ta"
[0 00:03:53.271] [UN]  PID: 1274   TASK: ffff8802f797a040  CPU: 7   COMMAND: "jbd2/dm-13-8"
[0 00:03:53.387] [UN]  PID: 9439   TASK: ffff8802bf141500  CPU: 7   COMMAND: "pool-8-thread-1"
[0 00:03:53.990] [UN]  PID: 8136   TASK: ffff880274153500  CPU: 5   COMMAND: "profile.data"
[0 00:03:58.495] [UN]  PID: 27568  TASK: ffff8802f977a080  CPU: 2   COMMAND: "Thread-4055"

Linux kernel source code for hung_task

The hung_task timer code is called from check_hung_uninterruptible_tasks in khungtaskd kernel thread:

kernel/hung_task.c
198 static int watchdog(void *dummy)
199 {
200     set_user_nice(current, 0);
201 
202     for ( ; ; ) {
203             unsigned long timeout = sysctl_hung_task_timeout_secs;
204 
205             while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
206                     timeout = sysctl_hung_task_timeout_secs;
207 
208-->          check_hung_uninterruptible_tasks(timeout);
209     }
210 
211     return 0;
212 }

This means the hung_task check only runs every sysctl_hung_task_timeout_secs seconds.

Now what does the check actually do?

137 static void check_hung_uninterruptible_tasks(unsigned long timeout)
138 {
139     int max_count = sysctl_hung_task_check_count;
140     int batch_count = HUNG_TASK_BATCHING;
141     struct task_struct *g, *t;
142 
143     /*
144      * If the system crashed already then all bets are off,
145      * do not report extra hung tasks:
146      */
147     if (test_taint(TAINT_DIE) || did_panic)
148             return;
149 
150     rcu_read_lock();
151-->  do_each_thread(g, t) {
152             if (!--max_count)
153                     goto unlock;
154             if (!--batch_count) {
155                     batch_count = HUNG_TASK_BATCHING;
156                     rcu_lock_break(g, t);
157                     /* Exit if t or g was unhashed during refresh. */
158                     if (t->state == TASK_DEAD || g->state == TASK_DEAD)
159                             goto unlock;
160             }
161             /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
162             if (t->state == TASK_UNINTERRUPTIBLE)
163                     check_hung_task(t, timeout);
164     } while_each_thread(g, t);

We enumerate tasks with the do_each_thread() function. How does that work?

include/linux/sched.h
2383 #define next_task(p) \
2384    list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
2385 
2386 #define for_each_process(p) \
2387    for (p = &init_task ; (p = next_task(p)) != &init_task ; )
2388 
2389 extern bool current_is_single_threaded(void);
2390 
2391 /*
2392  * Careful: do_each_thread/while_each_thread is a double loop so
2393  *          'break' will not work as expected - use goto instead.
2394  */
2395 #define do_each_thread(g, t) \
2396--> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

This just starts with the init_task, and goes through the tasks, and does the hung_task check on them. So the order of tasks checked depends on the order of the linked list starting at init_task. So which one is the first one in the blocked list?

crash> p &init_task.tasks
$45 = (struct list_head *) 0xffffffff81a8d468 <init_task+1096>

crash> list -o task_struct.tasks -H 0xffffffff81a8d468 | awk '{ print "ps -m "$1 }' > do_each_thread.cmd
crash> < do_each_thread.cmd | grep UN
[0 00:02:14.385] [UN]  PID: 143    TASK: ffff8802f8aef540  CPU: 6   COMMAND: "khugepaged"
[0 00:03:53.271] [UN]  PID: 1274   TASK: ffff8802f797a040  CPU: 7   COMMAND: "jbd2/dm-13-8"
[0 00:03:42.380] [UN]  PID: 1282   TASK: ffff8802fa3daae0  CPU: 2   COMMAND: "jbd2/dm-9-8"
[0 00:03:48.388] [UN]  PID: 1290   TASK: ffff8802f86b5540  CPU: 5   COMMAND: "jbd2/dm-2-8"
[0 00:03:20.194] [UN]  PID: 1292   TASK: ffff8802f717c080  CPU: 3   COMMAND: "jbd2/dm-10-8"
[0 00:02:19.420] [UN]  PID: 1294   TASK: ffff8802f7232aa0  CPU: 9   COMMAND: "jbd2/dm-11-8"
[0 00:03:15.243] [UN]  PID: 1298   TASK: ffff8802fa789500  CPU: 4   COMMAND: "jbd2/dm-3-8"
[0 00:01:46.056] [UN]  PID: 1979   TASK: ffff8802f9726aa0  CPU: 10  COMMAND: "sssd_be"
[0 00:03:21.173] [UN]  PID: 5329   TASK: ffff8802fa3d6aa0  CPU: 7   COMMAND: "master"
[0 00:02:39.587] [UN]  PID: 5371   TASK: ffff8802ee0d8ae0  CPU: 9   COMMAND: "osad"
[0 00:02:25.044] [UN]  PID: 5430   TASK: ffff8802f3923540  CPU: 9   COMMAND: "nmon.sh"
[0 00:03:47.755] [UN]  PID: 7043   TASK: ffff8802ee075500  CPU: 2   COMMAND: "flush-253:9"
[0 00:03:21.288] [UN]  PID: 27094  TASK: ffff88012c8eeae0  CPU: 6   COMMAND: "pickup"
[0 00:03:25.641] [UN]  PID: 27492  TASK: ffff8801816c5540  CPU: 3   COMMAND: "vmstat"
[0 00:02:05.606] [UN]  PID: 27589  TASK: ffff88011f5e3540  CPU: 11  COMMAND: "logrotate"
[0 00:02:05.442] [UN]  PID: 27591  TASK: ffff8802418cb500  CPU: 4   COMMAND: "sadc"
[0 00:02:05.493] [UN]  PID: 27592  TASK: ffff88018fe07500  CPU: 7   COMMAND: "sh"
[0 00:02:05.451] [UN]  PID: 27593  TASK: ffff88018fed9540  CPU: 4   COMMAND: "date"

It's the one that panics the system due to exceeding the 120 seconds.

But is this abnormal or can it be explained by how the hung_task timer works? Recall the hung_task logic only fires every sysctl_hung_task_timeout_secs.

crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $46 = 120

Let's backup 120 seconds from now. This would mean the khugepaged would have only have been blocked 14 seconds, instead of 2 minutes 14 seconds:

[0 00:00:14.385] [UN]  PID: 143    TASK: ffff8802f8aef540  CPU: 6   COMMAND: "khugepaged"

Now consider the oldest task in the context of 120 seconds prior.

crash> ps -m | grep UN | tail -1
[0 00:03:58.495] [UN]  PID: 27568  TASK: ffff8802f977a080  CPU: 2   COMMAND: "Thread-4055"

Time prior to now when hung_task check fired, the above task would have been 120's earlier, so this means it would have been blocked for 2 minutes less than the current time:

[0 00:01:58.495] [UN]  PID: 27568  TASK: ffff8802f977a080  CPU: 2   COMMAND: "Thread-4055"

The above is just under the 2 minute threshold, so it would not have exceeded sysctl_hung_task_timeout_secs so the hung_task timer would not have fired.

SBR

Kernel

Product(s)

Red Hat Enterprise Linux

Components

kernel

Category

Troubleshoot

Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.