The hung_task timer fired on a task, but there are much older blocked tasks - why didn't they trip hung_task timer?
Environment
- Red Hat Enterprise Linux
Issue
- One process tripped hung_task since it was in D-state (uninterruptible sleep - 'UN') for longer than hung_task_timeout_secs. However, there were other processes blocked in the D-state for much longer time. Why didn't the oldest task trip the hung task timer?
- System panics with
Kernel panic - not syncing: hung_task: blocked tasksbut the task listed is not the oldest task. - How does the hung_task timer decide which task to check?
Resolution
- Due to the frequency (every hung_task_timeout_secs seconds) of hung_task checks, often the task triggering a hung_task backtrace or panic will not be the oldest blocked task in the system. This is due to how the Linux kernel hung_task timer logic works.
Root Cause
- The hung_task timer code only runs every
hung_task_timeout_secs. Each time the check runs, it enumerates tasks in a specific order, based on the init task and other tasks linked from there. Depending on when the task was started, it may get checked earlier or later than other tasks.
Diagnostic Steps
The hung_task timer fired against khugepaged, pid 143
crash> log | tail -32
INFO: task khugepaged:143 blocked for more than 120 seconds.
Not tainted 2.6.32-504.8.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged D 0000000000000006 0 143 2 0x00000000
ffff8802f8af1c88 0000000000000046 0000000000000000 ffff8802f8af1be0
ffff8800321958c0 0000000000000000 0000157d87c45d80 00000000000158c0
0000000000000000 0000000101641712 ffff8802f8aefaf8 ffff8802f8af1fd8
Call Trace:
[<ffffffff810873dc>] ? lock_timer_base+0x3c/0x70
[<ffffffff8152cb05>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff8152cc96>] rwsem_down_read_failed+0x26/0x30
[<ffffffff810874f0>] ? process_timeout+0x0/0x10
[<ffffffff81298a94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff8152c194>] ? down_read+0x24/0x30
[<ffffffff811867cf>] khugepaged+0x18f/0x1310
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81186640>] ? khugepaged+0x0/0x1310
[<ffffffff8109e66e>] kthread+0x9e/0xc0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109e5d0>] ? kthread+0x0/0xc0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 139, comm: khungtaskd Not tainted 2.6.32-504.8.1.el6.x86_64 #1
Call Trace:
[<ffffffff815292d6>] ? panic+0xa7/0x16f
[<ffffffff810e9f56>] ? watchdog+0x246/0x250
[<ffffffff810e9d10>] ? watchdog+0x0/0x250
[<ffffffff8109e66e>] ? kthread+0x9e/0xc0
[<ffffffff8100c20a>] ? child_rip+0xa/0x20
[<ffffffff8109e5d0>] ? kthread+0x0/0xc0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
This task was clearly blocked longer than the hung_task_timeout_secs interval
crash> ps -m 143
[0 00:02:14.385] [UN] PID: 143 TASK: ffff8802f8aef540 CPU: 6 COMMAND: "khugepaged"
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $47 = 120
However, there were tasks much older. Why didn't these trigger the hung_task timer sooner?
crash> ps -m | grep UN | tail
[0 00:03:44.501] [UN] PID: 27552 TASK: ffff880037b10aa0 CPU: 7 COMMAND: "RMI TCP Connect"
[0 00:03:47.100] [UN] PID: 9472 TASK: ffff88024115a040 CPU: 1 COMMAND: "scheduler_Worke"
[0 00:03:47.755] [UN] PID: 7043 TASK: ffff8802ee075500 CPU: 2 COMMAND: "flush-253:9"
[0 00:03:48.388] [UN] PID: 1290 TASK: ffff8802f86b5540 CPU: 5 COMMAND: "jbd2/dm-2-8"
[0 00:03:48.745] [UN] PID: 8135 TASK: ffff88027409caa0 CPU: 9 COMMAND: "Replication Thr"
[0 00:03:52.593] [UN] PID: 7895 TASK: ffff8802bf2eeaa0 CPU: 2 COMMAND: "(VM Periodic Ta"
[0 00:03:53.271] [UN] PID: 1274 TASK: ffff8802f797a040 CPU: 7 COMMAND: "jbd2/dm-13-8"
[0 00:03:53.387] [UN] PID: 9439 TASK: ffff8802bf141500 CPU: 7 COMMAND: "pool-8-thread-1"
[0 00:03:53.990] [UN] PID: 8136 TASK: ffff880274153500 CPU: 5 COMMAND: "profile.data"
[0 00:03:58.495] [UN] PID: 27568 TASK: ffff8802f977a080 CPU: 2 COMMAND: "Thread-4055"
Linux kernel source code for hung_task
The hung_task timer code is called from check_hung_uninterruptible_tasks in khungtaskd kernel thread:
kernel/hung_task.c
198 static int watchdog(void *dummy)
199 {
200 set_user_nice(current, 0);
201
202 for ( ; ; ) {
203 unsigned long timeout = sysctl_hung_task_timeout_secs;
204
205 while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
206 timeout = sysctl_hung_task_timeout_secs;
207
208--> check_hung_uninterruptible_tasks(timeout);
209 }
210
211 return 0;
212 }
This means the hung_task check only runs every sysctl_hung_task_timeout_secs seconds.
Now what does the check actually do?
137 static void check_hung_uninterruptible_tasks(unsigned long timeout)
138 {
139 int max_count = sysctl_hung_task_check_count;
140 int batch_count = HUNG_TASK_BATCHING;
141 struct task_struct *g, *t;
142
143 /*
144 * If the system crashed already then all bets are off,
145 * do not report extra hung tasks:
146 */
147 if (test_taint(TAINT_DIE) || did_panic)
148 return;
149
150 rcu_read_lock();
151--> do_each_thread(g, t) {
152 if (!--max_count)
153 goto unlock;
154 if (!--batch_count) {
155 batch_count = HUNG_TASK_BATCHING;
156 rcu_lock_break(g, t);
157 /* Exit if t or g was unhashed during refresh. */
158 if (t->state == TASK_DEAD || g->state == TASK_DEAD)
159 goto unlock;
160 }
161 /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
162 if (t->state == TASK_UNINTERRUPTIBLE)
163 check_hung_task(t, timeout);
164 } while_each_thread(g, t);
We enumerate tasks with the do_each_thread() function. How does that work?
include/linux/sched.h
2383 #define next_task(p) \
2384 list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
2385
2386 #define for_each_process(p) \
2387 for (p = &init_task ; (p = next_task(p)) != &init_task ; )
2388
2389 extern bool current_is_single_threaded(void);
2390
2391 /*
2392 * Careful: do_each_thread/while_each_thread is a double loop so
2393 * 'break' will not work as expected - use goto instead.
2394 */
2395 #define do_each_thread(g, t) \
2396--> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
This just starts with the init_task, and goes through the tasks, and does the hung_task check on them. So the order of tasks checked depends on the order of the linked list starting at init_task. So which one is the first one in the blocked list?
crash> p &init_task.tasks
$45 = (struct list_head *) 0xffffffff81a8d468 <init_task+1096>
crash> list -o task_struct.tasks -H 0xffffffff81a8d468 | awk '{ print "ps -m "$1 }' > do_each_thread.cmd
crash> < do_each_thread.cmd | grep UN
[0 00:02:14.385] [UN] PID: 143 TASK: ffff8802f8aef540 CPU: 6 COMMAND: "khugepaged"
[0 00:03:53.271] [UN] PID: 1274 TASK: ffff8802f797a040 CPU: 7 COMMAND: "jbd2/dm-13-8"
[0 00:03:42.380] [UN] PID: 1282 TASK: ffff8802fa3daae0 CPU: 2 COMMAND: "jbd2/dm-9-8"
[0 00:03:48.388] [UN] PID: 1290 TASK: ffff8802f86b5540 CPU: 5 COMMAND: "jbd2/dm-2-8"
[0 00:03:20.194] [UN] PID: 1292 TASK: ffff8802f717c080 CPU: 3 COMMAND: "jbd2/dm-10-8"
[0 00:02:19.420] [UN] PID: 1294 TASK: ffff8802f7232aa0 CPU: 9 COMMAND: "jbd2/dm-11-8"
[0 00:03:15.243] [UN] PID: 1298 TASK: ffff8802fa789500 CPU: 4 COMMAND: "jbd2/dm-3-8"
[0 00:01:46.056] [UN] PID: 1979 TASK: ffff8802f9726aa0 CPU: 10 COMMAND: "sssd_be"
[0 00:03:21.173] [UN] PID: 5329 TASK: ffff8802fa3d6aa0 CPU: 7 COMMAND: "master"
[0 00:02:39.587] [UN] PID: 5371 TASK: ffff8802ee0d8ae0 CPU: 9 COMMAND: "osad"
[0 00:02:25.044] [UN] PID: 5430 TASK: ffff8802f3923540 CPU: 9 COMMAND: "nmon.sh"
[0 00:03:47.755] [UN] PID: 7043 TASK: ffff8802ee075500 CPU: 2 COMMAND: "flush-253:9"
[0 00:03:21.288] [UN] PID: 27094 TASK: ffff88012c8eeae0 CPU: 6 COMMAND: "pickup"
[0 00:03:25.641] [UN] PID: 27492 TASK: ffff8801816c5540 CPU: 3 COMMAND: "vmstat"
[0 00:02:05.606] [UN] PID: 27589 TASK: ffff88011f5e3540 CPU: 11 COMMAND: "logrotate"
[0 00:02:05.442] [UN] PID: 27591 TASK: ffff8802418cb500 CPU: 4 COMMAND: "sadc"
[0 00:02:05.493] [UN] PID: 27592 TASK: ffff88018fe07500 CPU: 7 COMMAND: "sh"
[0 00:02:05.451] [UN] PID: 27593 TASK: ffff88018fed9540 CPU: 4 COMMAND: "date"
It's the one that panics the system due to exceeding the 120 seconds.
But is this abnormal or can it be explained by how the hung_task timer works? Recall the hung_task logic only fires every sysctl_hung_task_timeout_secs.
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $46 = 120
Let's backup 120 seconds from now. This would mean the khugepaged would have only have been blocked 14 seconds, instead of 2 minutes 14 seconds:
[0 00:00:14.385] [UN] PID: 143 TASK: ffff8802f8aef540 CPU: 6 COMMAND: "khugepaged"
Now consider the oldest task in the context of 120 seconds prior.
crash> ps -m | grep UN | tail -1
[0 00:03:58.495] [UN] PID: 27568 TASK: ffff8802f977a080 CPU: 2 COMMAND: "Thread-4055"
Time prior to now when hung_task check fired, the above task would have been 120's earlier, so this means it would have been blocked for 2 minutes less than the current time:
[0 00:01:58.495] [UN] PID: 27568 TASK: ffff8802f977a080 CPU: 2 COMMAND: "Thread-4055"
The above is just under the 2 minute threshold, so it would not have exceeded sysctl_hung_task_timeout_secs so the hung_task timer would not have fired.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.