VM panics with blocked tasks due to slow or stalled I/O
Environment
- Red Hat Enterprise Linux 5, 6, 7, 8, 9
- VM guest
Issue
- System rebooted due to blocked tasks.
- Frequent unexpected reboots. Trending shows the system rebooting once every 2-3 days, usually at night around 12-3 am (GMT+0800). No abnormalities found in the logs, no hardware errors, and no application errors. The system is in a cloud environment.
- VM guest panicked with the following:
INFO: task kjournald:911 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff81000901d7a0 0 911 279 936 873 (L-TLB)
ffff81083de83cf0 0000000000000046 ffff81083fafc080 ffff81083e0cc540
0000000dd7e412dc 000000000000000a ffff81083fafc080 ffff81011cfc8080
000000115554a5db 00000000000020cd ffff81083fafc268 000000033e6daa38
Call Trace:
[<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff800647ea>] io_schedule+0x3f/0x67
[<ffffffff80015509>] sync_buffer+0x3b/0x3f
[<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
[<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8004b6b6>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
[<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff88037512>] :jbd:kjournald+0x0/0x213
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032bdc>] kthread+0xfe/0x132
[<ffffffff8005efb1>] child_rip+0xa/0x11
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032ade>] kthread+0x0/0x132
[<ffffffff8005efa7>] child_rip+0x0/0x11
Kernel panic - not syncing: hung_task: blocked tasks
Resolution
- Increase the hung task watchdog threshold so that it is at least twice the I/O timeout value, for example:
# cat /sys/block/sda/device/timeout
180
# # 180 x 2 = 360
# echo 360 > /proc/sys/kernel/hung_task_timeout_secs
- Unless it is required, disable the hung_task_panic setting to avoid unnecessary panics:
# echo 0 > /proc/sys/kernel/hung_task_panic
- Check whether there is a systemic storage-side issue within the hypervisor causing frequent I/O latency for the virtual machine.
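The echo commands above change the running kernel only and are lost at reboot. To persist the settings, the equivalent sysctl keys can be added to /etc/sysctl.conf (RHEL 5/6) or a file under /etc/sysctl.d/ (RHEL 7 and later) — a sketch, using the 360-second value from the example above:

```
kernel.hung_task_timeout_secs = 360
kernel.hung_task_panic = 0
```

Load the settings with `sysctl -p` (or `sysctl --system` on RHEL 7 and later).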
Root Cause
This panic was caused by delays within the hypervisor in completing outstanding I/O, resulting in some tasks being blocked for more than 2 minutes waiting for I/O to complete. The I/O timeout value (180s) was larger than the hung task watchdog timeout value (120s), falsely triggering the hung task logic. The hung task watchdog detected the blocked tasks and, because the hung_task_panic setting was enabled, panicked the system. If hung_task_timeout_secs had been set to a value greater than the I/O timeout value, a panic would have been much less likely.
Beyond individual I/Os taking too long to complete within the hypervisor, other causes of long-running blocked tasks include flushing large amounts of dirty filesystem pages in conjunction with file journaling, resulting in disk congestion. It is often not possible to determine from a vmcore exactly why I/O was delayed. However, the most common cause is a hung task timeout value that is less than the device I/O timeout value.
The default I/O timeout value is 60 seconds and the default hung_task_timeout_secs is 120s, or twice the I/O timeout value. This allows two full I/O retries with I/O timeouts before the task is declared hung. If the I/O timeout value is increased, the hung_task_timeout_secs should also be correspondingly increased.
With a VM guest I/O timeout value of 180 seconds and a hung task watchdog threshold of 120 seconds, I/O cannot use the full 180s timeout period before the blocked task event is output to the console and messages file and, if hung_task_panic is enabled, the system panics. With the I/O timeout period set to 180 seconds, the hung task timeout needs to be greater than 180s to give the I/O timeout logic a chance to work before the hung task watchdog potentially panics the system.
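The sizing rule above can be sketched as a small shell check. The hard-coded values are the ones from this case; on a live system they would instead be read from /sys/block/sda/device/timeout and /proc/sys/kernel/hung_task_timeout_secs:

```shell
#!/bin/sh
# Example values from this case; replace with live reads of
# /sys/block/<dev>/device/timeout and /proc/sys/kernel/hung_task_timeout_secs.
io_timeout=180      # device I/O timeout in seconds
hung_timeout=120    # hung task watchdog threshold in seconds

if [ "$hung_timeout" -le "$io_timeout" ]; then
    # The watchdog can fire before the device has a chance to time out the I/O.
    echo "hung_task_timeout_secs ($hung_timeout) <= I/O timeout ($io_timeout): raise it to $(( io_timeout * 2 ))"
else
    echo "hung_task_timeout_secs ($hung_timeout) > I/O timeout ($io_timeout): OK"
fi
```

With the values from this case, the check reports that the threshold should be raised to 360 seconds.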
Diagnostic Steps
Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-194.el5/vmlinux
DUMPFILE: /cores/retrace/tasks/246254048/crash/vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Mon Dec 9 16:43:16 2013
UPTIME: 00:04:01
LOAD AVERAGE: 1.89, 0.89, 0.34
TASKS: 230
NODENAME: <nodename>
RELEASE: 2.6.18-194.el5
VERSION: #1 SMP Tue Mar 16 21:52:39 EDT 2010
MACHINE: x86_64 (2264 Mhz)
MEMORY: 31.5 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
INFO: task kjournald:911 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff81000901d7a0 0 911 279 936 873 (L-TLB)
ffff81083de83cf0 0000000000000046 ffff81083fafc080 ffff81083e0cc540
0000000dd7e412dc 000000000000000a ffff81083fafc080 ffff81011cfc8080
000000115554a5db 00000000000020cd ffff81083fafc268 000000033e6daa38
Call Trace:
[<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff800647ea>] io_schedule+0x3f/0x67
[<ffffffff80015509>] sync_buffer+0x3b/0x3f
[<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
[<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8004b6b6>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
[<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff88037512>] :jbd:kjournald+0x0/0x213
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032bdc>] kthread+0xfe/0x132
[<ffffffff8005efb1>] child_rip+0xa/0x11
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032ade>] kthread+0x0/0x132
[<ffffffff8005efa7>] child_rip+0x0/0x11
Kernel panic - not syncing: hung_task: blocked tasks
The system panicked because hung_task_panic was set and the hung task watchdog detected a task that was blocked for more than 2 minutes.
crash> set 911
PID: 911
COMMAND: "kjournald"
TASK: ffff81083fafc080 [THREAD_INFO: ffff81083de82000]
CPU: 3
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
#0 [ffff81083de83c20] schedule at ffffffff80063f96
#1 [ffff81083de83cf8] io_schedule at ffffffff800647ea
#2 [ffff81083de83d18] sync_buffer at ffffffff80015509
#3 [ffff81083de83d28] __wait_on_bit at ffffffff80064a16
#4 [ffff81083de83d68] out_of_line_wait_on_bit at ffffffff80064ab0
#5 [ffff81083de83dd8] journal_commit_transaction at ffffffff880339a5 [jbd]
#6 [ffff81083de83e78] kjournald at ffffffff880375d3 [jbd]
#7 [ffff81083de83ee8] kthread at ffffffff80032bdc
#8 [ffff81083de83f48] kernel_thread at ffffffff8005efb1
crash> task -R last_ran
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
last_ran = 74446054875,
crash> runq | grep "CPU 3"
CPU 3 RUNQUEUE: ffff81000901d7a0
crash> rq.timestamp_last_tick ffff81000901d7a0
timestamp_last_tick = 247583975453
crash> pd (247583975453-74446054875)
$1 = 173137920578
crash> pd (247583975453-74446054875)/1000000000
$2 = 173
It's been blocked for 173 seconds.
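The subtraction done with crash's pd commands above can be reproduced with ordinary shell arithmetic; both scheduler timestamps are in nanoseconds:

```shell
# Scheduler timestamps taken from the vmcore, in nanoseconds.
last_ran=74446054875        # task.last_ran for kjournald (PID 911)
last_tick=247583975453      # rq.timestamp_last_tick for CPU 3's runqueue
blocked_ns=$(( last_tick - last_ran ))
blocked_s=$(( blocked_ns / 1000000000 ))
echo "blocked for ${blocked_s} seconds"   # → blocked for 173 seconds
```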
It's a journal task that is waiting for all previously submitted I/O to complete. We'll need to check how much I/O is outstanding and if/why the requests are not completing.
Continuing the analysis, tracing the I/O path:
Pull the journal pointer off the stack:
crash> whatis journal_commit_transaction
void journal_commit_transaction(journal_t *);
crash> dis -rl ffffffff880375d3
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/jbd/journal.c: 151
0xffffffff880375cb <kjournald+185>: mov %rbx,%rdi
0xffffffff880375ce <kjournald+188>: callq 0xffffffff88033462 <journal_commit_transaction>
0xffffffff880375d3 <kjournald+193>: jmp 0xffffffff8803759d <kjournald+139>
crash> dis -rl ffffffff880339a5
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/jbd/commit.c: 303
0xffffffff88033462 <journal_commit_transaction>: push %r15
0xffffffff88033464 <journal_commit_transaction+2>: push %r14
0xffffffff88033466 <journal_commit_transaction+4>: push %r13
0xffffffff88033468 <journal_commit_transaction+6>: push %r12
0xffffffff8803346a <journal_commit_transaction+8>: push %rbp
0xffffffff8803346b <journal_commit_transaction+9>: mov %rdi,%rbp
0xffffffff8803346e <journal_commit_transaction+12>: push %rbx
...
crash> bt -f
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
...
#5 [ffff81083de83dd8] journal_commit_transaction at ffffffff880339a5 [jbd]
ffff81083de83de0: 0000000dd7e891c7 ffff81083e615000
ffff81083de83df0: ffff810800000000 0000001200000000
ffff81083de83e00: 0000000300000000 ffffffff80309b60
ffff81083de83e10: 0000000000000282 ffffffff8003ddd5
ffff81083de83e20: 00000000ffffffff ffff81083d96e590
ffff81083de83e30: 0000000000000282 ffffffff8004b6b6
ffff81083de83e40: 0000000000000246 ffff81083d96e400
ffff81083de83e50: ffff81083d96e424 ffff81083d96e590
ffff81083de83e60: ffff81083d96e4c0 0000000000000000
ffff81083de83e70: ffff81083d96e490 ffffffff880375d3
#6 [ffff81083de83e78] kjournald at ffffffff880375d3 [jbd]
...
crash> journal_t.j_dev,j_fs_dev ffff81083d96e400
j_dev = 0xffff81083e24b200
j_fs_dev = 0xffff81083e24b200
crash> block_device.bd_disk 0xffff81083e24b200
bd_disk = 0xffff81083e1f7a00
crash> gendisk 0xffff81083e1f7a00
struct gendisk {
major = 253,
first_minor = 0,
minors = 1,
disk_name = "dm-0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
part = 0x0 <__mod_license950>,
part_uevent_suppress = 0,
fops = 0xffffffff88159e00 <dm_blk_dops>,
queue = 0xffff81083f1a2048,
private_data = 0xffff81083e1f7c00,
...
stamp = 4294807248,
in_flight = 12,
dkstats = 0x7ef7c196cfff
}
12 I/Os still in flight on this device (dm-0).
crash> request_queue.in_flight,nr_sorted,queuedata 0xffff81083f1a2048
in_flight = 0
nr_sorted = 0
queuedata = 0xffff81083e1f7c00
crash> mapped_device.map 0xffff81083e1f7c00
map = 0xffff81083e1f7600
crash> dm_table.num_targets,targets 0xffff81083e1f7600
num_targets = 1
targets = 0xffffc20000031080
crash> dm_target.type,private 0xffffc20000031080
type = 0xffff81083e567980
private = 0xffff81083e143d00
crash> target_type.name 0xffff81083e567980
name = 0xffffffff881511cc "linear"
crash> linear_c.dev 0xffff81083e143d00
dev = 0xffff81083e53a580
crash> dm_dev.bdev 0xffff81083e53a580
bdev = 0xffff81083e09e800
crash> block_device.bd_disk 0xffff81083e09e800
bd_disk = 0xffff81083e630000
crash> gendisk 0xffff81083e630000
struct gendisk {
major = 8,
first_minor = 0,
minors = 16,
disk_name = "sda\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
part = 0xffff81083f1255c0,
part_uevent_suppress = 0,
fops = 0xffffffff880ac8e0 <sd_fops>,
queue = 0xffff81083e78d928,
private_data = 0xffff81083e630200,
...
stamp = 4294908305,
in_flight = 2,
dkstats = 0x7ef7c1a4d7ff
}
The I/O was issued to device /dev/sda and there are currently 2 requests in flight.
crash> request_queue.rq,in_flight,nr_sorted 0xffff81083e78d928
rq = {
count = {0, 2},
starved = {0, 0},
elvpriv = 2,
...
in_flight = 1
nr_sorted = 1
One request has been sent to the sda device and one is still in the scheduler.
crash> request_queue.queuedata 0xffff81083e78d928
queuedata = 0xffff81083f1ba000
crash> scsi_device 0xffff81083f1ba000
struct scsi_device {
host = 0xffff81083e5b4000,
...
vendor = 0xffff81083e53ac88 "VMware Virtual disk 1.0 t0/target0:0:0",
model = 0xffff81083e53ac90 "Virtual disk 1.0 t0/target0:0:0",
...
iorequest_cnt = {
counter = 7914
},
iodone_cnt = {
counter = 7913
},
ioerr_cnt = {
counter = 449
},
timeout = 180000,
It's a virtual device in a virtual machine; if the host system is busy, it could starve the guest of CPU time and stall the completion of I/O.
The timeout for the I/O request is 180 seconds, yet the hung task watchdog threshold is set to 120 seconds. This doesn't make much sense, since it doesn't give the SCSI device a chance to time out the request. I highly recommend increasing the hung task watchdog timeout to around 300 seconds:
# echo 300 > /proc/sys/kernel/hung_task_timeout_secs
Unless it is required, I also suggest disabling the hung_task_panic setting to avoid unnecessary panics:
# echo 0 > /proc/sys/kernel/hung_task_panic
Looking at when the I/O requests were last issued:
crash> jiffies
jiffies = $7 = 4294908305
From dm-0:
crash> pd (4294908305-4294807248)/1000
$10 = 101
An I/O request was issued to dm-0 101 seconds ago.
From sda:
crash> pd 4294908305-4294908305
$8 = 0
An I/O request was issued to sda very recently (less than a second ago).
It does not appear that any of the I/Os are getting stuck within the guest itself, so this appears to be a case of disk congestion that may have been exacerbated by a busy host system.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.