VM panics with blocked tasks due to slow or stalled I/O
Environment
- Red Hat Enterprise Linux 5, 6, 7, 8, 9
- VM guest
Issue
- System rebooted due to blocked tasks.
- Frequent unexpected reboots. Trending shows the system rebooting once every 2-3 days, usually at night around 12-3 am (GMT+0800). No abnormalities found in the logs, no hardware errors, and no application errors. The system is in a cloud environment.
- VM guest panicked with the following:
INFO: task kjournald:911 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff81000901d7a0 0 911 279 936 873 (L-TLB)
ffff81083de83cf0 0000000000000046 ffff81083fafc080 ffff81083e0cc540
0000000dd7e412dc 000000000000000a ffff81083fafc080 ffff81011cfc8080
000000115554a5db 00000000000020cd ffff81083fafc268 000000033e6daa38
Call Trace:
[<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff800647ea>] io_schedule+0x3f/0x67
[<ffffffff80015509>] sync_buffer+0x3b/0x3f
[<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
[<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8004b6b6>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
[<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff88037512>] :jbd:kjournald+0x0/0x213
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032bdc>] kthread+0xfe/0x132
[<ffffffff8005efb1>] child_rip+0xa/0x11
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032ade>] kthread+0x0/0x132
[<ffffffff8005efa7>] child_rip+0x0/0x11
Kernel panic - not syncing: hung_task: blocked tasks
Resolution
- Increase the hung task watchdog threshold so that it is at least twice the I/O timeout value, for example:
# cat /sys/block/sda/device/timeout
180
# # 180 x 2 = 360
# echo 360 > /proc/sys/kernel/hung_task_timeout_secs
- Unless it is required, disable the hung_task_panic setting to avoid unnecessary panics:
# echo 0 > /proc/sys/kernel/hung_task_panic
- Check whether there is a systemic storage-side issue within the hypervisor causing frequent I/O latency for the virtual machine.
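The echo commands above change the running kernel only and are lost at reboot. To persist the settings, the equivalent sysctl keys can be added to /etc/sysctl.conf (RHEL 5/6) or a file under /etc/sysctl.d/ (RHEL 7 and later) — a sketch, using the 360-second value from the example above:

```
kernel.hung_task_timeout_secs = 360
kernel.hung_task_panic = 0
```

Load the settings with `sysctl -p` (or `sysctl --system` on RHEL 7 and later).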
Root Cause
This panic was caused by delays within the hypervisor in completing outstanding I/O, resulting in some tasks being blocked for more than 2 minutes waiting for I/O to complete. The I/O timeout value (180s) was larger than the hung task watchdog timeout value (120s), falsely triggering the hung task logic. The hung task watchdog detected the blocked tasks and, because the hung_task_panic setting was enabled, panicked the system. If hung_task_timeout_secs had been set to a value greater than the I/O timeout value, a panic would have been much less likely.
Beyond individual I/Os taking too long to complete within the hypervisor, other causes of long-running blocked tasks include flushing large amounts of dirty filesystem pages in conjunction with file journaling, resulting in disk congestion. It is often not possible to determine from a vmcore exactly why I/O was delayed. However, the most common cause is a hung task timeout value that is less than the device I/O timeout value.
The default I/O timeout value is 60 seconds and the default hung_task_timeout_secs is 120s, or twice the I/O timeout value. This allows two full I/O retries with I/O timeouts before the task is declared hung. If the I/O timeout value is increased, the hung_task_timeout_secs should also be correspondingly increased.
With a VM guest I/O timeout value of 180 seconds and a hung task watchdog threshold of 120 seconds, I/O cannot use the full 180s timeout period before the blocked task event is output to the console and messages file and, if hung_task_panic is enabled, the system panics. With the I/O timeout period set to 180 seconds, the hung task timeout needs to be greater than 180s to give the I/O timeout logic a chance to work before the hung task watchdog potentially panics the system.
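The sizing rule above can be sketched as a small shell check. The hard-coded values are the ones from this case; on a live system they would instead be read from /sys/block/sda/device/timeout and /proc/sys/kernel/hung_task_timeout_secs:

```shell
#!/bin/sh
# Example values from this case; replace with live reads of
# /sys/block/<dev>/device/timeout and /proc/sys/kernel/hung_task_timeout_secs.
io_timeout=180      # device I/O timeout in seconds
hung_timeout=120    # hung task watchdog threshold in seconds

if [ "$hung_timeout" -le "$io_timeout" ]; then
    # The watchdog can fire before the device has a chance to time out the I/O.
    echo "hung_task_timeout_secs ($hung_timeout) <= I/O timeout ($io_timeout): raise it to $(( io_timeout * 2 ))"
else
    echo "hung_task_timeout_secs ($hung_timeout) > I/O timeout ($io_timeout): OK"
fi
```

With the values from this case, the check reports that the threshold should be raised to 360 seconds.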
Diagnostic Steps
Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-194.el5/vmlinux
DUMPFILE: /cores/retrace/tasks/246254048/crash/vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Mon Dec 9 16:43:16 2013
UPTIME: 00:04:01
LOAD AVERAGE: 1.89, 0.89, 0.34
TASKS: 230
NODENAME: <nodename>
RELEASE: 2.6.18-194.el5
VERSION: #1 SMP Tue Mar 16 21:52:39 EDT 2010
MACHINE: x86_64 (2264 Mhz)
MEMORY: 31.5 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
INFO: task kjournald:911 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff81000901d7a0 0 911 279 936 873 (L-TLB)
ffff81083de83cf0 0000000000000046 ffff81083fafc080 ffff81083e0cc540
0000000dd7e412dc 000000000000000a ffff81083fafc080 ffff81011cfc8080
000000115554a5db 00000000000020cd ffff81083fafc268 000000033e6daa38
Call Trace:
[<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff800647ea>] io_schedule+0x3f/0x67
[<ffffffff80015509>] sync_buffer+0x3b/0x3f
[<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
[<ffffffff800154ce>] sync_buffer+0x0/0x3f
[<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
[<ffffffff880339a5>] :jbd:journal_commit_transaction+0x543/0x1066
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8004b6b6>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
[<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff88037512>] :jbd:kjournald+0x0/0x213
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032bdc>] kthread+0xfe/0x132
[<ffffffff8005efb1>] child_rip+0xa/0x11
[<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032ade>] kthread+0x0/0x132
[<ffffffff8005efa7>] child_rip+0x0/0x11
Kernel panic - not syncing: hung_task: blocked tasks
The system panicked because hung_task_panic was set and the hung task watchdog detected a task that was blocked for more than 2 minutes.
crash> set 911
PID: 911
COMMAND: "kjournald"
TASK: ffff81083fafc080 [THREAD_INFO: ffff81083de82000]
CPU: 3
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
#0 [ffff81083de83c20] schedule at ffffffff80063f96
#1 [ffff81083de83cf8] io_schedule at ffffffff800647ea
#2 [ffff81083de83d18] sync_buffer at ffffffff80015509
#3 [ffff81083de83d28] __wait_on_bit at ffffffff80064a16
#4 [ffff81083de83d68] out_of_line_wait_on_bit at ffffffff80064ab0
#5 [ffff81083de83dd8] journal_commit_transaction at ffffffff880339a5 [jbd]
#6 [ffff81083de83e78] kjournald at ffffffff880375d3 [jbd]
#7 [ffff81083de83ee8] kthread at ffffffff80032bdc
#8 [ffff81083de83f48] kernel_thread at ffffffff8005efb1
crash> task -R last_ran
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
last_ran = 74446054875,
crash> runq | grep "CPU 3"
CPU 3 RUNQUEUE: ffff81000901d7a0
crash> rq.timestamp_last_tick ffff81000901d7a0
timestamp_last_tick = 247583975453
crash> pd (247583975453-74446054875)
$1 = 173137920578
crash> pd (247583975453-74446054875)/1000000000
$2 = 173
It's been blocked for 173 seconds.
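The subtraction done with crash's pd commands above can be reproduced with ordinary shell arithmetic; both scheduler timestamps are in nanoseconds:

```shell
# Scheduler timestamps taken from the vmcore, in nanoseconds.
last_ran=74446054875        # task.last_ran for kjournald (PID 911)
last_tick=247583975453      # rq.timestamp_last_tick for CPU 3's runqueue
blocked_ns=$(( last_tick - last_ran ))
blocked_s=$(( blocked_ns / 1000000000 ))
echo "blocked for ${blocked_s} seconds"   # → blocked for 173 seconds
```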
It's a journal task that is waiting for all previously submitted I/O to complete. We'll need to check how much I/O is outstanding and if/why the requests are not completing.
Continuing the analysis, tracing the I/O path:
Pull the journal pointer off the stack:
crash> whatis journal_commit_transaction
void journal_commit_transaction(journal_t *);
crash> dis -rl ffffffff880375d3
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/jbd/journal.c: 151
0xffffffff880375cb <kjournald+185>: mov %rbx,%rdi
0xffffffff880375ce <kjournald+188>: callq 0xffffffff88033462 <journal_commit_transaction>
0xffffffff880375d3 <kjournald+193>: jmp 0xffffffff8803759d <kjournald+139>
crash> dis -rl ffffffff880339a5
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/jbd/commit.c: 303
0xffffffff88033462 <journal_commit_transaction>: push %r15
0xffffffff88033464 <journal_commit_transaction+2>: push %r14
0xffffffff88033466 <journal_commit_transaction+4>: push %r13
0xffffffff88033468 <journal_commit_transaction+6>: push %r12
0xffffffff8803346a <journal_commit_transaction+8>: push %rbp
0xffffffff8803346b <journal_commit_transaction+9>: mov %rdi,%rbp
0xffffffff8803346e <journal_commit_transaction+12>: push %rbx
...
crash> bt -f
PID: 911 TASK: ffff81083fafc080 CPU: 3 COMMAND: "kjournald"
...
#5 [ffff81083de83dd8] journal_commit_transaction at ffffffff880339a5 [jbd]
ffff81083de83de0: 0000000dd7e891c7 ffff81083e615000
ffff81083de83df0: ffff810800000000 0000001200000000
ffff81083de83e00: 0000000300000000 ffffffff80309b60
ffff81083de83e10: 0000000000000282 ffffffff8003ddd5
ffff81083de83e20: 00000000ffffffff ffff81083d96e590
ffff81083de83e30: 0000000000000282 ffffffff8004b6b6
ffff81083de83e40: 0000000000000246 ffff81083d96e400
ffff81083de83e50: ffff81083d96e424 ffff81083d96e590
ffff81083de83e60: ffff81083d96e4c0 0000000000000000
ffff81083de83e70: ffff81083d96e490 ffffffff880375d3
#6 [ffff81083de83e78] kjournald at ffffffff880375d3 [jbd]
...
crash> journal_t.j_dev,j_fs_dev ffff81083d96e400
j_dev = 0xffff81083e24b200
j_fs_dev = 0xffff81083e24b200
crash> block_device.bd_disk 0xffff81083e24b200
bd_disk = 0xffff81083e1f7a00
crash> gendisk 0xffff81083e1f7a00
struct gendisk {
major = 253,
first_minor = 0,
minors = 1,
disk_name = "dm-0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
part = 0x0 <__mod_license950>,
part_uevent_suppress = 0,
fops = 0xffffffff88159e00 <dm_blk_dops>,
queue = 0xffff81083f1a2048,
private_data = 0xffff81083e1f7c00,
...
stamp = 4294807248,
in_flight = 12,
dkstats = 0x7ef7c196cfff
}
12 I/Os still in flight on this device (dm-0).
crash> request_queue.in_flight,nr_sorted,queuedata 0xffff81083f1a2048
in_flight = 0
nr_sorted = 0
queuedata = 0xffff81083e1f7c00
crash> mapped_device.map 0xffff81083e1f7c00
map = 0xffff81083e1f7600
crash> dm_table.num_targets,targets 0xffff81083e1f7600
num_targets = 1
targets = 0xffffc20000031080
crash> dm_target.type,private 0xffffc20000031080
type = 0xffff81083e567980
private = 0xffff81083e143d00
crash> target_type.name 0xffff81083e567980
name = 0xffffffff881511cc "linear"
crash> linear_c.dev 0xffff81083e143d00
dev = 0xffff81083e53a580
crash> dm_dev.bdev 0xffff81083e53a580
bdev = 0xffff81083e09e800
crash> block_device.bd_disk 0xffff81083e09e800
bd_disk = 0xffff81083e630000
crash> gendisk 0xffff81083e630000
struct gendisk {
major = 8,
first_minor = 0,
minors = 16,
disk_name = "sda\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
part = 0xffff81083f1255c0,
part_uevent_suppress = 0,
fops = 0xffffffff880ac8e0 <sd_fops>,
queue = 0xffff81083e78d928,
private_data = 0xffff81083e630200,
...
stamp = 4294908305,
in_flight = 2,
dkstats = 0x7ef7c1a4d7ff
}
The I/O was issued to device /dev/sda and there are currently 2 requests in flight.
crash> request_queue.rq,in_flight,nr_sorted 0xffff81083e78d928
rq = {
count = {0, 2},
starved = {0, 0},
elvpriv = 2,
...
in_flight = 1
nr_sorted = 1
One request has been sent to the sda device and one is still in the scheduler.
crash> request_queue.queuedata 0xffff81083e78d928
queuedata = 0xffff81083f1ba000
crash> scsi_device 0xffff81083f1ba000
struct scsi_device {
host = 0xffff81083e5b4000,
...
vendor = 0xffff81083e53ac88 "VMware Virtual disk 1.0 t0/target0:0:0",
model = 0xffff81083e53ac90 "Virtual disk 1.0 t0/target0:0:0",
...
iorequest_cnt = {
counter = 7914
},
iodone_cnt = {
counter = 7913
},
ioerr_cnt = {
counter = 449
},
timeout = 180000,
It's a virtual device in a virtual machine; if the host system is busy, it could starve the guest of CPU time and stall the completion of I/O.
The timeout for the I/O request is 180 seconds, yet the hung task watchdog threshold is set to 120 seconds. This doesn't make much sense, since it doesn't give the SCSI device a chance to time out the request. I highly recommend increasing the hung task watchdog timeout to around 300 seconds:
# echo 300 > /proc/sys/kernel/hung_task_timeout_secs
Unless it is required, I also suggest disabling the hung_task_panic setting to avoid unnecessary panics:
# echo 0 > /proc/sys/kernel/hung_task_panic
Looking at when the I/O requests were last issued:
crash> jiffies
jiffies = $7 = 4294908305
From dm-0:
crash> pd (4294908305-4294807248)/1000
$10 = 101
An I/O request was issued to dm-0 101 seconds ago.
From sda:
crash> pd 4294908305-4294908305
$8 = 0
An I/O request was issued to sda very recently (less than a second ago).
It does not appear that any of the I/Os are getting stuck within the guest itself, so this appears to be a case of disk congestion that may have been exacerbated by a busy host system.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.