System crashed with error 'INFO: task oracle:1230 blocked for more than 120 seconds'

Solution Verified - Updated 5 Aug 2024

Environment

Red Hat Enterprise Linux 6, 7, 8, 9
VMware guest

Issue

The virtual guest configured in VMware environment crashed with following error messages on console:

INFO: task oracle:1230 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle        D 000000000000000f     0 1230      1 0x00000080
 ffff8804f1acbac8 0000000000000082 0000000000000000 ffffffff8108107d
 ffffea0019ccabf8 000000002e76a93d ffff880839c40938 ffff880839c40848
 ffff8805df8a1058 ffff8804f1acbfd8 000000000000fb88 ffff8805df8a1058
Call Trace:
 [<ffffffff8108107d>] ? del_timer+0x7d/0xe0
 [<ffffffff8150e3a3>] io_schedule+0x73/0xc0
 [<ffffffff811beabe>] __blockdev_direct_IO_newtrunc+0x6de/0xb30
 [<ffffffff811bef6e>] __blockdev_direct_IO+0x5e/0xd0
 [<ffffffff811bb340>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff811bc407>] blkdev_direct_IO+0x57/0x60
 [<ffffffff811bb340>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff8111bbbb>] generic_file_aio_read+0x6bb/0x700
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81180e20>] ? do_sync_read+0x0/0x140
 [<ffffffff811bb953>] blkdev_aio_read+0x53/0xc0
 [<ffffffff811bb900>] ? blkdev_aio_read+0x0/0xc0
 [<ffffffff81180f1a>] do_sync_read+0xfa/0x140
 [<ffffffff8105af54>] ? thread_group_times+0x44/0xb0
 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8106b3be>] ? mmput+0x1e/0x120
 [<ffffffff8121bc86>] ? security_file_permission+0x16/0x20
 [<ffffffff81181805>] vfs_read+0xb5/0x1a0
 [<ffffffff81181b32>] sys_pread64+0x82/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 177, comm: khungtaskd Tainted: G        W  ---------------    2.6.32-358.11.1.el6.x86_64 #1

Resolution

Please check with VMware support to verify if there was any issue present from hypervisor level due to which the IO requests on VMware virtual disks were unable to complete? Are there any error messages for Virtual disks in hypervisor logs?

Root Cause

Oracle process was stuck in UN state waiting for completion of IO on VMware virtual disk, but due to the issue in the connectivity to VMware virtual disks the process was not able to complete the IO. As the kernel.hung_task_panic flag was enabled, once the khungtaskd kernel thread found that above Oracle process is stuck in UN state for more than 120 seconds, it had crashed the system.

The Oracle process was stuck in UN state for at least 2 minutes 26 seconds:
```
  [  0 00:00:00.020] [UN]  PID: 7397   TASK: ffff8805a10e6080  CPU: 8   COMMAND: "oracle"
  [  0 00:02:26.228] [UN]  PID: 1230  TASK: ffff8805df8a0aa0  CPU: 15  COMMAND: "oracle"		<---
```
kernel.hung_task_panic sysctl parameter was enabled:
```
  crash> p sysctl_hung_task_panic
  sysctl_hung_task_panic = $32 = 1
```

Diagnostic Steps

The logs in vmcore shows that Oracle processes were stuck in UN (uninterruptible) state waiting for completion of IO operations on VMware virtual disk /dev/sdl

  crash> bt 
  PID: 1230  TASK: ffff8805df8a0aa0  CPU: 15  COMMAND: "oracle"
   #0 [ffff8804f1acba08] schedule at ffffffff8150dbc2
   #1 [ffff8804f1acbad0] io_schedule at ffffffff8150e3a3
   #2 [ffff8804f1acbaf0] __blockdev_direct_IO_newtrunc at ffffffff811beabe
   #3 [ffff8804f1acbbd0] __blockdev_direct_IO at ffffffff811bef6e
   #4 [ffff8804f1acbc50] blkdev_direct_IO at ffffffff811bc407
   #5 [ffff8804f1acbc90] generic_file_aio_read at ffffffff8111bbbb
   #6 [ffff8804f1acbd70] blkdev_aio_read at ffffffff811bb953
   #7 [ffff8804f1acbdc0] do_sync_read at ffffffff81180f1a
   #8 [ffff8804f1acbef0] vfs_read at ffffffff81181805
   #9 [ffff8804f1acbf30] sys_pread64 at ffffffff81181b32
  #10 [ffff8804f1acbf80] system_call_fastpath at ffffffff8100b072
      RIP: 0000003e6640ee33  RSP: 00007fff9f9c8790  RFLAGS: 00000206
      RAX: 0000000000000011  RBX: ffffffff8100b072  RCX: 000000045a23e370
      RDX: 0000000000002000  RSI: 00000003950c8000  RDI: 0000000000000103
      RBP: 00007fff9f9c8df0   R8: 0000000000002000   R9: 00000003950c8000
      R10: 000000103dc14000  R11: 0000000000000246  R12: 000509d80132ff27
      R13: 0000000000000200  R14: 00007fff9f9c8ee8  R15: 0000000000000200
      ORIG_RAX: 0000000000000011  CS: 0033  SS: 002b
  
  crash> p ((struct gendisk *) 0xffff8808376c2c00)->disk_name
  $23 = "sdl\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000l"
  
  crash> p ((struct gendisk *) 0xffff8808376c2c00)->queue
  $24 = (struct request_queue *) 0xffff880836eb8be8

Disk /dev/sdl had 3 IO requests in flight:

  crash> p ((struct request_queue *) 0xffff880836eb8be8)->in_flight
  $30 = 
   {0, 3}

While checking more detailed information about statistics on /dev/sdl it reveals that there was high number of error count on it as marked in following snip:

  Ndx   Device               scsi_device        vendor       model            rev.     iorq-cnt done-cnt        err-cnt timeout  state
  --- -------------------- ------------------ ------------ ---------------- -------- -------- -------- ------ ------- -------- ------------
    0 3:0:0:0    sdj       0xFFFF880839C47800 VMware       Virtual disk     1.0      1300654227 1300654227 (  0)      210       -- RUNNING
    1 3:0:1:0    sdk       0xFFFF880839C43800 VMware       Virtual disk     1.0      1297540099 1297540099 (  0)      226       -- RUNNING
    2 3:0:2:0    sdl       0xFFFF880839C40800 VMware       Virtual disk     1.0      1375011630 1375011627 (  3)      200       -- RUNNING
  														^^^^^^^
  crash> p ((struct scsi_device* ) 0xFFFF880839C40800)->ioerr_cnt
  $39 = {
    counter = 200
  }

The high error count is not observed only for virtual disk /dev/sdl, but for all the VMware virtual disks we see huge error count:

  Ndx   Device               scsi_device        vendor       model            rev.     iorq-cnt done-cnt        err-cnt timeout  state
  --- -------------------- ------------------ ------------ ---------------- -------- -------- -------- ------ ------- -------- ------------
    0 3:0:0:0    sdj       0xFFFF880839C47800 VMware       Virtual disk     1.0      1300654227 1300654227 (  0)      210       -- RUNNING
    1 3:0:1:0    sdk       0xFFFF880839C43800 VMware       Virtual disk     1.0      1297540099 1297540099 (  0)      226       -- RUNNING
    2 3:0:2:0    sdl       0xFFFF880839C40800 VMware       Virtual disk     1.0      1375011630 1375011627 (  3)      200       -- RUNNING
    3 3:0:3:0    sdm       0xFFFF880839C3E800 VMware       Virtual disk     1.0      1280605918 1280605918 (  0)      269       -- RUNNING
    4 3:0:4:0    sdn       0xFFFF88083A7C5800 VMware       Virtual disk     1.0      1287548175 1287548175 (  0)      279       -- RUNNING
    9 3:0:10:0   sds       0xFFFF88083A79F800 VMware       Virtual disk     1.0      1413935164 1413935162 (  2)      270       -- RUNNING
  														^^^^^^^

The logs in vmcore also shows error some of the Vmware virtual disks were removed/disconnected:

  vmw_pvscsi: msg type: 0x1 - MSG RING: 1/0 (5) 
  vmw_pvscsi: msg: device removed at scsi0:5:0
  vmw_pvscsi: msg type: 0x1 - MSG RING: 2/1 (5) 
  vmw_pvscsi: msg: device removed at scsi0:10:0

As per above information it Oracle process (PID: 1230) was stuck in UN state waiting for completion of IO on VMware virtual disk devices, but due to there was some issue in the connecitivty to VMware virtual disks due to which process was not able to complete the IO.

SBR

Storage

Product(s)

Red Hat Enterprise Linux

Components

kernel

Category

Troubleshoot

Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.