System crashed with error 'INFO: task oracle:1230 blocked for more than 120 seconds'
Environment
- Red Hat Enterprise Linux 6, 7, 8, 9
- VMware guest
Issue
- The virtual guest configured in VMware environment crashed with following error messages on console:
INFO: task oracle:1230 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle D 000000000000000f 0 1230 1 0x00000080
ffff8804f1acbac8 0000000000000082 0000000000000000 ffffffff8108107d
ffffea0019ccabf8 000000002e76a93d ffff880839c40938 ffff880839c40848
ffff8805df8a1058 ffff8804f1acbfd8 000000000000fb88 ffff8805df8a1058
Call Trace:
[<ffffffff8108107d>] ? del_timer+0x7d/0xe0
[<ffffffff8150e3a3>] io_schedule+0x73/0xc0
[<ffffffff811beabe>] __blockdev_direct_IO_newtrunc+0x6de/0xb30
[<ffffffff811bef6e>] __blockdev_direct_IO+0x5e/0xd0
[<ffffffff811bb340>] ? blkdev_get_blocks+0x0/0xc0
[<ffffffff811bc407>] blkdev_direct_IO+0x57/0x60
[<ffffffff811bb340>] ? blkdev_get_blocks+0x0/0xc0
[<ffffffff8111bbbb>] generic_file_aio_read+0x6bb/0x700
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81180e20>] ? do_sync_read+0x0/0x140
[<ffffffff811bb953>] blkdev_aio_read+0x53/0xc0
[<ffffffff811bb900>] ? blkdev_aio_read+0x0/0xc0
[<ffffffff81180f1a>] do_sync_read+0xfa/0x140
[<ffffffff8105af54>] ? thread_group_times+0x44/0xb0
[<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8106b3be>] ? mmput+0x1e/0x120
[<ffffffff8121bc86>] ? security_file_permission+0x16/0x20
[<ffffffff81181805>] vfs_read+0xb5/0x1a0
[<ffffffff81181b32>] sys_pread64+0x82/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 177, comm: khungtaskd Tainted: G W --------------- 2.6.32-358.11.1.el6.x86_64 #1
Resolution
- Please check with VMware support to verify if there was any issue present from hypervisor level due to which the IO requests on VMware virtual disks were unable to complete? Are there any error messages for Virtual disks in hypervisor logs?
Root Cause
-
Oracle process was stuck in UN state waiting for completion of IO on VMware virtual disk, but due to the issue in the connectivity to VMware virtual disks the process was not able to complete the IO. As the
kernel.hung_task_panicflag was enabled, once thekhungtaskdkernel thread found that above Oracle process is stuck in UN state for more than 120 seconds, it had crashed the system.The Oracle process was stuck in UN state for at least 2 minutes 26 seconds:
[ 0 00:00:00.020] [UN] PID: 7397 TASK: ffff8805a10e6080 CPU: 8 COMMAND: "oracle" [ 0 00:02:26.228] [UN] PID: 1230 TASK: ffff8805df8a0aa0 CPU: 15 COMMAND: "oracle" <---kernel.hung_task_panicsysctl parameter was enabled:crash> p sysctl_hung_task_panic sysctl_hung_task_panic = $32 = 1
Diagnostic Steps
-
The logs in vmcore shows that Oracle processes were stuck in UN (uninterruptible) state waiting for completion of IO operations on VMware virtual disk
/dev/sdlcrash> bt PID: 1230 TASK: ffff8805df8a0aa0 CPU: 15 COMMAND: "oracle" #0 [ffff8804f1acba08] schedule at ffffffff8150dbc2 #1 [ffff8804f1acbad0] io_schedule at ffffffff8150e3a3 #2 [ffff8804f1acbaf0] __blockdev_direct_IO_newtrunc at ffffffff811beabe #3 [ffff8804f1acbbd0] __blockdev_direct_IO at ffffffff811bef6e #4 [ffff8804f1acbc50] blkdev_direct_IO at ffffffff811bc407 #5 [ffff8804f1acbc90] generic_file_aio_read at ffffffff8111bbbb #6 [ffff8804f1acbd70] blkdev_aio_read at ffffffff811bb953 #7 [ffff8804f1acbdc0] do_sync_read at ffffffff81180f1a #8 [ffff8804f1acbef0] vfs_read at ffffffff81181805 #9 [ffff8804f1acbf30] sys_pread64 at ffffffff81181b32 #10 [ffff8804f1acbf80] system_call_fastpath at ffffffff8100b072 RIP: 0000003e6640ee33 RSP: 00007fff9f9c8790 RFLAGS: 00000206 RAX: 0000000000000011 RBX: ffffffff8100b072 RCX: 000000045a23e370 RDX: 0000000000002000 RSI: 00000003950c8000 RDI: 0000000000000103 RBP: 00007fff9f9c8df0 R8: 0000000000002000 R9: 00000003950c8000 R10: 000000103dc14000 R11: 0000000000000246 R12: 000509d80132ff27 R13: 0000000000000200 R14: 00007fff9f9c8ee8 R15: 0000000000000200 ORIG_RAX: 0000000000000011 CS: 0033 SS: 002b crash> p ((struct gendisk *) 0xffff8808376c2c00)->disk_name $23 = "sdl\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000l" crash> p ((struct gendisk *) 0xffff8808376c2c00)->queue $24 = (struct request_queue *) 0xffff880836eb8be8Disk
/dev/sdlhad 3 IO requests in flight:crash> p ((struct request_queue *) 0xffff880836eb8be8)->in_flight $30 = {0, 3} -
While checking more detailed information about statistics on
/dev/sdlit reveals that there was high number of error count on it as marked in following snip:Ndx Device scsi_device vendor model rev. iorq-cnt done-cnt err-cnt timeout state --- -------------------- ------------------ ------------ ---------------- -------- -------- -------- ------ ------- -------- ------------ 0 3:0:0:0 sdj 0xFFFF880839C47800 VMware Virtual disk 1.0 1300654227 1300654227 ( 0) 210 -- RUNNING 1 3:0:1:0 sdk 0xFFFF880839C43800 VMware Virtual disk 1.0 1297540099 1297540099 ( 0) 226 -- RUNNING 2 3:0:2:0 sdl 0xFFFF880839C40800 VMware Virtual disk 1.0 1375011630 1375011627 ( 3) 200 -- RUNNING ^^^^^^^ crash> p ((struct scsi_device* ) 0xFFFF880839C40800)->ioerr_cnt $39 = { counter = 200 } -
The high error count is not observed only for virtual disk /dev/sdl, but for all the VMware virtual disks we see huge error count:
Ndx Device scsi_device vendor model rev. iorq-cnt done-cnt err-cnt timeout state --- -------------------- ------------------ ------------ ---------------- -------- -------- -------- ------ ------- -------- ------------ 0 3:0:0:0 sdj 0xFFFF880839C47800 VMware Virtual disk 1.0 1300654227 1300654227 ( 0) 210 -- RUNNING 1 3:0:1:0 sdk 0xFFFF880839C43800 VMware Virtual disk 1.0 1297540099 1297540099 ( 0) 226 -- RUNNING 2 3:0:2:0 sdl 0xFFFF880839C40800 VMware Virtual disk 1.0 1375011630 1375011627 ( 3) 200 -- RUNNING 3 3:0:3:0 sdm 0xFFFF880839C3E800 VMware Virtual disk 1.0 1280605918 1280605918 ( 0) 269 -- RUNNING 4 3:0:4:0 sdn 0xFFFF88083A7C5800 VMware Virtual disk 1.0 1287548175 1287548175 ( 0) 279 -- RUNNING 9 3:0:10:0 sds 0xFFFF88083A79F800 VMware Virtual disk 1.0 1413935164 1413935162 ( 2) 270 -- RUNNING ^^^^^^^ -
The logs in vmcore also shows error some of the Vmware virtual disks were removed/disconnected:
vmw_pvscsi: msg type: 0x1 - MSG RING: 1/0 (5) vmw_pvscsi: msg: device removed at scsi0:5:0 vmw_pvscsi: msg type: 0x1 - MSG RING: 2/1 (5) vmw_pvscsi: msg: device removed at scsi0:10:0 -
As per above information it Oracle process (PID: 1230) was stuck in UN state waiting for completion of IO on VMware virtual disk devices, but due to there was some issue in the connecitivty to VMware virtual disks due to which process was not able to complete the IO.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.