Server crashed with hung task panic trigger after "jbd2/dm-9-8:762" was blocked for more than 120 seconds.


Environment

  • Red Hat Enterprise Linux 5.6

Issue

  • Server crashed with hung task panic trigger after "jbd2/dm-9-8:762" was blocked for more than 120 seconds.
crash> log
INFO: task jbd2/dm-9-8:762 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[...]
Kernel panic - not syncing: hung_task: blocked tasks

Resolution

Do not enable the hung_task_panic setting. Given the chance, the fsync() operation and the journal commit operations will eventually complete, and any blocked tasks will be allowed to run again.

Optionally, consider calling fsync() more frequently to reduce the number of outstanding dirty pages and therefore the time taken by any single fsync() operation. Also consider analysing the I/O access pattern of large files to ensure they are written serially or are allocated physically contiguously on disk.
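The "fsync() more frequently" advice can be sketched as follows. This is a minimal illustration, not code from the affected application: the chunk size, flush threshold, and function name are all invented for the example.

```python
import os
import tempfile

CHUNK = 64 * 1024                # write size per iteration (illustrative value)
FSYNC_EVERY = 16 * 1024 * 1024   # flush after this many dirty bytes (illustrative value)

def write_with_periodic_fsync(path, total_bytes):
    """Write total_bytes to path, fsync()ing before the dirty backlog grows large."""
    dirty = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = total_bytes
        while remaining > 0:
            n = os.write(fd, b"\0" * min(CHUNK, remaining))
            remaining -= n
            dirty += n
            if dirty >= FSYNC_EVERY:
                os.fsync(fd)     # flush now, while the backlog is still small
                dirty = 0
        os.fsync(fd)             # final flush of the tail
    finally:
        os.close(fd)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_with_periodic_fsync(demo_path, 200_000)
```

Bounding the dirty backlog this way keeps each individual fsync() short, so no single call holds a journal handle long enough to stall the commit thread for minutes.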

Root Cause

A process doing an fsync() operation on a large, sparsely populated file is taking a long time to flush all of its dirty pages to disk. During this time the fsync() operation holds a reference to the currently running transaction, which prevents the journal task from committing that transaction. Any other task needing to start a new transaction must wait until the journal task commits the current one, and therefore must also wait for the fsync() operation to complete.

Diagnostic Steps

  • The server rebooted because the hung_task_panic trigger was set:

    kernel.hung_task_panic = 1

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-238.el5/vmlinux
    DUMPFILE: vmcore
        CPUS: 16
        DATE: Mon Nov 11 23:51:14 2013
      UPTIME: 3 days, 17:12:04
LOAD AVERAGE: 4.83, 2.75, 1.82 <------ There was not much load on the server
       TASKS: 587
     RELEASE: 2.6.18-238.el5
     VERSION: #1 SMP Sun Dec 19 14:22:44 EST 2010
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 63.1 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 996
     COMMAND: "khungtaskd"
        TASK: ffff81042f9a9860  [THREAD_INFO: ffff81082f510000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)
  • Ring Buffer
crash> log
INFO: task jbd2/dm-9-8:762 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-9-8   D ffff81043e0a1420     0   762    585           763   748 (L-TLB)
 ffff81041daa7d60 0000000000000046 ffff81041daa7d00 ffffffff8008cf5f
 0000000000000001 000000000000000a ffff81102aa04080 ffff810c3e1197e0
 000123b39f5946b9 0000000000000bab ffff81102aa04268 0000000300000000
Call Trace:
 [<ffffffff8008cf5f>] find_busiest_group+0x20d/0x621
 [<ffffffff88582f6d>] :jbd2:jbd2_journal_commit_transaction+0x191/0x1068
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8003dbe6>] lock_timer_base+0x1b/0x3c
 [<ffffffff8004b23a>] try_to_del_timer_sync+0x7f/0x88
 [<ffffffff88586f7c>] :jbd2:kjournald2+0x9a/0x1ec
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88586ee2>] :jbd2:kjournald2+0x0/0x1ec
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032974>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032876>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Kernel panic - not syncing: hung_task: blocked tasks
  • Backtraces
crash>  !grep UN ps_udt.txt
[  0 00:00:00.002] UN+[UN]  PID: 2495   TASK: ffff81042fa117a0  CPU: 10  COMMAND: "klogd"
[  0 00:00:00.002] UN+[UN]  PID: 647    TASK: ffff81102d154040  CPU: 8   COMMAND: "kjournald"
[  0 00:00:00.002] UN+[UN]  PID: 2489   TASK: ffff81042f959080  CPU: 9   COMMAND: "syslogd"
[  0 00:00:00.004] UN+[UN]  PID: 23364  TASK: ffff8105c6a75040  CPU: 8   COMMAND: "ibbackup"
[  0 00:02:03.805] UN+[UN]  PID: 14715  TASK: ffff8110139f2040  CPU: 2   COMMAND: "mysqld"
[  0 00:02:04.189] UN+[UN]  PID: 28414  TASK: ffff810c7ed88080  CPU: 5   COMMAND: "pdflush"
[  0 00:02:04.435] UN+[UN]  PID: 762    TASK: ffff81102aa04080  CPU: 3   COMMAND: "jbd2/dm-9-8"
[  0 00:02:04.435] UN+[UN]  PID: 14705  TASK: ffff8106b0c370c0  CPU: 3   COMMAND: "mysqld"
Task states: RU (  17), IN( 562), UN(   8), ZO( 0), ST( 0), DE( 0), SW( 0)
[RU] = Running   [IN] = Interruptible   [UN] = *Uninterruptible*

Four processes have been pending for more than 120 seconds. All four have ext4/jbd2 stack traces:

crash> bt 14705
PID: 14705  TASK: ffff8106b0c370c0  CPU: 3   COMMAND: "mysqld"
 #0 [ffff8106bfac7db8] schedule at ffffffff80062f8e
 #1 [ffff8106bfac7e90] jbd2_log_wait_commit at ffffffff88586568 [jbd2]
 #2 [ffff8106bfac7ef0] ext4_sync_file at ffffffff8859bc6c [ext4]
 #3 [ffff8106bfac7f30] do_fsync at ffffffff80050310
 #4 [ffff8106bfac7f60] __do_fsync at ffffffff800e33fe
 #5 [ffff8106bfac7f80] system_call at ffffffff8005d116
    RIP: 000000307240e047  RSP: 0000000048a2d088  RFLAGS: 00010206
    RAX: 000000000000004a  RBX: ffffffff8005d116  RCX: 0000000000000000
    RDX: 0000000000000002  RSI: 0000000000000000  RDI: 0000000000000009
    RBP: 0000000000000009   R8: 000000007ff00000   R9: 000000007feff800
    R10: 000000006e818200  R11: 0000000000000206  R12: 0000000048a2cec0
    R13: 0000000000000000  R14: ffffffff800e33fe  R15: 0000000051eb851f
    ORIG_RAX: 000000000000004a  CS: 0033  SS: 002b
crash> bt 762
PID: 762    TASK: ffff81102aa04080  CPU: 3   COMMAND: "jbd2/dm-9-8"
 #0 [ffff81041daa7c90] schedule at ffffffff80062f8e
 #1 [ffff81041daa7d68] jbd2_journal_commit_transaction at ffffffff88582f6d [jbd2]
 #2 [ffff81041daa7e78] kjournald2 at ffffffff88586f7c [jbd2]
 #3 [ffff81041daa7ee8] kthread at ffffffff80032974
 #4 [ffff81041daa7f48] kernel_thread at ffffffff8005dfb1
crash> bt 28414
PID: 28414  TASK: ffff810c7ed88080  CPU: 5   COMMAND: "pdflush"
 #0 [ffff810b72183aa0] schedule at ffffffff80062f8e
 #1 [ffff810b72183b78] start_this_handle at ffffffff88581c9e [jbd2]
 #2 [ffff810b72183bf8] jbd2_journal_start at ffffffff88581e09 [jbd2]
 #3 [ffff810b72183c28] ext4_da_writepages at ffffffff885a381c [ext4]
 #4 [ffff810b72183d08] do_writepages at ffffffff8005ac1d
 #5 [ffff810b72183d18] __writeback_single_inode at ffffffff8002fc9d
 #6 [ffff810b72183da8] sync_sb_inodes at ffffffff80020fd6
 #7 [ffff810b72183de8] writeback_inodes at ffffffff80051147
 #8 [ffff810b72183e08] wb_kupdate at ffffffff800cbb98
 #9 [ffff810b72183e78] pdflush at ffffffff80056783
#10 [ffff810b72183ee8] kthread at ffffffff80032974
#11 [ffff810b72183f48] kernel_thread at ffffffff8005dfb1
crash> bt 14715
PID: 14715  TASK: ffff8110139f2040  CPU: 2   COMMAND: "mysqld"
 #0 [ffff810e0e715b08] schedule at ffffffff80062f8e
 #1 [ffff810e0e715be0] start_this_handle at ffffffff88581c9e [jbd2]
 #2 [ffff810e0e715c60] jbd2_journal_start at ffffffff88581e09 [jbd2]
 #3 [ffff810e0e715c90] ext4_dirty_inode at ffffffff885a40d8 [ext4]
 #4 [ffff810e0e715cb0] __mark_inode_dirty at ffffffff80013c94
 #5 [ffff810e0e715ce0] __generic_file_aio_write_nolock at ffffffff8001659a
 #6 [ffff810e0e715d80] generic_file_aio_write at ffffffff80021850
 #7 [ffff810e0e715dd0] ext4_file_write at ffffffff8859ba1c [ext4]
 #8 [ffff810e0e715e00] do_sync_write at ffffffff800182df
 #9 [ffff810e0e715f10] vfs_write at ffffffff80016a81
#10 [ffff810e0e715f40] sys_write at ffffffff80017339
#11 [ffff810e0e715f80] system_call at ffffffff8005d116

The task reported as being blocked is:

crash> set 762
    PID: 762
COMMAND: "jbd2/dm-9-8"
   TASK: ffff81102aa04080  [THREAD_INFO: ffff81041daa6000]
    CPU: 3
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt
PID: 762    TASK: ffff81102aa04080  CPU: 3   COMMAND: "jbd2/dm-9-8"
 #0 [ffff81041daa7c90] schedule at ffffffff80062f8e
 #1 [ffff81041daa7d68] jbd2_journal_commit_transaction at ffffffff88582f6d [jbd2]
 #2 [ffff81041daa7e78] kjournald2 at ffffffff88586f7c [jbd2]
 #3 [ffff81041daa7ee8] kthread at ffffffff80032974
 #4 [ffff81041daa7f48] kernel_thread at ffffffff8005dfb1

It's in this code in jbd2_journal_commit_transaction():

	while (commit_transaction->t_updates) {
		DEFINE_WAIT(wait);

		prepare_to_wait(&journal->j_wait_updates, &wait,
					TASK_UNINTERRUPTIBLE);
		if (commit_transaction->t_updates) {
			spin_unlock(&commit_transaction->t_handle_lock);
			spin_unlock(&journal->j_state_lock);
			schedule();
			spin_lock(&journal->j_state_lock);
			spin_lock(&commit_transaction->t_handle_lock);
		}
		finish_wait(&journal->j_wait_updates, &wait);
	}

This is the journal task committing the current running transaction, but it is held up because the t_updates reference counter on the transaction is non-zero, indicating that other tasks are still using the transaction.
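The t_updates handshake can be pictured in userspace terms as a reference count plus a wait queue. The sketch below is purely illustrative (the class and method names are invented); it mirrors the kernel pattern of handles taking a reference on the running transaction while the commit thread sleeps on j_wait_updates until the count drops to zero.

```python
import threading

class Transaction:
    """Userspace analogue of a jbd2 running transaction (illustrative only)."""
    def __init__(self):
        self.t_updates = 0                     # outstanding handle count
        self.cond = threading.Condition()      # stands in for j_wait_updates

    def start_handle(self):
        with self.cond:
            self.t_updates += 1

    def stop_handle(self):
        with self.cond:
            self.t_updates -= 1
            if self.t_updates == 0:
                self.cond.notify_all()         # wake the committing thread

    def wait_for_updates(self):
        # What jbd2_journal_commit_transaction() is doing in frame #1:
        # sleep until every outstanding handle has been released.
        with self.cond:
            while self.t_updates:
                self.cond.wait()

txn = Transaction()
txn.start_handle()                             # an fsync()ing task holds a handle
threading.Timer(0.05, txn.stop_handle).start() # ...and releases it some time later
txn.wait_for_updates()                         # the commit blocks until then
```

The key point for this case is that wait_for_updates() can block indefinitely through no fault of its own: it is entirely at the mercy of whichever task is slowest to release its handle.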

So we need to find any tasks that could be using that transaction and see what's holding them up.

Looking at the current blocked tasks:

crash> ps | grep UN
    647    585   8  ffff81102d154040  UN   0.0       0      0  [kjournald]
    762    585   3  ffff81102aa04080  UN   0.0       0      0  [jbd2/dm-9-8]
   2489      1   9  ffff81042f959080  UN   0.0    5932    708  syslogd
   2495      1  10  ffff81042fa117a0  UN   0.0    3828    416  klogd
  14705   8348   3  ffff8106b0c370c0  UN  75.8 51852436 51453640  mysqld
  14715   8348   2  ffff8110139f2040  UN  75.8 51852436 51453640  mysqld
  23364  27233   8  ffff8105c6a75040  UN   0.0   12104   5356  ibbackup
  28414    585   5  ffff810c7ed88080  UN   0.0       0      0  [pdflush]

This one looks likely:

crash> set 23364
    PID: 23364
COMMAND: "ibbackup"
   TASK: ffff8105c6a75040  [THREAD_INFO: ffff810750ed0000]
    CPU: 8
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt
PID: 23364  TASK: ffff8105c6a75040  CPU: 8   COMMAND: "ibbackup"
 #0 [ffff810750ed1548] schedule at ffffffff80062f8e
 #1 [ffff810750ed1620] io_schedule at ffffffff800637ca
 #2 [ffff810750ed1640] sync_buffer at ffffffff800154ed
 #3 [ffff810750ed1650] __wait_on_bit at ffffffff800639f6
 #4 [ffff810750ed1690] out_of_line_wait_on_bit at ffffffff80063a90
 #5 [ffff810750ed1700] ext4_mb_init_cache at ffffffff885b89cb [ext4]
 #6 [ffff810750ed17c0] ext4_mb_load_buddy at ffffffff885b9399 [ext4]
 #7 [ffff810750ed1810] ext4_mb_regular_allocator at ffffffff885b9df2 [ext4]
 #8 [ffff810750ed18c0] ext4_mb_new_blocks at ffffffff885bc964 [ext4]
 #9 [ffff810750ed1930] ext4_ext_get_blocks at ffffffff885b4c19 [ext4]
#10 [ffff810750ed1ac0] ext4_get_blocks at ffffffff8859fe97 [ext4]
#11 [ffff810750ed1b10] mpage_da_map_blocks at ffffffff885a0057 [ext4]
#12 [ffff810750ed1c60] mpage_add_bh_to_extent at ffffffff885a06c5 [ext4]
#13 [ffff810750ed1c90] __mpage_da_writepage at ffffffff885a217e [ext4]
#14 [ffff810750ed1cc0] write_cache_pages at ffffffff800f61e7
#15 [ffff810750ed1de0] ext4_da_writepages at ffffffff885a389a [ext4]
#16 [ffff810750ed1ec0] do_writepages at ffffffff8005ac1d
#17 [ffff810750ed1ed0] __filemap_fdatawrite_range at ffffffff8004fbe4
#18 [ffff810750ed1f30] do_fsync at ffffffff800502ed
#19 [ffff810750ed1f60] __do_fsync at ffffffff800e33fe
#20 [ffff810750ed1f80] system_call at ffffffff8005d116
    RIP: 00000030718cd620  RSP: 00007fff39d218e0  RFLAGS: 00000246
    RAX: 000000000000004a  RBX: ffffffff8005d116  RCX: 00000030718c46b3
    RDX: 0000000000001900  RSI: 0000000007ce7d28  RDI: 0000000000000006
    RBP: 0000000051eb851f   R8: 0000000000000028   R9: 0000000000100000
    R10: 000000288b700000  R11: 0000000000000246  R12: 0000000000000006
    R13: 0000000000000000  R14: ffffffff800e33fe  R15: 0000000000040000
    ORIG_RAX: 000000000000004a  CS: 0033  SS: 002b

Pulling the file pointer off the stack and comparing the block device with the journal's filesystem block device:

crash> struct file.f_vfsmnt ffff81082d6472c0
  f_vfsmnt = 0xffff81010eb0b380

crash> vfsmount.mnt_sb 0xffff81010eb0b380
  mnt_sb = 0xffff81042e510800

crash> super_block.s_bdev 0xffff81042e510800
  s_bdev = 0xffff81042e3710c0

crash> journal_t.j_dev,j_fs_dev ffff81042d1cb000
  j_dev = 0xffff81042e3710c0
  j_fs_dev = 0xffff81042e3710c0

That's a match, so this task is operating on the same filesystem.

From the stack above we can see that ext4_da_writepages() has called write_cache_pages(), and this happens after starting a transaction.

Pulling the transaction handle off the stack:

crash> handle_t.h_transaction ffff81053ff0efd8
  h_transaction = 0xffff81082fbfcc80

crash> journal_t.j_running_transaction ffff81042d1cb000
  j_running_transaction = 0xffff81082fbfcc80

That's a match with the transaction the journal task is trying to commit. So what's holding up the ibbackup task?

crash> task -R last_ran
PID: 23364  TASK: ffff8105c6a75040  CPU: 8   COMMAND: "ibbackup"
  last_ran = 320853786401127, 

crash> runq | grep "CPU 8"
CPU 8 RUNQUEUE: ffff810001015120

crash> rq.timestamp_last_tick ffff810001015120
  timestamp_last_tick = 320853791294947

crash> pd (320853791294947-320853786401127)
$1 = 4893820

crash> pd (320853791294947-320853786401127)/1000000000
$2 = 0

It has been off the CPU for only about 4.9 milliseconds, so it is still making progress. It simply has a lot of work to do to fsync the file, and this has caused the delay.
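As a cross-check, the same arithmetic in full precision (both values from the dump are nanosecond timestamps):

```python
timestamp_last_tick = 320853791294947   # rq.timestamp_last_tick on CPU 8
last_ran = 320853786401127              # the ibbackup task's last_ran

delta_ns = timestamp_last_tick - last_ran
delta_ms = delta_ns / 1_000_000
print(delta_ns, round(delta_ms, 2))     # 4893820 ns, i.e. about 4.89 ms
```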

crash> struct file.f_mapping ffff81082d6472c0
  f_mapping = 0xffff8108800f99a8

crash> address_space.host,nrpages ffff8108800f99a8
  host = 0xffff8108800f9898
  nrpages = 1072292

Potentially 1 million pages to write out.

crash> pd 1072292*4096
$4 = 97140736

Note that pd has truncated the result to 32 bits here: the true total is 1072292 × 4096 = 4,392,108,032 bytes, roughly 4.1 GB of cached pages, which is consistent with the million-page count above.
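Redoing the multiplication in full width confirms that the value pd printed is exactly the 32-bit truncation of the real total:

```python
nrpages = 1072292
PAGE_SIZE = 4096

total_bytes = nrpages * PAGE_SIZE
print(total_bytes)                # 4392108032, about 4.1 GB
print(total_bytes & 0xFFFFFFFF)   # 97140736 -- the 32-bit value pd printed above
```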

crash> inode.i_size 0xffff8108800f9898
  i_size = 174139113472

The file size is 162GB, so there is a good chance the file is sparsely allocated and would require many single-page I/Os to flush to disk.
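On a live system, sparseness can be checked by comparing a file's allocated size against its logical size. A minimal sketch (the demo path and hole size are invented; on a filesystem without sparse-file support the two figures would be close):

```python
import os
import tempfile

# st_blocks is always counted in 512-byte units, regardless of the
# filesystem block size, so allocated bytes = st_blocks * 512.
path = os.path.join(tempfile.mkdtemp(), "sparse.demo")
with open(path, "wb") as f:
    f.seek(64 * 1024 * 1024)       # seeking past EOF leaves a 64 MB hole
    f.write(b"tail")

st = os.stat(path)
logical = st.st_size
allocated = st.st_blocks * 512
print(logical, allocated)          # allocated is far below the logical size
```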

The task is currently blocked on the lock for a buffer:

PID: 23364  TASK: ffff8105c6a75040  CPU: 8   COMMAND: "ibbackup"
 #0 [ffff810750ed1548] schedule at ffffffff80062f8e
 #1 [ffff810750ed1620] io_schedule at ffffffff800637ca
 #2 [ffff810750ed1640] sync_buffer at ffffffff800154ed
 #3 [ffff810750ed1650] __wait_on_bit at ffffffff800639f6
 #4 [ffff810750ed1690] out_of_line_wait_on_bit at ffffffff80063a90
 #5 [ffff810750ed1700] ext4_mb_init_cache at ffffffff885b89cb [ext4]
...

The buffer is:

crash> buffer_head ffff81014ad4ea90
struct buffer_head {
  b_state = 8388652, 
  b_this_page = 0xffff81014ad4ea90, 
  b_page = 0xffff810c311c1038, 
  b_blocknr = 3800452696, 
  b_size = 4096, 
  b_data = 0xffff810c7ec01000 "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377", 
  b_bdev = 0xffff81042e3710c0, 
  b_end_io = 0xffffffff800321e6 <end_buffer_read_sync>, 
  b_private = 0x0 <__mod_license2529>, 
  b_assoc_buffers = {
    next = 0xffff81014ad4ead8, 
    prev = 0xffff81014ad4ead8
  }, 
  b_count = {
    counter = 3
  }
}

Check the state flags:

crash> eval -b 8388652
hexadecimal: 80002c  
    decimal: 8388652  
      octal: 40000054
     binary: 0000000000000000000000000000000000000000100000000000000000101100
   bits set: 23 5 3 2 

The flags include BH_Lock and BH_Req, so the buffer is currently under I/O and this task is waiting for that I/O to complete. Since it has only been waiting for about 4 milliseconds, this does not indicate a problem or a deadlock of any kind, but rather congestion on the storage device. This condition would have cleared on its own had hung_task_panic not been set.
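The b_state decode can be reproduced directly. The bit names in the comment are my reading of the 2.6.18 bh_state_bits enum (2 = BH_Lock, 3 = BH_Req, 5 = BH_Mapped); bit 23 lies above BH_PrivateStart and is a filesystem-private flag:

```python
b_state = 8388652                  # value from the buffer_head above

bits = [i for i in range(64) if b_state & (1 << i)]
print(hex(b_state), bits)          # 0x80002c [2, 3, 5, 23]
```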

The large file that was being fsync()'d in this case was:

crash> files -d 0xffff810997319588
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff810997319588 ffff8108800f9898 ffff81042e510800 REG  /var/lib/mysql/online_Backup/FullBkp/2013-11-11__1-09-24/uidv1_1/uid_master.ibd

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.