Kernel panic due to blocked tasks while ext3 scans group descriptors
Environment
- Red Hat Enterprise Linux 6.5
- ext3
Issue
- System panicked with this stack trace:
INFO: task kjournald:1237 blocked for more than 120 seconds.
Not tainted 2.6.32-431.5.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D 0000000000000002 0 1237 2 0x00000080
ffff880339379d50 0000000000000046 ffff88033b279540 ffffffff8160bf40
0000000000000001 ffff88033fc23280 ffff880339379da0 ffffffff81060b13
ffff880335dcb058 ffff880339379fd8 000000000000fbc8 ffff880335dcb058
Call Trace:
[<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01916f1>] journal_commit_transaction+0x161/0x1310 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81084d0b>] ? try_to_del_timer_sync+0x7b/0xe0
[<ffffffffa0196fe8>] kjournald+0xe8/0x250 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0196f00>] ? kjournald+0x0/0x250 [jbd]
[<ffffffff8109aee6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ae50>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Resolution
- Had the hung_task_panic setting not been enabled, the scanning would have completed in about 3.5 minutes and the system would have continued. Disabling hung_task_panic will therefore prevent further unnecessary panics.
- If the filesystem is full (or nearly full), removing some files may prevent this problem.
- Splitting the ext3 filesystem into several smaller filesystems would reduce the time needed to scan all the group descriptors of any single filesystem.
- Converting from ext3 to ext4 would help, since the layout of the on-disk data structures for ext4 has been optimised to speed up metadata access.
- Converting the filesystem to XFS would also reduce the disk access time.
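As a sketch of the first resolution step, the hung-task watchdog behaviour is controlled through sysctl; the kernel's own hint in the log message shows how to disable the check entirely:

```shell
# Stop the hung-task watchdog from panicking the system (runtime change)
sysctl -w kernel.hung_task_panic=0

# Alternatively, disable the 120 s hung-task check entirely, as the
# log message itself suggests
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Persist the setting across reboots
echo "kernel.hung_task_panic = 0" >> /etc/sysctl.conf
```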
Root Cause
The problem here appears to be a scalability issue where it takes a long time to scan all the group descriptors for a filesystem this large. This blocked another task for more than two minutes and since the hung_task_panic setting was enabled the watchdog triggered a kernel panic.
The filesystem was scanning all of its group descriptors because it was unable to find free disk space. This suggests the filesystem was full.
Diagnostic Steps
Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/447028010/crash/vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Wed Dec 17 02:23:21 2014
UPTIME: 6 days, 22:44:11
LOAD AVERAGE: 7.16, 3.22, 1.77
TASKS: 299
NODENAME: <nodename>
RELEASE: 2.6.32-431.5.1.el6.x86_64
VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
MACHINE: x86_64 (2400 Mhz)
MEMORY: 12 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
PID: 58
COMMAND: "khungtaskd"
TASK: ffff880336115500 [THREAD_INFO: ffff880336116000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
The system panicked because a task was blocked in uninterruptible state for more than 2 minutes and the hung_task_panic setting was enabled.
INFO: task kjournald:1237 blocked for more than 120 seconds.
Not tainted 2.6.32-431.5.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D 0000000000000002 0 1237 2 0x00000080
ffff880339379d50 0000000000000046 ffff88033b279540 ffffffff8160bf40
0000000000000001 ffff88033fc23280 ffff880339379da0 ffffffff81060b13
ffff880335dcb058 ffff880339379fd8 000000000000fbc8 ffff880335dcb058
Call Trace:
[<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01916f1>] journal_commit_transaction+0x161/0x1310 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81084d0b>] ? try_to_del_timer_sync+0x7b/0xe0
[<ffffffffa0196fe8>] kjournald+0xe8/0x250 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0196f00>] ? kjournald+0x0/0x250 [jbd]
[<ffffffff8109aee6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ae50>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
crash> set 1237
PID: 1237
COMMAND: "kjournald"
TASK: ffff880335dcaaa0 [THREAD_INFO: ffff880339378000]
CPU: 2
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 1237 TASK: ffff880335dcaaa0 CPU: 2 COMMAND: "kjournald"
#0 [ffff880339379c90] schedule at ffffffff81527bd2
#1 [ffff880339379d58] journal_commit_transaction at ffffffffa01916f1 [jbd]
#2 [ffff880339379e68] kjournald at ffffffffa0196fe8 [jbd]
#3 [ffff880339379ee8] kthread at ffffffff8109aee6
#4 [ffff880339379f48] kernel_thread at ffffffff8100c20a
Where is it blocked?
crash> dis -rl ffffffffa01916f1
...
/usr/src/debug/kernel-2.6.32-431.5.1.el6/linux-2.6.32-431.5.1.el6.x86_64/fs/jbd/commit.c: 348
0xffffffffa01916ec <journal_commit_transaction+348>: callq 0xffffffff81527820 <schedule>
This corresponds to this code:
void journal_commit_transaction(journal_t *journal)
...
        while (commit_transaction->t_updates) {
                DEFINE_WAIT(wait);

                prepare_to_wait(&journal->j_wait_updates, &wait,
                                TASK_UNINTERRUPTIBLE);
                if (commit_transaction->t_updates) {
                        spin_unlock(&commit_transaction->t_handle_lock);
                        spin_unlock(&journal->j_state_lock);
                        schedule();    <----
                        spin_lock(&journal->j_state_lock);
                        spin_lock(&commit_transaction->t_handle_lock);
                }
                finish_wait(&journal->j_wait_updates, &wait);
        }
So it's waiting for the t_updates counter to drain to zero. This means another process still has the current running transaction open.
Get the transaction:
crash> dis -rl ffffffffa0196fe8
...
/usr/src/debug/kernel-2.6.32-431.5.1.el6/linux-2.6.32-431.5.1.el6.x86_64/fs/jbd/journal.c: 154
0xffffffffa0196fe0 <kjournald+224>: mov %rbx,%rdi
0xffffffffa0196fe3 <kjournald+227>: callq 0xffffffffa0191590 <journal_commit_transaction>
crash> dis journal_commit_transaction
0xffffffffa0191590 <journal_commit_transaction>: push %rbp
0xffffffffa0191591 <journal_commit_transaction+1>: mov %rsp,%rbp
0xffffffffa0191594 <journal_commit_transaction+4>: push %r15
0xffffffffa0191596 <journal_commit_transaction+6>: push %r14
0xffffffffa0191598 <journal_commit_transaction+8>: push %r13
0xffffffffa019159a <journal_commit_transaction+10>: push %r12
0xffffffffa019159c <journal_commit_transaction+12>: push %rbx
...
crash> bt -f
PID: 1237 TASK: ffff880335dcaaa0 CPU: 2 COMMAND: "kjournald"
...
#1 [ffff880339379d58] journal_commit_transaction at ffffffffa01916f1 [jbd]
ffff880339379d60: 0000000000000001 ffff880028316840
ffff880339379d70: 0000000000000000 0000000000000000
ffff880339379d80: 0000000000016840 ffff8803378e9000
ffff880339379d90: 0000000000000000 ffff880335e22780
ffff880339379da0: ffff880339379e60 ffff88033453ac24
ffff880339379db0: ffff88033453ac00 ffff880335dcaaa0
ffff880339379dc0: 0000000000000000 ffff880335dcaaa0
ffff880339379dd0: ffffffff8109b290 ffff88033453ace8
ffff880339379de0: ffff88033453ace8 00000000ffffffff
ffff880339379df0: ffff88033453ad90 0000000000000286
ffff880339379e00: ffff880339379e40 ffffffff81084d0b
ffff880339379e10: 0000000000000001 0000000000000000
ffff880339379e20: ffff880339379e60 00000000c39320a5
ffff880339379e30: ffff88033453ad90 ffff88033453ac00
ffff880339379e40: ffff88033453ac24 ffff88033453ad90
ffff880339379e50: ffff880335dcaaa0 ffff88033453ac98
ffff880339379e60: ffff880339379ee0 ffffffffa0196fe8
#2 [ffff880339379e68] kjournald at ffffffffa0196fe8 [jbd]
...
crash> journal_t.j_running_transaction ffff88033453ac00
j_running_transaction = 0xffff88033688e800
crash> transaction_t.t_updates 0xffff88033688e800
t_updates = 1
There's one reference still on this transaction. Find out who it is:
crash> foreach task > fe-task
crash> grep "journal_info = 0x[^0]" fe-task
journal_info = 0xffff880153485a38,
journal_info = 0xffff8802e0ebec78,
journal_info = 0xffff880153485840,
journal_info = 0xffff880153485a50,
journal_info = 0xffff880153485a08,
Checking each of these handles, this is the only one attached to the running transaction (0xffff88033688e800):
crash> handle_t 0xffff880153485840
struct handle_t {
h_transaction = 0xffff88033688e800,
h_buffer_credits = 20,
h_ref = 1,
h_err = 0,
h_sync = 0,
h_jdata = 0,
h_aborted = 0
}
So it's waiting for this process:
crash> set 2092
PID: 2092
COMMAND: "nfsd"
TASK: ffff880339bbcae0 [THREAD_INFO: ffff880337384000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
#0 [ffff880337385670] schedule at ffffffff81527bd2
#1 [ffff880337385738] io_schedule at ffffffff815283b3
#2 [ffff880337385758] sync_buffer at ffffffff811bf150
#3 [ffff880337385768] __wait_on_bit at ffffffff81528e7f
#4 [ffff8803373857b8] out_of_line_wait_on_bit at ffffffff81528f28
#5 [ffff880337385828] __wait_on_buffer at ffffffff811bf106
#6 [ffff880337385838] bh_submit_read at ffffffff811bfccd
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
#9 [ffff880337385978] ext3_new_block at ffffffffa01ae9c1 [ext3]
#10 [ffff880337385998] ext3_xattr_block_set at ffffffffa01ccfab [ext3]
#11 [ffff880337385a18] ext3_xattr_set_handle at ffffffffa01cd748 [ext3]
#12 [ffff880337385b08] ext3_set_acl at ffffffffa01cdf97 [ext3]
#13 [ffff880337385b78] ext3_init_acl at ffffffffa01ce854 [ext3]
#14 [ffff880337385bc8] ext3_new_inode at ffffffffa01afea4 [ext3]
#15 [ffff880337385c78] ext3_mkdir at ffffffffa01b8eb0 [ext3]
#16 [ffff880337385ce8] vfs_mkdir at ffffffff81198129
#17 [ffff880337385d28] nfsd_create at ffffffffa02d5b80 [nfsd]
#18 [ffff880337385d88] nfsd3_proc_mkdir at ffffffffa02dd913 [nfsd]
#19 [ffff880337385dd8] nfsd_dispatch at ffffffffa02ce425 [nfsd]
#20 [ffff880337385e18] svc_process_common at ffffffffa0250844 [sunrpc]
#21 [ffff880337385e98] svc_process at ffffffffa0250e80 [sunrpc]
#22 [ffff880337385eb8] nfsd at ffffffffa02ceb52 [nfsd]
#23 [ffff880337385ee8] kthread at ffffffff8109aee6
#24 [ffff880337385f48] kernel_thread at ffffffff8100c20a
How long has this task been sleeping?
crash> task -R sched_info
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
sched_info = {
pcount = 7873195,
run_delay = 34228899704,
last_arrival = 600711814084122,
last_queued = 0,
bkl_count = 0
},
crash> runq -t
CPU 0: 600711839147485
600711838570999 PID: 1687 TASK: ffff8803357e5540 COMMAND: "rsyslogd"
CPU 1: 600711838586859
600711838200610 PID: 58 TASK: ffff880336115500 COMMAND: "khungtaskd"
CPU 2: 600711838845660
000000000000000 PID: 0 TASK: ffff88033b282040 COMMAND: "swapper"
CPU 3: 600711839162546
000000000000000 PID: 0 TASK: ffff88033b2c0aa0 COMMAND: "swapper"
crash> pd (600711838586859-600711814084122)
$4 = 24502737
crash> pd (600711838586859-600711814084122)/1000000000
$5 = 0
Less than a second - about 24.5 ms since the task last ran on a CPU, so it is not stuck in one long sleep. It's going to take a bit more work to figure this one out.
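The same arithmetic can be repeated outside of crash; the scheduler clock values in the vmcore are in nanoseconds:

```shell
# sched_clock values from the vmcore are in nanoseconds
now=600711838586859            # runqueue clock on CPU 1
last_arrival=600711814084122   # sched_info.last_arrival for the nfsd task

echo $(( now - last_arrival ))              # nanoseconds since it last ran
echo $(( (now - last_arrival) / 1000000 ))  # ~24 milliseconds
```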
crash> bt
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
#0 [ffff880337385670] schedule at ffffffff81527bd2
#1 [ffff880337385738] io_schedule at ffffffff815283b3
#2 [ffff880337385758] sync_buffer at ffffffff811bf150
#3 [ffff880337385768] __wait_on_bit at ffffffff81528e7f
#4 [ffff8803373857b8] out_of_line_wait_on_bit at ffffffff81528f28
#5 [ffff880337385828] __wait_on_buffer at ffffffff811bf106
#6 [ffff880337385838] bh_submit_read at ffffffff811bfccd
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
#9 [ffff880337385978] ext3_new_block at ffffffffa01ae9c1 [ext3]
#10 [ffff880337385998] ext3_xattr_block_set at ffffffffa01ccfab [ext3]
#11 [ffff880337385a18] ext3_xattr_set_handle at ffffffffa01cd748 [ext3]
#12 [ffff880337385b08] ext3_set_acl at ffffffffa01cdf97 [ext3]
#13 [ffff880337385b78] ext3_init_acl at ffffffffa01ce854 [ext3]
#14 [ffff880337385bc8] ext3_new_inode at ffffffffa01afea4 [ext3]
#15 [ffff880337385c78] ext3_mkdir at ffffffffa01b8eb0 [ext3]
#16 [ffff880337385ce8] vfs_mkdir at ffffffff81198129
#17 [ffff880337385d28] nfsd_create at ffffffffa02d5b80 [nfsd]
#18 [ffff880337385d88] nfsd3_proc_mkdir at ffffffffa02dd913 [nfsd]
#19 [ffff880337385dd8] nfsd_dispatch at ffffffffa02ce425 [nfsd]
#20 [ffff880337385e18] svc_process_common at ffffffffa0250844 [sunrpc]
#21 [ffff880337385e98] svc_process at ffffffffa0250e80 [sunrpc]
#22 [ffff880337385eb8] nfsd at ffffffffa02ceb52 [nfsd]
#23 [ffff880337385ee8] kthread at ffffffff8109aee6
#24 [ffff880337385f48] kernel_thread at ffffffff8100c20a
In ext3_new_blocks() it potentially needs to synchronously scan all the group descriptors which could take a long time. How many has it scanned so far?
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
ffff880337385860: 0000000000000202 000000000000385f
ffff880337385870: 0000000000003860 ffff880326808c00
ffff880337385880: 0000000000003860 0000000000002966
ffff880337385890: ffff880337385970 ffffffffa01ae57f
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
crash> pd 0x3860
$7 = 14432
How many group descriptors does the filesystem have?
crash> inode.i_sb ffff88013f6e4d98
i_sb = 0xffff880335e60400
crash> super_block.s_fs_info 0xffff880335e60400
s_fs_info = 0xffff8803357aaa00
crash> ext3_sb_info.s_groups_count 0xffff8803357aaa00
s_groups_count = 24429
24429 - that's a lot. This filesystem must be huge.
crash> ext3_sb_info.s_blocks_per_group 0xffff8803357aaa00
s_blocks_per_group = 32768
crash> pd 32768*24429
$8 = 800489472
So, assuming the common 4 KiB block size, that's about 3.2 TB.
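The size arithmetic can be checked directly. Note the 4 KiB block size is an assumption; the vmcore excerpt above does not show it explicitly:

```shell
groups=24429            # s_groups_count from the vmcore
blocks_per_group=32768  # s_blocks_per_group from the vmcore
block_size=4096         # assumed 4 KiB ext3 block size

blocks=$(( groups * blocks_per_group ))
echo "$blocks blocks"                              # 800489472
echo "$(( blocks * block_size / 1000000000 )) GB"  # ~3278 GB, about 3.2 TB
```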
Okay, so the problem here appears to be a scalability issue: it takes a long time to scan all the group descriptors for a filesystem this large. Had the hung_task_panic setting not been enabled, the scanning would have completed in about 3.5 minutes and the system would have continued, so disabling hung_task_panic will prevent unnecessary panics.
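The 3.5 minute figure can be sanity-checked by extrapolation: roughly 14432 of the 24429 groups had been scanned by the time the 120 second watchdog fired. This assumes a roughly constant scan rate, which may not hold exactly across the disk:

```shell
scanned=14432   # groups scanned so far (0x3860 from the stack frame)
total=24429     # s_groups_count
elapsed=120     # seconds before the hung-task watchdog fired

echo $(( total * elapsed / scanned ))   # ~203 seconds, about 3.4 minutes
```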
There's still the question of why the filesystem was scanning so many group descriptors while allocating space for an extended attribute. Is the filesystem full or nearly full? If it is, freeing up some disk space should avoid this problem. If it is not, please provide the output of the dumpe2fs command run against the underlying device, ideally with the filesystem unmounted so the output is consistent.
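Whether the filesystem is full can be checked quickly before resorting to dumpe2fs. A sketch; substitute the affected ext3 mount point for MOUNT:

```shell
MOUNT=/    # substitute the affected ext3 mount point

# Block usage -- near 100% use triggers exactly this kind of full scan
df -h "$MOUNT"

# Inode usage can also be exhausted while blocks remain free
df -i "$MOUNT"
```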
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.