Kernel panic due to blocked tasks while ext3 scans group descriptors
Environment
- Red Hat Enterprise Linux 6.5
- ext3
Issue
- System panicked with this stack trace:
INFO: task kjournald:1237 blocked for more than 120 seconds.
Not tainted 2.6.32-431.5.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D 0000000000000002 0 1237 2 0x00000080
ffff880339379d50 0000000000000046 ffff88033b279540 ffffffff8160bf40
0000000000000001 ffff88033fc23280 ffff880339379da0 ffffffff81060b13
ffff880335dcb058 ffff880339379fd8 000000000000fbc8 ffff880335dcb058
Call Trace:
[<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01916f1>] journal_commit_transaction+0x161/0x1310 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81084d0b>] ? try_to_del_timer_sync+0x7b/0xe0
[<ffffffffa0196fe8>] kjournald+0xe8/0x250 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0196f00>] ? kjournald+0x0/0x250 [jbd]
[<ffffffff8109aee6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ae50>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Resolution
- Had the hung_task_panic setting not been enabled, the scanning would have completed in about 3.5 minutes and the system would have continued. Disabling hung_task_panic will therefore prevent further unnecessary panics.
- If the filesystem is full (or nearly full), removing some files may prevent this problem.
- Splitting the ext3 filesystem into several smaller filesystems would reduce the time needed to scan all the group descriptors of any single filesystem.
- Converting from ext3 to ext4 would help, since the layout of the on-disk data structures for ext4 has been optimised to speed up metadata access.
- Converting the filesystem to XFS would also reduce the disk access time.
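As a sketch of the first resolution step, the hung-task watchdog behaviour is controlled through sysctl; the kernel's own hint in the log message shows how to disable the check entirely:

```shell
# Stop the hung-task watchdog from panicking the system (runtime change)
sysctl -w kernel.hung_task_panic=0

# Alternatively, disable the 120 s hung-task check entirely, as the
# log message itself suggests
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Persist the setting across reboots
echo "kernel.hung_task_panic = 0" >> /etc/sysctl.conf
```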
Root Cause
The problem here appears to be a scalability issue where it takes a long time to scan all the group descriptors for a filesystem this large. This blocked another task for more than two minutes and since the hung_task_panic setting was enabled the watchdog triggered a kernel panic.
The filesystem was scanning all of its group descriptors because it was unable to find free disk space. This suggests the filesystem was full.
Diagnostic Steps
Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/447028010/crash/vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Wed Dec 17 02:23:21 2014
UPTIME: 6 days, 22:44:11
LOAD AVERAGE: 7.16, 3.22, 1.77
TASKS: 299
NODENAME: <nodename>
RELEASE: 2.6.32-431.5.1.el6.x86_64
VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
MACHINE: x86_64 (2400 Mhz)
MEMORY: 12 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
PID: 58
COMMAND: "khungtaskd"
TASK: ffff880336115500 [THREAD_INFO: ffff880336116000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
The system panicked because a task was blocked in uninterruptible state for more than 2 minutes and the hung_task_panic setting was enabled.
INFO: task kjournald:1237 blocked for more than 120 seconds.
Not tainted 2.6.32-431.5.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D 0000000000000002 0 1237 2 0x00000080
ffff880339379d50 0000000000000046 ffff88033b279540 ffffffff8160bf40
0000000000000001 ffff88033fc23280 ffff880339379da0 ffffffff81060b13
ffff880335dcb058 ffff880339379fd8 000000000000fbc8 ffff880335dcb058
Call Trace:
[<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8109b5be>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa01916f1>] journal_commit_transaction+0x161/0x1310 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81084d0b>] ? try_to_del_timer_sync+0x7b/0xe0
[<ffffffffa0196fe8>] kjournald+0xe8/0x250 [jbd]
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0196f00>] ? kjournald+0x0/0x250 [jbd]
[<ffffffff8109aee6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ae50>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
crash> set 1237
PID: 1237
COMMAND: "kjournald"
TASK: ffff880335dcaaa0 [THREAD_INFO: ffff880339378000]
CPU: 2
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 1237 TASK: ffff880335dcaaa0 CPU: 2 COMMAND: "kjournald"
#0 [ffff880339379c90] schedule at ffffffff81527bd2
#1 [ffff880339379d58] journal_commit_transaction at ffffffffa01916f1 [jbd]
#2 [ffff880339379e68] kjournald at ffffffffa0196fe8 [jbd]
#3 [ffff880339379ee8] kthread at ffffffff8109aee6
#4 [ffff880339379f48] kernel_thread at ffffffff8100c20a
Where is it blocked?
crash> dis -rl ffffffffa01916f1
...
/usr/src/debug/kernel-2.6.32-431.5.1.el6/linux-2.6.32-431.5.1.el6.x86_64/fs/jbd/commit.c: 348
0xffffffffa01916ec <journal_commit_transaction+348>: callq 0xffffffff81527820 <schedule>
This corresponds to this code:
void journal_commit_transaction(journal_t *journal)
...
        while (commit_transaction->t_updates) {
                DEFINE_WAIT(wait);

                prepare_to_wait(&journal->j_wait_updates, &wait,
                                TASK_UNINTERRUPTIBLE);
                if (commit_transaction->t_updates) {
                        spin_unlock(&commit_transaction->t_handle_lock);
                        spin_unlock(&journal->j_state_lock);
                        schedule();    <----
                        spin_lock(&journal->j_state_lock);
                        spin_lock(&commit_transaction->t_handle_lock);
                }
                finish_wait(&journal->j_wait_updates, &wait);
        }
So it's waiting for the t_updates counter to drain to zero. This means another process still has the current running transaction open.
Get the transaction:
crash> dis -rl ffffffffa0196fe8
...
/usr/src/debug/kernel-2.6.32-431.5.1.el6/linux-2.6.32-431.5.1.el6.x86_64/fs/jbd/journal.c: 154
0xffffffffa0196fe0 <kjournald+224>: mov %rbx,%rdi
0xffffffffa0196fe3 <kjournald+227>: callq 0xffffffffa0191590 <journal_commit_transaction>
crash> dis journal_commit_transaction
0xffffffffa0191590 <journal_commit_transaction>: push %rbp
0xffffffffa0191591 <journal_commit_transaction+1>: mov %rsp,%rbp
0xffffffffa0191594 <journal_commit_transaction+4>: push %r15
0xffffffffa0191596 <journal_commit_transaction+6>: push %r14
0xffffffffa0191598 <journal_commit_transaction+8>: push %r13
0xffffffffa019159a <journal_commit_transaction+10>: push %r12
0xffffffffa019159c <journal_commit_transaction+12>: push %rbx
...
crash> bt -f
PID: 1237 TASK: ffff880335dcaaa0 CPU: 2 COMMAND: "kjournald"
...
#1 [ffff880339379d58] journal_commit_transaction at ffffffffa01916f1 [jbd]
ffff880339379d60: 0000000000000001 ffff880028316840
ffff880339379d70: 0000000000000000 0000000000000000
ffff880339379d80: 0000000000016840 ffff8803378e9000
ffff880339379d90: 0000000000000000 ffff880335e22780
ffff880339379da0: ffff880339379e60 ffff88033453ac24
ffff880339379db0: ffff88033453ac00 ffff880335dcaaa0
ffff880339379dc0: 0000000000000000 ffff880335dcaaa0
ffff880339379dd0: ffffffff8109b290 ffff88033453ace8
ffff880339379de0: ffff88033453ace8 00000000ffffffff
ffff880339379df0: ffff88033453ad90 0000000000000286
ffff880339379e00: ffff880339379e40 ffffffff81084d0b
ffff880339379e10: 0000000000000001 0000000000000000
ffff880339379e20: ffff880339379e60 00000000c39320a5
ffff880339379e30: ffff88033453ad90 ffff88033453ac00
ffff880339379e40: ffff88033453ac24 ffff88033453ad90
ffff880339379e50: ffff880335dcaaa0 ffff88033453ac98
ffff880339379e60: ffff880339379ee0 ffffffffa0196fe8
#2 [ffff880339379e68] kjournald at ffffffffa0196fe8 [jbd]
...
crash> journal_t.j_running_transaction ffff88033453ac00
j_running_transaction = 0xffff88033688e800
crash> transaction_t.t_updates 0xffff88033688e800
t_updates = 1
There's one reference still on this transaction. Find out who it is:
crash> foreach task > fe-task
crash> grep "journal_info = 0x[^0]" fe-task
journal_info = 0xffff880153485a38,
journal_info = 0xffff8802e0ebec78,
journal_info = 0xffff880153485840,
journal_info = 0xffff880153485a50,
journal_info = 0xffff880153485a08,
Checking each of these handles, this is the only one attached to the running transaction (0xffff88033688e800):
crash> handle_t 0xffff880153485840
struct handle_t {
h_transaction = 0xffff88033688e800,
h_buffer_credits = 20,
h_ref = 1,
h_err = 0,
h_sync = 0,
h_jdata = 0,
h_aborted = 0
}
So it's waiting for this process:
crash> set 2092
PID: 2092
COMMAND: "nfsd"
TASK: ffff880339bbcae0 [THREAD_INFO: ffff880337384000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
#0 [ffff880337385670] schedule at ffffffff81527bd2
#1 [ffff880337385738] io_schedule at ffffffff815283b3
#2 [ffff880337385758] sync_buffer at ffffffff811bf150
#3 [ffff880337385768] __wait_on_bit at ffffffff81528e7f
#4 [ffff8803373857b8] out_of_line_wait_on_bit at ffffffff81528f28
#5 [ffff880337385828] __wait_on_buffer at ffffffff811bf106
#6 [ffff880337385838] bh_submit_read at ffffffff811bfccd
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
#9 [ffff880337385978] ext3_new_block at ffffffffa01ae9c1 [ext3]
#10 [ffff880337385998] ext3_xattr_block_set at ffffffffa01ccfab [ext3]
#11 [ffff880337385a18] ext3_xattr_set_handle at ffffffffa01cd748 [ext3]
#12 [ffff880337385b08] ext3_set_acl at ffffffffa01cdf97 [ext3]
#13 [ffff880337385b78] ext3_init_acl at ffffffffa01ce854 [ext3]
#14 [ffff880337385bc8] ext3_new_inode at ffffffffa01afea4 [ext3]
#15 [ffff880337385c78] ext3_mkdir at ffffffffa01b8eb0 [ext3]
#16 [ffff880337385ce8] vfs_mkdir at ffffffff81198129
#17 [ffff880337385d28] nfsd_create at ffffffffa02d5b80 [nfsd]
#18 [ffff880337385d88] nfsd3_proc_mkdir at ffffffffa02dd913 [nfsd]
#19 [ffff880337385dd8] nfsd_dispatch at ffffffffa02ce425 [nfsd]
#20 [ffff880337385e18] svc_process_common at ffffffffa0250844 [sunrpc]
#21 [ffff880337385e98] svc_process at ffffffffa0250e80 [sunrpc]
#22 [ffff880337385eb8] nfsd at ffffffffa02ceb52 [nfsd]
#23 [ffff880337385ee8] kthread at ffffffff8109aee6
#24 [ffff880337385f48] kernel_thread at ffffffff8100c20a
How long has this task been sleeping?
crash> task -R sched_info
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
sched_info = {
pcount = 7873195,
run_delay = 34228899704,
last_arrival = 600711814084122,
last_queued = 0,
bkl_count = 0
},
crash> runq -t
CPU 0: 600711839147485
600711838570999 PID: 1687 TASK: ffff8803357e5540 COMMAND: "rsyslogd"
CPU 1: 600711838586859
600711838200610 PID: 58 TASK: ffff880336115500 COMMAND: "khungtaskd"
CPU 2: 600711838845660
000000000000000 PID: 0 TASK: ffff88033b282040 COMMAND: "swapper"
CPU 3: 600711839162546
000000000000000 PID: 0 TASK: ffff88033b2c0aa0 COMMAND: "swapper"
crash> pd (600711838586859-600711814084122)
$4 = 24502737
crash> pd (600711838586859-600711814084122)/1000000000
$5 = 0
Less than a second - about 24.5 ms since the task last ran on a CPU, so it is not stuck in one long sleep. It's going to take a bit more work to figure this one out.
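The same arithmetic can be repeated outside of crash; the scheduler clock values in the vmcore are in nanoseconds:

```shell
# sched_clock values from the vmcore are in nanoseconds
now=600711838586859            # runqueue clock on CPU 1
last_arrival=600711814084122   # sched_info.last_arrival for the nfsd task

echo $(( now - last_arrival ))              # nanoseconds since it last ran
echo $(( (now - last_arrival) / 1000000 ))  # ~24 milliseconds
```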
crash> bt
PID: 2092 TASK: ffff880339bbcae0 CPU: 1 COMMAND: "nfsd"
#0 [ffff880337385670] schedule at ffffffff81527bd2
#1 [ffff880337385738] io_schedule at ffffffff815283b3
#2 [ffff880337385758] sync_buffer at ffffffff811bf150
#3 [ffff880337385768] __wait_on_bit at ffffffff81528e7f
#4 [ffff8803373857b8] out_of_line_wait_on_bit at ffffffff81528f28
#5 [ffff880337385828] __wait_on_buffer at ffffffff811bf106
#6 [ffff880337385838] bh_submit_read at ffffffff811bfccd
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
#9 [ffff880337385978] ext3_new_block at ffffffffa01ae9c1 [ext3]
#10 [ffff880337385998] ext3_xattr_block_set at ffffffffa01ccfab [ext3]
#11 [ffff880337385a18] ext3_xattr_set_handle at ffffffffa01cd748 [ext3]
#12 [ffff880337385b08] ext3_set_acl at ffffffffa01cdf97 [ext3]
#13 [ffff880337385b78] ext3_init_acl at ffffffffa01ce854 [ext3]
#14 [ffff880337385bc8] ext3_new_inode at ffffffffa01afea4 [ext3]
#15 [ffff880337385c78] ext3_mkdir at ffffffffa01b8eb0 [ext3]
#16 [ffff880337385ce8] vfs_mkdir at ffffffff81198129
#17 [ffff880337385d28] nfsd_create at ffffffffa02d5b80 [nfsd]
#18 [ffff880337385d88] nfsd3_proc_mkdir at ffffffffa02dd913 [nfsd]
#19 [ffff880337385dd8] nfsd_dispatch at ffffffffa02ce425 [nfsd]
#20 [ffff880337385e18] svc_process_common at ffffffffa0250844 [sunrpc]
#21 [ffff880337385e98] svc_process at ffffffffa0250e80 [sunrpc]
#22 [ffff880337385eb8] nfsd at ffffffffa02ceb52 [nfsd]
#23 [ffff880337385ee8] kthread at ffffffff8109aee6
#24 [ffff880337385f48] kernel_thread at ffffffff8100c20a
In ext3_new_blocks() it potentially needs to synchronously scan all the group descriptors which could take a long time. How many has it scanned so far?
#7 [ffff880337385858] read_block_bitmap at ffffffffa01ac2cd [ext3]
ffff880337385860: 0000000000000202 000000000000385f
ffff880337385870: 0000000000003860 ffff880326808c00
ffff880337385880: 0000000000003860 0000000000002966
ffff880337385890: ffff880337385970 ffffffffa01ae57f
#8 [ffff880337385898] ext3_new_blocks at ffffffffa01ae57f [ext3]
crash> pd 0x3860
$7 = 14432
How many group descriptors does the filesystem have?
crash> inode.i_sb ffff88013f6e4d98
i_sb = 0xffff880335e60400
crash> super_block.s_fs_info 0xffff880335e60400
s_fs_info = 0xffff8803357aaa00
crash> ext3_sb_info.s_groups_count 0xffff8803357aaa00
s_groups_count = 24429
24429 - that's a lot. This filesystem must be huge.
crash> ext3_sb_info.s_blocks_per_group 0xffff8803357aaa00
s_blocks_per_group = 32768
crash> pd 32768*24429
$8 = 800489472
So, assuming the common 4 KiB block size, that's about 3.2 TB.
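The size arithmetic can be checked directly. Note the 4 KiB block size is an assumption; the vmcore excerpt above does not show it explicitly:

```shell
groups=24429            # s_groups_count from the vmcore
blocks_per_group=32768  # s_blocks_per_group from the vmcore
block_size=4096         # assumed 4 KiB ext3 block size

blocks=$(( groups * blocks_per_group ))
echo "$blocks blocks"                              # 800489472
echo "$(( blocks * block_size / 1000000000 )) GB"  # ~3278 GB, about 3.2 TB
```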
Okay, so the problem here appears to be a scalability issue: it takes a long time to scan all the group descriptors for a filesystem this large. Had the hung_task_panic setting not been enabled, the scanning would have completed in about 3.5 minutes and the system would have continued, so disabling hung_task_panic will prevent unnecessary panics.
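The 3.5 minute figure can be sanity-checked by extrapolation: roughly 14432 of the 24429 groups had been scanned by the time the 120 second watchdog fired. This assumes a roughly constant scan rate, which may not hold exactly across the disk:

```shell
scanned=14432   # groups scanned so far (0x3860 from the stack frame)
total=24429     # s_groups_count
elapsed=120     # seconds before the hung-task watchdog fired

echo $(( total * elapsed / scanned ))   # ~203 seconds, about 3.4 minutes
```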
There's still the question of why the filesystem was scanning so many group descriptors while allocating space for an extended attribute. Is the filesystem full or nearly full? If it is, freeing up some disk space should avoid this problem. If it is not, please provide the output of the dumpe2fs command run against the underlying device, ideally with the filesystem unmounted so the output is consistent.
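Whether the filesystem is full can be checked quickly before resorting to dumpe2fs. A sketch; substitute the affected ext3 mount point for MOUNT:

```shell
MOUNT=/    # substitute the affected ext3 mount point

# Block usage -- near 100% use triggers exactly this kind of full scan
df -h "$MOUNT"

# Inode usage can also be exhausted while blocks remain free
df -i "$MOUNT"
```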
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.