Processes hang and the system crashes because the NFS server is not responding
Environment
- Red Hat Enterprise Linux 5, 6, 7, 8, 9
Issue
- An unresponsive NetApp NFS server causes processes to hang on the client (RHEL) machine.
Resolution
- The NFS server is a NetApp filer. Applications on the client machine entered the D (uninterruptible sleep) state because the NFS server did not respond in time.
- Check with the NFS server vendor (NetApp) to find out why the server did not respond.
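As an illustration of the D state mentioned above, the following is a minimal sketch of how tasks in uninterruptible sleep could be listed on a live client by reading `/proc/<pid>/stat`. The function names `parse_stat_line` and `uninterruptible_tasks` are hypothetical helpers, not part of any Red Hat tool, and the sketch only looks at thread-group leaders, not at individual threads.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    left, right = line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs available at this path
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/sys
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable during the scan
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A process that stays in this list across repeated runs, on a client with NFS mounts, is a candidate for the kind of hang described here. A process that appears only briefly is usually just doing ordinary I/O.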
Root Cause
- The client dumped a vmcore because 'hung_task_panic' was set. Analysis of the vmcore revealed the following:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-308.el5/vmlinux
DUMPFILE: /cores/retrace/tasks/595400495/crash/vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Tue Jan 21 05:13:02 2014
UPTIME: 111 days, 01:12:13
LOAD AVERAGE: 17.15, 10.18, 7.34
TASKS: 442
NODENAME: <hostname>
RELEASE: 2.6.18-308.el5
VERSION: #1 SMP Fri Jan 27 17:17:51 EST 2012
MACHINE: x86_64 (2333 Mhz)
MEMORY: 15.8 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
- The kernel log indicates that the system panicked because a task had been blocked for more than 120 seconds.
INFO: task perl:31376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
perl D ffff810021036b20 0 31376 21705 31683 (NOTLB)
ffff8100562a5da8 0000000000000082 0000000000000000 ffffffff8892f9f8
0000000000000050 0000000000000008 ffff81040ecdc0c0 ffff81042feae080
00220573c8112272 0000000000000806 ffff81040ecdc2a8 00000006a7d3f978
Call Trace:
[<ffffffff8892f9f8>] :sunrpc:xprt_reserve+0x14d/0x161
[<ffffffff8006ecd9>] do_gettimeofday+0x40/0x90
[<ffffffff88a1e40d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
[<ffffffff800637de>] io_schedule+0x3f/0x67
[<ffffffff88a1e416>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
[<ffffffff80063a0a>] __wait_on_bit+0x40/0x6e
[<ffffffff88a1e40d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
[<ffffffff80063aa4>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a3472>] wake_bit_function+0x0/0x23
[<ffffffff88a23139>] :nfs:nfs_sync_inode_wait+0xe0/0x2d4
[<ffffffff88a18ecd>] :nfs:nfs_do_fsync+0x22/0x42
[<ffffffff80023c7b>] filp_close+0x36/0x64
[<ffffffff8001e103>] sys_close+0x88/0xbd
[<ffffffff800614b5>] sysenter_do_call+0x1e/0x76
Kernel panic - not syncing: hung_task: blocked tasks
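Log excerpts like the one above can be summarized mechanically. The sketch below is a hypothetical helper, not an existing tool. It extracts the blocked task from the "blocked for more than N seconds" message and lists the kernel modules that appear in the call trace. It assumes the RHEL 5 trace format shown here, in which module frames are written as `:module:symbol+offset`; later kernels print module names differently.

```python
import re

# Matches the hung task watchdog message, e.g.
# "INFO: task perl:31376 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Matches a RHEL 5 style frame, e.g.
# "[<ffffffff8892f9f8>] :sunrpc:xprt_reserve+0x14d/0x161"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)\+0x"
)


def find_hung_tasks(log_text):
    """Return a list of (comm, pid, seconds) for each hung task report."""
    results = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            results.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
            )
    return results


def trace_modules(log_text):
    """Return module names seen in call trace frames, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group("module") and m.group("module") not in seen:
            seen.append(m.group("module"))
    return seen
```

For the trace above, the modules reported would be `sunrpc` and `nfs`, which is the first hint that the blocked task was waiting on NFS I/O.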
- This task was closing a file, which flushes its data to the NFS server, and was waiting for all outstanding requests to complete.
crash> set 31376
crash> bt 31376
PID: 31376 TASK: ffff81040ecdc0c0 CPU: 6 COMMAND: "perl"
#0 [ffff8100562a5cd8] schedule at ffffffff80062fa0
#1 [ffff8100562a5db0] io_schedule at ffffffff800637de
#2 [ffff8100562a5dd0] nfs_wait_bit_uninterruptible at ffffffff88a1e416 [nfs]
#3 [ffff8100562a5de0] __wait_on_bit at ffffffff80063a0a
#4 [ffff8100562a5e20] out_of_line_wait_on_bit at ffffffff80063aa4
#5 [ffff8100562a5e90] nfs_sync_inode_wait at ffffffff88a23139 [nfs]
#6 [ffff8100562a5f20] nfs_do_fsync at ffffffff88a18ecd [nfs]
#7 [ffff8100562a5f40] filp_close at ffffffff80023c7b
#8 [ffff8100562a5f60] sys_close at ffffffff8001e103
#9 [ffff8100562a5f80] sysenter_do_call at ffffffff800614b5
RIP: 00000000ffffe410 RSP: 00000000ffb78eac RFLAGS: 00000296
RAX: ffffffffffffffda RBX: ffffffff800614b5 RCX: 0000000000000000
RDX: 0000000000000013 RSI: 000000000b2064b0 RDI: 0000000000000000
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffb78ed8
R13: 0000000000000013 R14: ffffffff8001e103 R15: ffff810122816180
ORIG_RAX: 0000000000000006 CS: 0023 SS: 002b
- The NFS server does not appear to have been responding properly. This can happen if the server is heavily loaded at the time of the problem. There are no log messages indicating a loss of connectivity with the NFS server.
- The panic appears to have occurred because the hung task watchdog detected that a task had been blocked for more than 120 seconds while the hung_task_panic setting was enabled. The blocked task was waiting for a response from the NFS server. No messages indicate that the NFS client lost its connection to the server, which suggests the server was too busy to respond before the hung task watchdog fired.
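The two settings involved in this panic, `hung_task_panic` and `hung_task_timeout_secs`, are exposed under `/proc/sys/kernel`. The following is a minimal sketch for inspecting them on a running client. The helper names are hypothetical, and the sketch only reads the values; whether to change them is a separate decision that this article does not address.

```python
import os


def read_hung_task_settings(sysctl_root="/proc/sys/kernel"):
    """Read the hung task watchdog settings; missing values become None."""
    settings = {}
    for name in ("hung_task_panic", "hung_task_timeout_secs"):
        try:
            with open(os.path.join(sysctl_root, name)) as fh:
                settings[name] = int(fh.read().strip())
        except (OSError, ValueError):
            settings[name] = None  # not present or not readable
    return settings


def will_panic_on_hung_task(settings):
    """True if a blocked task can trigger a panic with these settings.

    A timeout of 0 disables the watchdog, so no panic can result from it
    even when hung_task_panic is set.
    """
    return bool(settings.get("hung_task_panic")) and bool(
        settings.get("hung_task_timeout_secs")
    )


if __name__ == "__main__":
    current = read_hung_task_settings()
    print(current)
    print("panic on hung task:", will_panic_on_hung_task(current))
```

On the system analyzed here, the equivalent values would have been `hung_task_panic` set to 1 and a timeout of 120 seconds, which is why a single slow NFS response was enough to bring the client down.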
Diagnostic Steps
- Checked the vmcore to identify the blocked process and where it was waiting.
- Checked the client (RHEL) logs for any sign of a network disconnect from the server.
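The log check in the second step could be automated along the following lines. This is a sketch under stated assumptions: it looks for the NFS client's "nfs: server ... not responding" and "nfs: server ... OK" kernel messages, and the function name `nfs_server_events` is a hypothetical helper. In the case described here, no such messages were present.

```python
import re

# Kernel messages logged by the NFS client when a server stops or
# resumes responding.
NFS_MSG_RE = re.compile(
    r"nfs: server (?P<server>\S+) (?P<event>not responding|OK)"
)


def nfs_server_events(log_text):
    """Return (server, event) tuples in the order they appear in the log."""
    events = []
    for line in log_text.splitlines():
        m = NFS_MSG_RE.search(line)
        if m:
            events.append((m.group("server"), m.group("event")))
    return events


if __name__ == "__main__":
    # Hypothetical sample lines; "filer01" is an invented server name.
    sample = (
        "kernel: nfs: server filer01 not responding, still trying\n"
        "kernel: nfs: server filer01 OK\n"
    )
    for server, event in nfs_server_events(sample):
        print(server, event)
```

An empty result, combined with a task blocked in NFS code, is consistent with the conclusion above that the server was slow rather than unreachable.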