NFS remove operations taking a long time and tripping hung_task backtraces and/or hung_task panics
Environment
- Red Hat Enterprise Linux 6, 7, 8, 9 (NFS client)
- seen with NetApp filer (NFS server) and large file removes
Issue
- Experiencing hung_task panics or generally slow NFS remove operations
- Removing an NFS file is slow with a NetApp NFS server
Resolution
- Contact the NFS server vendor for performance analysis and/or tuning.
Workarounds
- Ensure any NFS client operation is tolerant of NFS REMOVE operations which may take a very long time.
- Throttle the application removing files so that a large burst of NFS REMOVE operations is not sent to the NFS server at the same time.
- If hung_task_panic or hung_task checks are used, configure hung_task_timeout_secs to a value larger than the default of 120 seconds. Choose a value believed to be: a) greater than the worst-case time for a single NFS REMOVE operation, and b) greater than the sum of times for simultaneous removes in a single NFS directory (processes removing files at the same time in the same directory wait in sequence for the directory inode mutex, so they may trip the hung task panic).
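A minimal sketch of the throttling workaround, assuming a cleanup job that removes files in a loop (the batch size, delay, and `scratch/` directory are illustrative stand-ins, not recommendations):

```shell
# Hypothetical throttle: remove files in small batches with a pause between
# batches so the NFS server is not flooded with simultaneous REMOVE calls.
# Batch size and delay are arbitrary; tune them against observed latency.
mkdir -p scratch && touch scratch/a scratch/b scratch/c   # stand-in files
n=0
for f in scratch/*; do
    rm -f -- "$f"
    n=$((n + 1))
    [ $((n % 50)) -eq 0 ] && sleep 1   # pause after every 50 removes
done
echo "removed $n files"
```

Serializing the removes through one loop also avoids many threads queueing on the same directory inode mutex.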
Root Cause
- File remove operations may require a significant amount of background work from an NFS server, so they may take an extended period of time.
- This may be seen with multi-threaded cleanup jobs, which remove a lot of files in parallel in a short period of time.
Diagnostic Steps
- Gather a tcpdump while NFS REMOVE operations are taking place. Process the capture with tshark or another tool that shows the latency of NFS operations, looking for either: a) a large number of NFS REMOVE operations, or b) NFS REMOVE operations with a large maximum or average latency (with tshark these appear in the "Max RTT" / "Max SRT" and "Avg RTT" / "Avg SRT" columns). For example, using How to use tshark to generate statistics about RPC calls on a Red Hat Enterprise Linux 5 system:
$ tshark -q -z rpc,rtt,100003,3 -r tcpdump.cap > /tmp/tcpdump-rpc-stats.txt
$ egrep '(^REMOVE|Procedure)' /tmp/tcpdump-rpc-stats.txt
Procedure Calls Min RTT Max RTT Avg RTT
REMOVE 197 3.00012 129.059 35.0018
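High-latency procedures can also be pulled out of the stats file programmatically. A hypothetical sketch (the 1-second threshold and the `stats.txt` sample are illustrative; the column layout follows the tshark output above):

```shell
# Flag NFS procedures whose average RTT exceeds a threshold (in seconds)
# in tshark "rpc,rtt" output; columns: Procedure, Calls, Min/Max/Avg RTT.
# stats.txt stands in for /tmp/tcpdump-rpc-stats.txt.
cat > stats.txt <<'EOF'
Procedure Calls Min RTT Max RTT Avg RTT
REMOVE 197 3.00012 129.059 35.0018
EOF
awk -v limit=1.0 '$1 != "Procedure" && $NF+0 > limit { print $1, "avg RTT", $NF, "s" }' stats.txt
```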
- On the NFS client, configure hung_task_panic to capture a vmcore when any task is blocked for a significant amount of time. For more information on configuring hung_task_panic, see How do I use hung task check?.
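As a sketch, assuming kdump is already configured to write a vmcore on panic, the hung_task sysctls might be set as follows (values are illustrative):

```shell
# Illustrative values only: panic (so kdump can capture a vmcore) when a
# task stays in uninterruptible sleep longer than the timeout.
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.hung_task_timeout_secs=300
# To persist across reboots (RHEL 7 and later):
#   echo 'kernel.hung_task_panic = 1' >> /etc/sysctl.d/90-hung-task.conf
```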
Detailed analysis (from vmcore)
- Analyze the vmcore for evidence that the hung_task timer tripped because of a slow NFS REMOVE operation. Note that the task which trips the hung_task timeout may not be the one directly waiting on the NFS REMOVE; it may be "waiting in line" for a resource (such as a directory inode mutex) held by another process which is ultimately waiting for the NFS REMOVE to complete.
crash> sys | grep PANIC
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
crash> sys | grep RELEASE
RELEASE: 2.6.32-358.12.1.el6.x86_64
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $4 = 300
crash> log | tail -40
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver02 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 returned a bad sequence-id error!
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver05 does not accept raw uid/gids. Reenabling the idmapper.
nfs: server nfsserver03 not responding, still trying
INFO: task pmserver:8188 blocked for more than 300 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pmserver D 0000000000000015 0 8188 7181 0x00000080
ffff880f1ca77d38 0000000000000082 0000000000000000 0000000000000001
ffff880f1ca77cc8 ffffffffa039e4c2 ffff880b77e56026 ffff880f1ca77e28
ffff880cda521058 ffff880f1ca77fd8 000000000000fb88 ffff880cda521058
Call Trace:
[<ffffffffa039e4c2>] ? nfs_permission+0xb2/0x1e0 [nfs]
[<ffffffff8150f2ae>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8150f14b>] mutex_lock+0x2b/0x50
[<ffffffff81192b77>] do_filp_open+0x2d7/0xdc0
[<ffffffff8113deb0>] ? sys_madvise+0x350/0x7b0
[<ffffffff8119f622>] ? alloc_fd+0x92/0x160
[<ffffffff8117dfb9>] do_sys_open+0x69/0x140
[<ffffffff8100c535>] ? math_state_restore+0x45/0x60
[<ffffffff8117e0d0>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 257, comm: khungtaskd Not tainted 2.6.32-358.12.1.el6.x86_64 #1
Call Trace:
[<ffffffff8150d538>] ? panic+0xa7/0x16f
[<ffffffff810e0ae7>] ? watchdog+0x217/0x220
[<ffffffff810e08d0>] ? watchdog+0x0/0x220
[<ffffffff81096936>] ? kthread+0x96/0xa0
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
crash> set 8188
PID: 8188
COMMAND: "pmserver"
TASK: ffff880cda520aa0 [THREAD_INFO: ffff880f1ca76000]
CPU: 21
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 8188 TASK: ffff880cda520aa0 CPU: 21 COMMAND: "pmserver"
#0 [ffff880f1ca77c78] schedule at ffffffff8150dc02
#1 [ffff880f1ca77d40] __mutex_lock_slowpath at ffffffff8150f2ae
#2 [ffff880f1ca77db0] mutex_lock at ffffffff8150f14b
#3 [ffff880f1ca77dd0] do_filp_open at ffffffff81192b77
#4 [ffff880f1ca77f20] do_sys_open at ffffffff8117dfb9
#5 [ffff880f1ca77f70] sys_open at ffffffff8117e0d0
#6 [ffff880f1ca77f80] system_call_fastpath at ffffffff8100b072
RIP: 00007fe3273e3d8d RSP: 00007fe274df9c68 RFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffff8100b072 RCX: 0000000000000000
RDX: 00000000000001b6 RSI: 0000000000000442 RDI: 00007fe258006130
RBP: 0000000000000000 R8: 00007fe258000088 R9: 0000000000000001
R10: 000000000000006f R11: 0000000000000293 R12: ffffffff8117e0d0
R13: ffff880f1ca77f78 R14: 00007fe274df9d30 R15: 00007fe274df9c90
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
Get the mutex and the owner.
crash> dis -lr ffffffff8150f14b | tail
0xffffffff8150f136 <mutex_lock+22>: mov %rdi,%rbx
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/kernel/mutex.c: 94
0xffffffff8150f139 <mutex_lock+25>: callq 0xffffffff8150e5e0 <_cond_resched>
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/kernel/mutex.c: 99
0xffffffff8150f13e <mutex_lock+30>: mov %rbx,%rdi <---- mutex *
0xffffffff8150f141 <mutex_lock+33>: lock decl (%rdi)
0xffffffff8150f144 <mutex_lock+36>: jns 0xffffffff8150f14b <mutex_lock+43>
0xffffffff8150f146 <mutex_lock+38>: callq 0xffffffff8150f170 <__mutex_lock_slowpath>
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/arch/x86/include/asm/thread_info.h: 216
0xffffffff8150f14b <mutex_lock+43>: mov %gs:0xcbc8,%rax
kernel/mutex.c
92 void __sched mutex_lock(struct mutex *lock)
93 {
94 might_sleep();
95 /*
96 * The locking fastpath is the 1->0 transition from
97 * 'unlocked' into 'locked' state.
98 */
99--> __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
100 mutex_set_owner(lock);
101 }
crash> struct -o mutex | grep count
[0] atomic_t count;
crash> dis __mutex_lock_slowpath | head
0xffffffff8150f170 <__mutex_lock_slowpath>: push %rbp
0xffffffff8150f171 <__mutex_lock_slowpath+1>: mov %rsp,%rbp
0xffffffff8150f174 <__mutex_lock_slowpath+4>: push %r15
0xffffffff8150f176 <__mutex_lock_slowpath+6>: push %r14
0xffffffff8150f178 <__mutex_lock_slowpath+8>: push %r13
0xffffffff8150f17a <__mutex_lock_slowpath+10>: push %r12
0xffffffff8150f17c <__mutex_lock_slowpath+12>: push %rbx <------ mutex * 6th push after RA
crash> bt | grep mutex_lock
#1 [ffff880f1ca77d40] __mutex_lock_slowpath at ffffffff8150f2ae
#2 [ffff880f1ca77db0] mutex_lock at ffffffff8150f14b
crash> px (0xffff880f1ca77db0 - 6*8)
$2 = 0xffff880f1ca77d80
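The stack-slot arithmetic above can be sanity-checked outside crash with ordinary shell arithmetic (addresses are from this example vmcore):

```shell
# Six pushed registers of 8 bytes each sit below the __mutex_lock_slowpath
# frame entry, so the saved mutex pointer lives 0x30 bytes below the frame
# address 0xffff880f1ca77db0 reported by bt.
printf '0x%x\n' $(( 0xffff880f1ca77db0 - 6*8 ))
```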
crash> rd 0xffff880f1ca77d80
ffff880f1ca77d80: ffff88000f4949d8 .II.....
crash> mutex.owner ffff88000f4949d8
owner = 0xffff88099c3de000
crash> thread_info.task 0xffff88099c3de000
task = 0xffff88091c238040
crash> bt 0xffff88091c238040
PID: 7217 TASK: ffff88091c238040 CPU: 9 COMMAND: "pmserver"
#0 [ffff88099c3dfa68] schedule at ffffffff8150dc02
#1 [ffff88099c3dfb30] rpc_wait_bit_killable at ffffffffa031d4e2 [sunrpc]
#2 [ffff88099c3dfb40] __wait_on_bit at ffffffff8150ed9f
#3 [ffff88099c3dfb90] out_of_line_wait_on_bit at ffffffff8150ee48
#4 [ffff88099c3dfc00] __rpc_execute at ffffffffa031da25 [sunrpc]
#5 [ffff88099c3dfc70] rpc_execute at ffffffffa031dce1 [sunrpc]
#6 [ffff88099c3dfca0] rpc_run_task at ffffffffa03143a5 [sunrpc]
#7 [ffff88099c3dfcc0] rpc_call_sync at ffffffffa03144c2 [sunrpc]
#8 [ffff88099c3dfd20] nfs3_rpc_wrapper.clone.0 at ffffffffa03b5f7d [nfs]
#9 [ffff88099c3dfd60] nfs3_proc_remove at ffffffffa03b65ab [nfs]
#10 [ffff88099c3dfe20] nfs_unlink at ffffffffa039ea98 [nfs]
#11 [ffff88099c3dfe70] vfs_unlink at ffffffff8118fb5f
#12 [ffff88099c3dfea0] do_unlinkat at ffffffff81191ee5
#13 [ffff88099c3dff70] sys_unlink at ffffffff81191fb6
#14 [ffff88099c3dff80] system_call_fastpath at ffffffff8100b072
RIP: 00007fe31e4cc7d7 RSP: 00007fe2997fa8d0 RFLAGS: 00000246
RAX: 0000000000000057 RBX: ffffffff8100b072 RCX: 00007fe31cf2b428
RDX: 000000000000006c RSI: 00007fe27883a870 RDI: 00007fe27883a870
RBP: 00007fe2781f7fb0 R8: 00007fe278000088 R9: 00007fe2997faa01
R10: 000000000000006d R11: 0000000000000202 R12: ffffffff81191fb6
R13: ffff88099c3dff78 R14: 00007fe290cd0ac0 R15: 0000000000000001
ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b
Get the rpc_task.
crash> dis -r ffffffffa03193a5 | tail -4
0xffffffffa0319398 <rpc_run_task+104>: lock incl (%r12)
0xffffffffa031939d <rpc_run_task+109>: mov %r12,%rdi <--- rpc_task *
0xffffffffa03193a0 <rpc_run_task+112>: callq 0xffffffffa0322c80 <rpc_execute>
0xffffffffa03193a5 <rpc_run_task+117>: pop %rbx
crash> dis rpc_execute | head
0xffffffffa0322c80 <rpc_execute>: push %rbp
0xffffffffa0322c81 <rpc_execute+1>: mov %rsp,%rbp <--- rbp from %rsp
0xffffffffa0322c84 <rpc_execute+4>: sub $0x20,%rsp
0xffffffffa0322c88 <rpc_execute+8>: mov %rbx,-0x18(%rbp)
0xffffffffa0322c8c <rpc_execute+12>: mov %r12,-0x10(%rbp) <---- rpc_task *
crash> set 7217
PID: 7217
COMMAND: "pmserver"
TASK: ffff88091c238040 [THREAD_INFO: ffff88099c3de000]
CPU: 9
STATE: TASK_UNINTERRUPTIBLE|TASK_WAKEKILL
crash> bt | grep rpc_run_task
#6 [ffff88099c3dfca0] rpc_run_task at ffffffffa03143a5 [sunrpc]
crash> px (0xffff88099c3dfca0 - 1*8 - 0x10)
$3 = 0xffff88099c3dfc88
crash> rd 0xffff88099c3dfc88
ffff88099c3dfc88: ffff88100bc6de80 ........
crash> source ~/crash/nfs/rhel6-rpc-show-tasks
crash> rhel-rpc-task-summary 0xffff88100bc6de80
(struct rpc_task *) 0xffff88100bc6de80
rpc_task.tk_rqstp = (struct rpc_rqst *)0xffff881011b76e00 rpc_rqst.rq_xprt = (struct rpc_xprt *)0xffff880810874800
rpc_rqst: rq_xid = 0xb4c69fa9
rpc_rqst: rq_retries = 1, rq_ntrans = 6, rq_bytes_sent = 0, (last_jiffies_update - rq_xtime)/10^9 = 13 seconds
(last_jiffies_update - tk_start)/10^9 = 433 seconds, tk_start = 1201721962889157, tk_timeouts = 2
tk_owner = 7182, tk_flags = 0x8a0, tk_status = 0, tk_runstate = 6, tk_timeout = 120000
tk_msg.rpc_proc->pname = REMOVE
(struct nfs_removeargs *)0xffff88099c3dfdb8 (struct nfs_removeres *)0xffff88099c3dfd68
tk_action = 0xffffffffa0313410, &u.tk_work = 0xffff88100bc6df08, tk_workqueue = (nil)
tk_priority = 1, tk_waitqueue = 0xffff880810874a58 tk_waitqueue.name = xprt_pending
rpc_task.tk_client = 0xffff880811fad800 rpc_task.tk_client->cl_server = nfsserver03
The rq_ntrans value of 6 indicates it's been transmitted 6 times. The last
time it was transmitted was 13 seconds ago. The total time this NFS REMOVE
has been outstanding is 433 seconds, which exceeds sysctl_hung_task_timeout_secs.
The tk_timeouts value of 2 indicates the task has encountered 2 major
timeouts, hence the "not responding, still trying" messages.
crash> log | grep nfsserver03
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
nfs: server nfsserver03 not responding, still trying
crash> source ~/crash/nfs/rhel6-rpc-xprt-summary
crash> rhel6-rpc-xprt-summary 0xffff880810874800
RPC Transport State Summary for (struct rpc_xprt *)0xffff880810874800
=====================================================================
rpc_xprt.state = 0x12 (For details, use 'rhel-rpc-xprt-states')
- OK: rpc_xprt.state bit 2 set (XPRT_CONNECTED)
rpc_xprt.prot = 6 (TCP)
- sock.__sk_common.skc_state = 1 (See 'rhel-tcp-states' for TCP state list)
- OK: skc_state == 1 (TCP_ESTABLISHED)
RPC Transport other structures (struct rpc_xprt *)0xffff880810874800
=====================================================================
xprt.inet = (struct sock *)0xffff880812af4740
xprt.sock = (struct socket *)0xffff8808111a26c0
xprt.sock = (struct tcp_sock *)0xffff880812af4740
rpc_xprt.xprt_net = (struct net *)0xffffffff81b16580
Slot table free list: (struct list_head *) 0xffff880810874be8
Slot table entries: tcp = 2, udp = 16
snd_task = (struct rpc_task *)(nil)
RPC wait queues
- binding (struct rpc_wait_queue *) 0xffff8808108748c8: qlen = 0, owner = 0
- sending (struct rpc_wait_queue *) 0xffff880810874990: qlen = 0, owner = 0
- backlog (struct rpc_wait_queue *) 0xffff880810874b20: qlen = 0, owner = 0
- pending (struct rpc_wait_queue *) 0xffff880810874a58: qlen = 1, owner = 0
non-xprt queues
- delay_queue (struct rpc_wait_queue *)0xffffffffa033db00 qlen = 0
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.