NotReady node with high load and many D state processes in OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Node
Issue
- There is a node in NotReady status with very high load and many processes in D state.
- The load average shown for the node is very high, while there are no processes using excessive CPU.
Resolution
Follow the Diagnostic Steps section to find out why many processes are in D state.
If the cause is rpc_wait_bit_killable, it typically means the processes are waiting on a response from an NFS server, so the NFS server and the connectivity to it need to be checked.
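As a starting point for checking NFS, it can help to see which NFS servers the node currently mounts from. The following is a minimal sketch and not part of the original solution: the function name list_nfs_mounts is hypothetical, it assumes the usual /proc/mounts field order (device, mount point, filesystem type), and it does not handle IPv6 server addresses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<server> <mount point>" for each NFS mount.
# Reads /proc/mounts-style lines on stdin.
list_nfs_mounts() {
    # Field 3 is the filesystem type; field 1 is "server:/export" for NFS.
    awk '$3 ~ /^nfs4?$/ { split($1, part, ":"); print part[1], $2 }'
}

# Example on a live node (after chroot /host):
#   list_nfs_mounts < /proc/mounts
```

The servers listed this way are the ones whose availability and network reachability are worth checking first.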
To clear NFS blocked processes that have not automatically recovered, the system will need to be rebooted, as explained in "Is there a way to kill a process in the 'Z' or 'D' state without a reboot?":
> If operations do not complete due to a hardware or software fault (for example network connectivity problems) it may not be possible to eliminate all processes in D state without rebooting.
There is additional information about processes in D state in 'What is "D" state (or dstate, d-state)?'.
For other causes shown in the ps -elfL WCHAN column, please open a new Support Case for troubleshooting.
Root Cause
If there are many processes in D state due to rpc_wait_bit_killable, it typically means the processes are waiting on a response from an NFS server. When an NFS client is waiting on the server and does not receive a response, the processes go into D state. On Linux, the load average counts tasks in uninterruptible sleep (D state) as well as runnable tasks, so the load average will be very high and close to the number of blocked processes, even though CPU usage is low.
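The relationship between the load average and the number of blocked threads can be checked directly. The sketch below is an illustration under assumptions, not part of the original solution: the function names count_d_threads and one_minute_load are hypothetical, and the input formats are assumed to be one state code per line (as printed by ps -eLo state=) and the /proc/loadavg layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count threads in uninterruptible sleep (state D).
# Expects one process state code per line on stdin.
count_d_threads() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: print the 1-minute load average.
# Expects a /proc/loadavg-style line on stdin (first field is the 1-minute value).
one_minute_load() {
    awk '{ print $1; exit }'
}

# Example on a live node (after chroot /host):
#   ps -eLo state= | count_d_threads
#   one_minute_load < /proc/loadavg
```

If the two numbers are close, the load is explained by blocked threads rather than by CPU demand.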
Diagnostic Steps
- Access the node via oc debug node/[node_name] or ssh. If accessed via oc debug node, then run chroot /host bash.
- Check if the uptime command shows a very high load average:

      # uptime
       11:25:02 up 25 days,  8:45,  4 users,  load average: 3524.92, 3524.04, 3523.24

- Check the WCHAN (waiting channel) column for the processes in D state, which shows the name of the kernel function in which the process is sleeping, and count the occurrences of each value:

      # ps -elfL | awk '{if($2~"D"){print $13}}' | sort | uniq -c
         3519 rpc_wa

- If the output shows rpc_wa, it is rpc_wait_bit_killable (the column is truncated), and it typically means the processes are waiting on a response from an NFS server.
- Check if there are messages related to NFS in the dmesg output:

      $ grep -i nfs sos_commands/kernel/dmesg | wc -l
      10750
      $ grep -i nfs sos_commands/kernel/dmesg | less
      [...]
      [204462.604701] nfs: server 10.0.0.1 not responding, timed out
      [...]
      [9585786.855380] nfs: server 10.0.0.1 not responding, still trying
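The last three steps can be combined into small helpers. This is a minimal sketch rather than an official procedure: the function names wchan_tally and nfs_unresponsive_servers are hypothetical, the first assumes the ps -elfL column layout used above (state in column 2, WCHAN in column 13), and the second assumes kernel messages of the form "nfs: server <address> not responding" as in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally D-state threads by WCHAN, most frequent first.
# Expects "ps -elfL" output on stdin.
wchan_tally() {
    awk '$2 ~ /D/ { count[$13]++ }
         END { for (w in count) print count[w], w }' | sort -rn
}

# Hypothetical helper: count "not responding" messages per NFS server.
# Expects dmesg output on stdin.
nfs_unresponsive_servers() {
    awk '/nfs: server .* not responding/ {
             # The server address is the field following the word "server".
             for (i = 1; i < NF; i++)
                 if ($i == "server") { count[$(i + 1)]++; break }
         }
         END { for (s in count) print count[s], s }' | sort -rn
}

# Example on a live node (after chroot /host):
#   ps -elfL | wchan_tally
#   dmesg | nfs_unresponsive_servers
# Example against a sosreport:
#   nfs_unresponsive_servers < sos_commands/kernel/dmesg
```

A single dominant WCHAN value together with one server appearing in most of the "not responding" messages points at that server, or the network path to it, as the place to investigate.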