High load and a lot of zombie processes in OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Zombie processes
Issue
- On the worker node, there are lots of zombie processes.
- The response time from the worker node is very high: an SSH session can take up to 5 minutes to be established.
Resolution
A fix for a leak in CRI-O that caused large numbers of zombie processes was released in OpenShift 4.10.3 and older releases, which are now End of Life (EOL) and no longer supported.
If a similar issue appears in newer releases, identify the cause of the zombie processes by following the "Diagnostic Steps" section, and open a new support case in case of doubts.
Workaround
While the issue is being investigated in depth, as a workaround it is possible to identify the pods/containers generating zombie processes and restart them before reaching the limits that cause issues on the node. It is important to know beforehand whether the application tolerates a pod/container restart without application downtime.
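As an illustrative sketch of the workaround (the pod and namespace names are placeholders to be filled in from the "Diagnostic Steps" section), the node-wide zombie count can be checked from a debug shell on the node, and the offending pod restarted by deleting it so its controller recreates it:

```shell
# Count zombie processes on the node (run inside "oc debug node/<node>"
# after "chroot /host"); process states starting with Z are defunct.
ps -eo stat= | grep -c '^Z'

# Restart the offending pod by deleting it; POD and NAMESPACE are
# placeholders identified via the Diagnostic Steps below.
oc delete pod "$POD" -n "$NAMESPACE"
```

Re-running the count after the restart should show the zombies held by that pod's conmon being reaped.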
Root Cause
Under high node pressure, failing exec probes cause many defunct processes under conmon (where CRI-O is the root PID). Additional information can be found in the related Bugzilla reports:
- Observing a lot of defunct processes.
- OCP worker node high load: CRI-O, systemd and python processes and a lot of zombies.
Diagnostic Steps
- Check the system load inside the node:

```
$ top
    PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   1760 root     20   0 3899932 284484  35020 S 314.2  0.3  10819:46 crio
2522139 1972     20   0  565736  28432   9400 S  99.7  0.0   4562:29 python3.6
      1 root     20   0 1001604 761708   9020 R  99.0  0.8   3679:46 systemd
  82670 12000    20   0 6954496   4.1g 525692 S  36.8  4.3 601:06.50 hdbnameserver
   1810 root     20   0   12.2g   4.7g  65072 S  17.2  5.0   2126:30 kubelet
 229129 polkitd  20   0   11.5g 969876  22208 S  15.9  1.0  62:12.00 java
  86051 polkitd  20   0 8490076   4.2g 149548 S   5.6  4.4 405:00.00 prometheus
   3393 nfsnobo+ 20   0  722488  49188  10596 S   4.6  0.0 174:23.54 node_exporter
    952 root     20   0  509000 328656 245028 S   3.3  0.3 118:05.80 systemd-journal
   1462 openvsw+ 10 -10 1247220  97308  34684 S   1.3  0.1  98:00.46 ovs-vswitchd
 209869 polkitd  20   0  230204  61060   8028 S   1.0  0.1  21:25.43 python
 719937 8797     20   0 2359672  77840  47172 S   1.0  0.1   1:42.03 vsystem
1057827 core     20   0   71496   6852   4640 R   1.0  0.0   0:00.13 top
```
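Alongside the load check, a quick summary of process states gives a sense of how many zombies have accumulated; a minimal sketch (run on the node):

```shell
# Summarize process states by their first letter; a large "Z" count
# confirms the zombie build-up behind the high load.
ps -eo stat= | cut -c1 | sort | uniq -c | sort -rn
```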
- Check if the defunct processes are not conmon processes, and get the PPID of the defunct processes:

```
$ ps -elfL | grep "defunct"
F S UID      PID   PPID    LWP   C NLWP PRI NI ADDR SZ WCHAN STIME TTY TIME     CMD
[...]
0 Z 1000870+ 11137 2517948 11137 0  1   80  0  -     0 -     Feb26 ?   00:00:00 [sleep] <defunct>
0 Z 1000870+ 11148 2518113 11148 0  1   80  0  -     0 -     Feb26 ?   00:00:00 [sleep] <defunct>
5 Z 1000870+ 11193 2518113 11193 0  1   80  0  -     0 -     Feb26 ?   00:00:00 [readinessProbe.] <defunct>
0 Z 1000870+ 11195 2518113 11195 0  1   80  0  -     0 -     Feb26 ?   00:00:00 [cat] <defunct>
0 Z 1000870+ 11354 2517948 11354 0  1   80  0  -     0 -     Feb26 ?   00:00:00 [sleep] <defunct>
[...]
```

Note: the PPID is the fifth field, as per the header "F S UID PID PPID LWP C NLWP [...]".
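The manual inspection above can be automated; a minimal sketch (the `ps` field list is an assumption, adjust as needed) that ranks parent PIDs by how many defunct children they hold:

```shell
# Count defunct processes per parent PID; the top entries are the PPIDs
# worth looking up in the pstree output.
ps -eo ppid=,stat= \
  | awk '$2 ~ /^Z/ {n[$1]++} END {for (p in n) print n[p], p}' \
  | sort -rn | head
```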
- Search for those PPIDs in the pstree output:

```
$ pstree -lp
[...]
|-conmon(2517680)---java(2517948)-+-cat(299119)
|                                 |-cat(366094)
|                                 |-cat(690949)
[...]
|                                 |-python(4128068)
|                                 |-python(4137536)
|                                 |-python(4153457)
|                                 |-readinessProbe.(29888)
|                                 |-readinessProbe.(38183)
|                                 |-readinessProbe.(38304)
[...]
|-conmon(2517691)---java(2518113)-+-cat(11195)
|                                 |-cat(245198)
|                                 |-cat(297695)
[...]
|                                 |-grep(3973715)
|                                 |-grep(4066987)
|                                 |-python(33702)
[...]
|                                 |-python(4185895)
|                                 |-python(4193309)
|                                 |-readinessProbe.(11193)
|                                 |-readinessProbe.(81023)
[...]
```
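To translate a parent PID from pstree back to the pod it belongs to, the container ID can usually be read from the process's cgroup path and then resolved with crictl. This is a hedged sketch: it assumes the default CRI-O cgroup naming (`crio-<64-hex-id>.scope`), and the example PID is taken from the output above:

```shell
# Parent PID of the defunct processes; replace with the value from your node.
ZPPID=2517948

# Extract the 64-character hex CRI-O container ID from the cgroup path.
CID=$(sed -n 's/.*crio-\([0-9a-f]\{64\}\)\.scope.*/\1/p' "/proc/${ZPPID}/cgroup" | head -n1)

# Resolve the container ID to its pod name and namespace labels.
crictl inspect "$CID" | grep -E '"io.kubernetes.pod.(name|namespace)"'
```

With the pod identified, the workaround above (restarting the pod) can be applied.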
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.