High load and a lot of zombie processes in OpenShift 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Zombie processes

Issue

  • On the worker node, there are lots of zombie processes.
  • The response time from the worker node is very high. An ssh session can take up to 5 minutes to be established.

Resolution

A leak in CRI-O that caused large numbers of zombie processes was fixed in OpenShift 4.10.3 and in older releases. Those older releases are now EOL and no longer supported.

If a similar issue occurs in newer releases, check the cause of the zombie processes by following the "Diagnostic Steps" section and, if in doubt, open a new support case.

Workaround

While the issue is being investigated, a possible workaround is to identify the pods/containers generating zombie processes and restart them before the node reaches the limits that cause problems. Confirm first that the application tolerates a pod/container restart without downtime.
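
As a rough illustration of "before reaching the limits", the sketch below checks whether the number of zombie processes has reached a chosen threshold. The function name `zombies_over_limit` and the threshold value are hypothetical and not part of the original solution; pick a limit that suits the node.

```shell
# Succeed (exit 0) when the number of zombie processes reaches a limit.
# Input on stdin: one process state per line, as printed by: ps -eo stat=
# The limit passed as $1 is an assumed value, not a recommended one.
zombies_over_limit() {
    limit=$1
    # Zombie states start with "Z" (for example "Z", "Zs", "Z+").
    count=$(awk '$1 ~ /^Z/ { n++ } END { print n + 0 }')
    [ "$count" -ge "$limit" ]
}

# Example use on a node (hypothetical limit of 1000):
#   ps -eo stat= | zombies_over_limit 1000 && echo "zombie limit reached"
```

If the affected pod is managed by a controller such as a Deployment, deleting it with `oc delete pod <pod> -n <namespace>` causes it to be recreated, which is one way to perform the restart.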

Root Cause

Under high node pressure, failing exec probes leave behind many defunct conmon processes (with CRI-O as the root PID). Additional information can be found in the related BZs.

Diagnostic Steps

  • Check the system load inside the node:

    $ top
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
       1760 root      20   0 3899932 284484  35020 S 314.2   0.3  10819:46 crio
    2522139 1972      20   0  565736  28432   9400 S  99.7   0.0   4562:29 python3.6
          1 root      20   0 1001604 761708   9020 R  99.0   0.8   3679:46 systemd
      82670 12000     20   0 6954496   4.1g 525692 S  36.8   4.3 601:06.50 hdbnameserver
       1810 root      20   0   12.2g   4.7g  65072 S  17.2   5.0   2126:30 kubelet
     229129 polkitd   20   0   11.5g 969876  22208 S  15.9   1.0  62:12.00 java
      86051 polkitd   20   0 8490076   4.2g 149548 S   5.6   4.4 405:00.00 prometheus
       3393 nfsnobo+  20   0  722488  49188  10596 S   4.6   0.0 174:23.54 node_exporter
        952 root      20   0  509000 328656 245028 S   3.3   0.3 118:05.80 systemd-journal
       1462 openvsw+  10 -10 1247220  97308  34684 S   1.3   0.1  98:00.46 ovs-vswitchd
     209869 polkitd   20   0  230204  61060   8028 S   1.0   0.1  21:25.43 python
     719937 8797      20   0 2359672  77840  47172 S   1.0   0.1   1:42.03 vsystem
    1057827 core      20   0   71496   6852   4640 R   1.0   0.0   0:00.13 top
    
  • Check whether the defunct processes are conmon processes or other processes, and note the PPID of each defunct process:

    $ ps -elfL | grep "defunct"
    F S UID          PID    PPID     LWP  C NLWP PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
    [...]
    0 Z 1000870+   11137 2517948   11137  0    1  80   0 -     0 -      Feb26 ?        00:00:00 [sleep] <defunct>
    0 Z 1000870+   11148 2518113   11148  0    1  80   0 -     0 -      Feb26 ?        00:00:00 [sleep] <defunct>
    5 Z 1000870+   11193 2518113   11193  0    1  80   0 -     0 -      Feb26 ?        00:00:00 [readinessProbe.] <defunct>
    0 Z 1000870+   11195 2518113   11195  0    1  80   0 -     0 -      Feb26 ?        00:00:00 [cat] <defunct>
    0 Z 1000870+   11354 2517948   11354  0    1  80   0 -     0 -      Feb26 ?        00:00:00 [sleep] <defunct>
    [...]
    

    Note: the PPID is the fifth field as per the header "F S UID PID PPID LWP C NLWP [...]"
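
When there are many defunct processes, grouping them by PPID can make the main parents easier to spot. The following is a minimal sketch; the function name `count_zombies_by_ppid` is hypothetical, and it assumes input in the four-column form printed by `ps -eo stat=,pid=,ppid=,comm=` rather than the `ps -elfL` layout shown above.

```shell
# Count defunct (state Z) processes per parent PID.
# Input on stdin: lines of "STATE PID PPID COMMAND", as printed by
#   ps -eo stat=,pid=,ppid=,comm=
# Output: "<count> <PPID>" lines, highest count first.
count_zombies_by_ppid() {
    awk '$1 ~ /^Z/ { count[$3]++ }
         END { for (p in count) print count[p], p }' | sort -rn
}

# Example use on a node:
#   ps -eo stat=,pid=,ppid=,comm= | count_zombies_by_ppid | head
```

The PPIDs at the top of the output are the ones to look up first in the next step.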

  • Search for those PPIDs in the output of the pstree command:

    $ pstree -lp
    [...]
               |-conmon(2517680)---java(2517948)-+-cat(299119)
               |                                 |-cat(366094)
               |                                 |-cat(690949)
               [...]                             [...]
               |                                 |-python(4128068)
               |                                 |-python(4137536)
               |                                 |-python(4153457)
               |                                 |-readinessProbe.(29888)
               |                                 |-readinessProbe.(38183)
               |                                 |-readinessProbe.(38304)
               [...]                             [...]
               [...]                             [...]
               |-conmon(2517691)---java(2518113)-+-cat(11195)
               |                                 |-cat(245198)
               |                                 |-cat(297695)
               [...]                             [...]
               |                                 |-grep(3973715)
               |                                 |-grep(4066987)
               |                                 |-python(33702)
               [...]                             [...]
               |                                 |-python(4185895)
               |                                 |-python(4193309)
               |                                 |-readinessProbe.(11193)
               |                                 |-readinessProbe.(81023)
    [...]
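
Once a parent PID is known (for example 2518113 above), its cgroup entry can help tie it to a container. The sketch below is an assumption-laden illustration: it assumes the cgroup path contains a component of the form `crio-<64 hex characters>.scope`, which may differ depending on the cgroup driver in use, and the function name `container_id_from_cgroup` is hypothetical.

```shell
# Extract a CRI-O container ID from cgroup data read on stdin.
# Assumption: the cgroup path contains "crio-<64 hex chars>.scope".
# Prints the first matching container ID, or nothing if none is found.
container_id_from_cgroup() {
    grep -oE 'crio-[0-9a-f]{64}' | head -n 1 | cut -d- -f2
}

# Example use on a node, for one of the PPIDs found earlier:
#   container_id_from_cgroup < /proc/2518113/cgroup
#   crictl inspect <container ID>
```

The `crictl inspect` output includes the pod name and namespace labels, which identify the pod to review or restart.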
    