Why does high system load occur? Load average is high

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 4, 5, 6, 7, 8, 9

Issue

  • Why does high system load occur?
  • The system's load averages go up.
$ sar -q 
00:00:01      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
<snip>
14:50:01            0       322      0.07      0.05      0.04
15:20:01            0       331      2.46     53.30     68.72
15:42:19           95       343    109.15    107.25     91.37
16:00:01            0       311     38.94     87.63     98.53
16:10:01            0       297      0.00     11.75     51.62
16:20:01            0       295      0.00      1.56     27.03

On an 8-CPU server, a load average above 8 is considered high:

$ uptime
 15:11:10 up 29 days, 20:57,  1 user,  load average: 43.41, 27.10, 21.40

Resolution

  • Collect the output of ps auxH and/or ps -eLo pid,tid,stat,wchan=WIDE-WCHAN-COLUMN,comm at frequent intervals and review the results. Note that a process may contain one or more threads, so top -b -n 1 and ps aux do not show all threads, and threads in R or D state may be hidden. In addition, the WCHAN column shown by the ps command may help identify the root cause of a D state. (However, WCHAN does not work in Red Hat Enterprise Linux 8.)
  • Identify threads that are in R or D state, paying particular attention to those with high %CPU.
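The collection step above can be sketched as a small shell loop. This is a minimal sketch, not a supported tool: the function names filter_rd_threads and sample_rd_threads, the 5-second interval, and the 12-sample count are assumptions made for illustration. It reuses the ps command from this solution and keeps only the header and the threads whose STAT begins with R or D.

```shell
# Hypothetical helper: keep the header line and every thread whose STAT
# (third column) begins with R or D. Reads ps output on stdin.
filter_rd_threads() {
    awk 'NR == 1 || $3 ~ /^[RD]/'
}

# Hypothetical helper: print a timestamped snapshot of R/D threads
# every INTERVAL seconds, COUNT times (defaults are arbitrary).
sample_rd_threads() {
    interval=${1:-5}
    count=${2:-12}
    i=0
    while [ "$i" -lt "$count" ]; do
        date '+%F %T'
        ps -eLo pid,tid,stat,wchan=WIDE-WCHAN-COLUMN,comm | filter_rd_threads
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example (not run here): sample_rd_threads 5 12 > /tmp/rd-threads.log
```

Threads that appear in R or D state across many consecutive snapshots are the ones most likely to be contributing to the load average.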

Root Cause

  • The load average is calculated from the number of running or runnable threads (R) and threads in uninterruptible sleep (D).
  • If the load average is lower than the number of CPU cores/threads on the system, the system is not overloaded and has enough resources to handle the workload. If the load average is higher than the number of CPU cores/threads, the system is experiencing more demand than it can handle, which can lead to performance degradation and slower responsiveness. Because D-state threads are also counted, a high load average can come from threads waiting on I/O as well as from demand for CPU.
  • Please see How is a load average calculated? for more details.
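The comparison between load average and CPU count can be expressed as a simple ratio. The following is a minimal sketch; the function name load_per_cpu is hypothetical, and the sample values are taken from the uptime output in the Issue section (1-minute load of 43.41 on an 8-CPU server).

```shell
# Hypothetical helper: divide a load average by the number of CPUs.
# Arguments: load average, CPU count. A result above 1.00 means there are
# more R/D threads than CPUs on average over the sampling window.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Values from the uptime example: 43.41 on an 8-CPU server.
load_per_cpu 43.41 8    # prints 5.43

# On a live system (not run here), the inputs could be read like this:
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(grep -c ^processor /proc/cpuinfo)"
```

A ratio well above 1.00 only shows that threads are queuing. Whether they are waiting for CPU (R) or for something else such as I/O (D) still has to be checked with ps.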

Diagnostic Steps

$ sar -q 
00:00:01      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
<snip>
14:50:01            0       322      0.07      0.05      0.04
15:20:01            0       331      2.46     53.30     68.72
15:42:19           95       343    109.15    107.25     91.37
16:00:01            0       311     38.94     87.63     98.53
16:10:01            0       297      0.00     11.75     51.62
16:20:01            0       295      0.00      1.56     27.03

We can use the above sar example to understand what is happening.

Parameter   Explanation
runq-sz     Run queue length: number of processes ready to run but waiting for run time
plist-sz    Number of processes and threads in the process list
ldavg-1     System load average for the last minute
ldavg-5     System load average for the past 5 minutes
ldavg-15    System load average for the past 15 minutes

runq-sz is the number of processes waiting for run time. In the example above, runq-sz is 95 at 15:42:19.
Whether this indicates a problem depends on how many CPUs the server has:

$ grep -c ^processor /proc/cpuinfo
4

This server has 4 processors but 95 processes waiting to run.
If this occurred constantly, it would indicate that the server is overloaded.
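To check whether the run queue exceeds the CPU count constantly rather than in a single sample, the sar -q rows can be filtered. This is a minimal sketch under two assumptions: the helper name flag_busy_samples is hypothetical, and the timestamps are in the 24-hour single-field format shown above (an AM/PM column would shift the fields).

```shell
# Hypothetical helper: print sar -q rows whose runq-sz (second column)
# is greater than the CPU count given as the first argument.
# Header, <snip>, and Average: lines are skipped because they do not
# have a HH:MM:SS timestamp followed by a numeric runq-sz.
flag_busy_samples() {
    awk -v cpus="$1" '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ &&
        $2 ~ /^[0-9]+$/ &&
        $2 + 0 > cpus + 0'
}

# Example (not run here):
# sar -q | flag_busy_samples "$(grep -c ^processor /proc/cpuinfo)"
```

With the sar data above and a CPU count of 4, only the 15:42:19 sample would be printed, which suggests a spike rather than a sustained run queue.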

  • In addition, we can run the top command:
top - 13:50:01 up 11:15,  2 users,  load average: 454.81, 158.64, 56.44
Tasks: 1032 total,   5 running, 724 sleeping,   7 stopped, 296 zombie
%Cpu(s):  6.2 us,  8.6 sy,  0.0 ni, 79.1 id,  0.0 wa,  0.1 hi,  5.9 si,  0.0 st
MiB Mem : 128303.1 total, 126553.6 free,    818.9 used,    930.6 buff/cache
MiB Swap:   4096.0 total,   4078.7 free,     17.3 used. 126309.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 422560 root      20   0   40408   2464      0 R  86.8   0.0   0:20.43 stress-ng
 422559 root      20   0   40408   2464      0 R  86.5   0.0   0:20.45 stress-ng
 422563 root      20   0   40408   2464      0 R  85.5   0.0   0:20.54 stress-ng
 422561 root      20   0   40408   2464      0 R  84.8   0.0   0:20.27 stress-ng
    895 root      20   0  257728 177032 160332 S  16.8   0.1   1:06.96 systemd-journal
     14 root      20   0       0      0      0 S   7.3   0.0   0:01.17 ksoftirqd/0
     20 root      20   0       0      0      0 S   5.0   0.0   0:01.15 ksoftirqd/1
     26 root      20   0       0      0      0 S   5.0   0.0   0:00.97 ksoftirqd/2
     31 root      20   0       0      0      0 S   5.0   0.0   0:00.85 ksoftirqd/3
     96 root      20   0       0      0      0 S   3.3   0.0   0:00.38 ksoftirqd/16
     86 root      20   0       0      0      0 S   3.0   0.0   0:00.50 ksoftirqd/14
     51 root      20   0       0      0      0 S   2.6   0.0   0:00.64 ksoftirqd/7
     36 root      20   0       0      0      0 S   2.3   0.0   0:00.57 ksoftirqd/4
     41 root      20   0       0      0      0 S   2.3   0.0   0:00.74 ksoftirqd/5
     81 root      20   0       0      0      0 S   2.3   0.0   0:00.75 ksoftirqd/13
     91 root      20   0       0      0      0 S   2.3   0.0   0:00.86 ksoftirqd/15

In this case, the stress-ng processes may be causing the problem. Always look at processes whose state is 'R' or 'D'.
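Because top shows only the busiest tasks, it can help to count every thread in R or D state by command name. The following is a minimal sketch; the helper name count_rd_by_command is hypothetical, and it assumes ps -eLo stat,comm output (header line first) on stdin.

```shell
# Hypothetical helper: count threads in R or D state, grouped by state
# letter and command name, largest count first.
count_rd_by_command() {
    awk 'NR > 1 && $1 ~ /^[RD]/ {
        state = substr($1, 1, 1)
        sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # drop the STAT column, keep the command name
        n[state " " $0]++
    }
    END { for (k in n) print n[k], k }' | sort -rn
}

# Example (not run here): ps -eLo stat,comm | count_rd_by_command
```

Each output line has the form "count state command", so a command with many threads in D state stands out even when its %CPU is low.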


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.