Why does high system load occur? Load average is high
Environment
- Red Hat Enterprise Linux 4, 5, 6, 7, 8, 9
Issue
- Why does high system load occur?
- The system's load averages go up.
$ sar -q
00:00:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
<snip>
14:50:01 0 322 0.07 0.05 0.04
15:20:01 0 331 2.46 53.30 68.72
15:42:19 95 343 109.15 107.25 91.37
16:00:01 0 311 38.94 87.63 98.53
16:10:01 0 297 0.00 11.75 51.62
16:20:01 0 295 0.00 1.56 27.03
On a server with 8 CPUs, a load average above 8 is considered high:
$ uptime
15:11:10 up 29 days, 20:57, 1 user, load average: 43.41, 27.10, 21.40
Resolution
- Run `ps auxH` and/or `ps -eLo pid,tid,stat,wchan=WIDE-WCHAN-COLUMN,comm` frequently and review the results. Note that a process may contain one or more threads, so `top -b -n 1` and `ps aux` do not show all threads, and threads in R or D state may be hidden. In addition, the `WCHAN` column shown by the `ps` command may help identify the root cause of a D state. (However, `WCHAN` does not work in Red Hat Enterprise Linux 8.)
- Identify threads that are in R or D state and are consuming high %CPU.
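The repeated sampling described above can be sketched as a pair of small shell functions. This is a minimal sketch, not a supported tool: the function names `filter_rd_threads` and `sample_rd_threads` are hypothetical, and the sample count and interval defaults are arbitrary assumptions.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line plus every thread whose
# STAT column (field 3) starts with R or D.
# Expects `ps -eLo pid,tid,stat,wchan=WIDE-WCHAN-COLUMN,comm` output on stdin.
filter_rd_threads() {
    awk 'NR == 1 || $3 ~ /^[RD]/'
}

# Hypothetical helper: take COUNT samples, INTERVAL seconds apart
# (defaults of 5 samples and 1 second are arbitrary).
sample_rd_threads() {
    count=${1:-5}
    interval=${2:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        date '+%F %T'
        ps -eLo pid,tid,stat,wchan=WIDE-WCHAN-COLUMN,comm | filter_rd_threads
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_rd_threads 10 2 > rd-threads.log` would collect ten samples two seconds apart. The `ps` process that takes the sample is itself running, so it usually appears in the output in R state.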
Root Cause
- The load average is calculated from the number of running or runnable threads (R) and threads in uninterruptible sleep (D).
- If the load average is lower than the number of CPU cores/threads on the system, the system is generally not overloaded and has enough resources to handle the workload. If the load average is higher than the number of CPU cores/threads, more threads are demanding resources than the system can serve at once, which can lead to performance degradation and slower responsiveness. Because threads in D state are counted as well, a high load average can also reflect threads waiting on I/O rather than on CPU.
- Please see How is a load average calculated? for more details.
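The comparison between load average and CPU count described in this section can be sketched as follows. This is an illustration only: the function name `load_per_cpu` is hypothetical, and a per-CPU ratio of 1 is used as the threshold simply because that is the rule of thumb stated above.

```shell
#!/bin/sh
# Hypothetical helper: compare a load average with a CPU count.
# Usage: load_per_cpu LOAD NCPUS
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        ratio = load / cpus
        verdict = (ratio > 1) ? "above CPU count" : "within CPU count"
        printf "load=%.2f cpus=%d per-cpu=%.2f (%s)\n", load, cpus, ratio, verdict
    }'
}

# On a live system (assumes Linux /proc/loadavg and the coreutils nproc command):
#   read -r one five fifteen rest < /proc/loadavg
#   load_per_cpu "$one" "$(nproc)"
```

With the `uptime` values from the Issue section, `load_per_cpu 43.41 8` reports a 1-minute load of roughly 5.4 per CPU.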
Diagnostic Steps
$ sar -q
00:00:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
<snip>
14:50:01 0 322 0.07 0.05 0.04
15:20:01 0 331 2.46 53.30 68.72
15:42:19 95 343 109.15 107.25 91.37
16:00:01 0 311 38.94 87.63 98.53
16:10:01 0 297 0.00 11.75 51.62
16:20:01 0 295 0.00 1.56 27.03
We can use the above sar example to understand what is happening.
| Parameter | Explanation |
|---|---|
| runq-sz | Run queue length, number of processes ready to run but waiting for run time |
| plist-sz | Number of processes and threads in the process list |
| ldavg-1 | System load average for the last minute |
| ldavg-5 | System load average for the past 5 minutes |
| ldavg-15 | System load average for the past 15 minutes |
runq-sz is the number of processes waiting for run time. In the sar data above, runq-sz is 95 at 15:42:19.
This may or may not indicate a problem; we first need to determine how many CPUs the server has.
$ grep -c ^processor /proc/cpuinfo
4
We can see that this server has 4 processors but 95 processes waiting to run.
If this occurred constantly, it would indicate that the server is overloaded.
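The check above can be applied to a whole day of sar data at once. The following is a minimal sketch, assuming the six-column `sar -q` layout and 24-hour timestamps shown in this article (other sysstat versions may print additional columns); the function name `flag_busy_samples` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print sar -q samples whose runq-sz exceeds the CPU count.
# Assumes the column layout shown in this article:
#   time runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
# Usage: sar -q | flag_busy_samples NCPUS
flag_busy_samples() {
    awk -v cpus="$1" '
        # Skip headers and non-data lines: field 2 must be a plain integer.
        $2 ~ /^[0-9]+$/ && $2 + 0 > cpus + 0 {
            printf "%s runq-sz=%d ldavg-1=%s (cpus=%d)\n", $1, $2, $4, cpus
        }'
}
```

A single flagged sample is not conclusive on its own; many consecutive flagged samples are what point to sustained overload.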
- In addition, we can run the top command:
top - 13:50:01 up 11:15, 2 users, load average: 454.81, 158.64, 56.44
Tasks: 1032 total, 5 running, 724 sleeping, 7 stopped, 296 zombie
%Cpu(s): 6.2 us, 8.6 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.1 hi, 5.9 si, 0.0 st
MiB Mem : 128303.1 total, 126553.6 free, 818.9 used, 930.6 buff/cache
MiB Swap: 4096.0 total, 4078.7 free, 17.3 used. 126309.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
422560 root 20 0 40408 2464 0 R 86.8 0.0 0:20.43 stress-ng
422559 root 20 0 40408 2464 0 R 86.5 0.0 0:20.45 stress-ng
422563 root 20 0 40408 2464 0 R 85.5 0.0 0:20.54 stress-ng
422561 root 20 0 40408 2464 0 R 84.8 0.0 0:20.27 stress-ng
895 root 20 0 257728 177032 160332 S 16.8 0.1 1:06.96 systemd-journal
14 root 20 0 0 0 0 S 7.3 0.0 0:01.17 ksoftirqd/0
20 root 20 0 0 0 0 S 5.0 0.0 0:01.15 ksoftirqd/1
26 root 20 0 0 0 0 S 5.0 0.0 0:00.97 ksoftirqd/2
31 root 20 0 0 0 0 S 5.0 0.0 0:00.85 ksoftirqd/3
96 root 20 0 0 0 0 S 3.3 0.0 0:00.38 ksoftirqd/16
86 root 20 0 0 0 0 S 3.0 0.0 0:00.50 ksoftirqd/14
51 root 20 0 0 0 0 S 2.6 0.0 0:00.64 ksoftirqd/7
36 root 20 0 0 0 0 S 2.3 0.0 0:00.57 ksoftirqd/4
41 root 20 0 0 0 0 S 2.3 0.0 0:00.74 ksoftirqd/5
81 root 20 0 0 0 0 S 2.3 0.0 0:00.75 ksoftirqd/13
91 root 20 0 0 0 0 S 2.3 0.0 0:00.86 ksoftirqd/15
In this case, the stress-ng processes may be the cause of the problem. Always look at processes whose state is 'R' or 'D'.
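To see which commands own the R and D threads, the `ps` output can be grouped by command name. This is a sketch under stated assumptions: the function name `count_rd_by_command` is hypothetical, and it expects header-less input as produced by `ps -eLo stat=,comm=`.

```shell
#!/bin/sh
# Hypothetical helper: count R/D threads per command name, busiest first.
# Expects `ps -eLo stat=,comm=` output on stdin (no header line).
count_rd_by_command() {
    awk '$1 ~ /^[RD]/ {
             # comm may contain spaces; rebuild it from field 2 onward
             name = $2
             for (i = 3; i <= NF; i++) name = name " " $i
             n[name]++
         }
         END { for (c in n) printf "%d %s\n", n[c], c }' | sort -k1,1nr -k2
}

# On a live system:
#   ps -eLo stat=,comm= | count_rd_by_command
```

A command that stays at the top of this list across several samples is a reasonable first candidate to investigate.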