How to troubleshoot Out of memory (OOM) killer in Red Hat Enterprise Linux

Environment

  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • Red Hat OpenStack Platform 9.0
  • Red Hat OpenStack Platform 8.0
  • Red Hat OpenStack Platform 7.0

Issue

  • The Out of memory (OOM) killer is killing active processes.
  • Services on Red Hat OpenStack Platform nodes are randomly dying.
  • Memory usage seems to be high on Red Hat OpenStack Platform nodes.
  • Available memory on Red Hat OpenStack Platform nodes seems to be low.
  • Messages such as `Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child` are seen in the logs:
[Wed Aug 31 16:36:42 2016] Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
[Wed Aug 31 16:36:42 2016] Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
[Wed Aug 31 16:56:25 2016] keystone-all invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Wed Aug 31 16:56:25 2016] keystone-all cpuset=/ mems_allowed=0
[Wed Aug 31 16:56:25 2016] CPU: 2 PID: 88196 Comm: keystone-all Not tainted 3.10.0-327.13.1.el7.x86_64 #1

Resolution

  • When troubleshooting why the Out of memory (OOM) killer is being invoked, one must look at a few factors on the system. Generally, the OOM killer is triggered for one of a handful of reasons (a quick triage sketch follows the table):

    Reason | Probable Cause
    -|-
    1 | Spike in memory usage from a load event (additional processes are spawned to handle increased load).
    2 | Spike in memory usage from additional services being added or migrated to the system (another application was installed or a new service was started).
    3 | Spike in memory usage due to failed hardware, such as a DIMM memory module.
    4 | Spike in memory usage due to undersized hardware resources for the running application(s).
    5 | A memory leak in a running application.
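
    As a quick first triage step, list the largest memory consumers by resident set size; a single service whose usage keeps growing between checks points toward reasons 2, 4, or 5. This is a minimal sketch using standard `ps` options:

    # List the 15 largest memory consumers by resident set size (RSS, in kB)
    $ ps -eo pid,user,comm,rss,vsz --sort=-rss | head -n 15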


Troubleshooting

  • Check how often the Out of memory (OOM) killer is being invoked. This gives you a gauge of how frequently the memory constraint is being hit. In this example, the OOM killer is killing a process as often as every 10 minutes, which is quite frequent (a sketch for grouping these events by hour follows the log excerpt):

    $ egrep 'Out of memory:' /var/log/messages 
    Aug 31 10:57:27 hostname kernel: Out of memory: Kill process 121952 (glance-api) score 72 or sacrifice child
    Aug 31 10:57:29 hostname kernel: Out of memory: Kill process 55333 (keystone-all) score 79 or sacrifice child
    Aug 31 10:57:31 hostname kernel: Out of memory: Kill process 4106 (keystone-all) score 98 or sacrifice child
    Aug 31 10:57:34 hostname kernel: Out of memory: Kill process 54961 (keystone-all) score 130 or sacrifice child
    Aug 31 10:57:41 hostname kernel: Out of memory: Kill process 86345 (keystone-all) score 196 or sacrifice child
    Aug 31 10:58:54 hostname kernel: Out of memory: Kill process 121731 (keystone-all) score 260 or sacrifice child
    Aug 31 11:09:44 hostname kernel: Out of memory: Kill process 9495 (keystone-all) score 261 or sacrifice child
    Aug 31 11:12:15 hostname kernel: Out of memory: Kill process 23113 (keystone-all) score 261 or sacrifice child
    Aug 31 14:09:27 hostname kernel: Out of memory: Kill process 43547 (keystone-all) score 261 or sacrifice child
    Aug 31 14:21:15 hostname kernel: Out of memory: Kill process 43743 (keystone-all) score 181 or sacrifice child
    Aug 31 14:29:45 hostname kernel: Out of memory: Kill process 58542 (keystone-all) score 209 or sacrifice child
    Aug 31 14:40:27 hostname kernel: Out of memory: Kill process 43545 (keystone-all) score 105 or sacrifice child
    Aug 31 14:47:30 hostname kernel: Out of memory: Kill process 56050 (keystone-all) score 125 or sacrifice child
    Aug 31 14:55:00 hostname kernel: Out of memory: Kill process 35892 (keystone-all) score 152 or sacrifice child
    Aug 31 15:07:52 hostname kernel: Out of memory: Kill process 75726 (keystone-all) score 198 or sacrifice child
    Aug 31 15:31:16 hostname kernel: Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
    Aug 31 15:50:59 hostname kernel: Out of memory: Kill process 88196 (keystone-all) score 507 or sacrifice child
    Sep  2 12:35:54 hostname kernel: Out of memory: Kill process 52446 (keystone-all) score 473 or sacrifice child
    Sep  2 18:43:33 hostname kernel: Out of memory: Kill process 97213 (keystone-all) score 166 or sacrifice child
    Sep  2 18:54:48 hostname kernel: Out of memory: Kill process 31716 (keystone-all) score 178 or sacrifice child
    Sep  2 19:09:12 hostname kernel: Out of memory: Kill process 45010 (keystone-all) score 148 or sacrifice child
    Sep  2 19:21:41 hostname kernel: Out of memory: Kill process 7073 (keystone-all) score 191 or sacrifice child
    Sep  2 19:47:21 hostname kernel: Out of memory: Kill process 116139 (keystone-all) score 279 or sacrifice child
    Sep  2 20:33:12 hostname kernel: Out of memory: Kill process 108621 (keystone-all) score 460 or sacrifice child
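
    To quantify the frequency, the same log lines can be grouped by hour. A small sketch, assuming the default syslog timestamp format shown above ("Aug 31 10:57:27 ..."):

    # Count OOM kills per hour; the first three syslog fields are month, day
    # and time, so truncating the time to the hour gives a bucket to count
    $ grep 'Out of memory:' /var/log/messages | \
          awk '{ print $1, $2, substr($3, 1, 2)":00" }' | sort | uniq -c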
    
  • Check how much memory the processes being killed were consuming. Here the killed processes were each consuming anywhere from roughly 1 GB to 8 GB of memory (a one-liner to summarize these sizes follows the log excerpt):

    $ egrep 'total-vm' /var/log/messages
    Aug 31 10:57:27 hostname kernel: Killed process 121952 (glance-api) total-vm:5698360kB, anon-rss:1174988kB, file-rss:0kB
    Aug 31 10:57:29 hostname kernel: Killed process 55333 (keystone-all) total-vm:1681340kB, anon-rss:1292656kB, file-rss:564kB
    Aug 31 10:57:31 hostname kernel: Killed process 4106 (keystone-all) total-vm:1979684kB, anon-rss:1591104kB, file-rss:4kB
    Aug 31 10:57:34 hostname kernel: Killed process 54961 (keystone-all) total-vm:2496376kB, anon-rss:2108056kB, file-rss:568kB
    Aug 31 10:57:41 hostname kernel: Killed process 86345 (keystone-all) total-vm:3580248kB, anon-rss:3191708kB, file-rss:348kB
    Aug 31 10:58:54 hostname kernel: Killed process 121731 (keystone-all) total-vm:4620304kB, anon-rss:4231956kB, file-rss:0kB
    Aug 31 11:09:44 hostname kernel: Killed process 9495 (keystone-all) total-vm:4626764kB, anon-rss:4238416kB, file-rss:496kB
    Aug 31 11:12:15 hostname kernel: Killed process 23113 (keystone-all) total-vm:4631276kB, anon-rss:4242928kB, file-rss:0kB
    Aug 31 14:09:27 hostname kernel: Killed process 43547 (keystone-all) total-vm:4631904kB, anon-rss:4243560kB, file-rss:0kB
    Aug 31 14:21:15 hostname kernel: Killed process 43743 (keystone-all) total-vm:3336616kB, anon-rss:2943280kB, file-rss:72kB
    Aug 31 14:29:45 hostname kernel: Killed process 58542 (keystone-all) total-vm:3788528kB, anon-rss:3395048kB, file-rss:60kB
    Aug 31 14:40:27 hostname kernel: Killed process 43545 (keystone-all) total-vm:2096368kB, anon-rss:1705116kB, file-rss:0kB
    Aug 31 14:47:30 hostname kernel: Killed process 56050 (keystone-all) total-vm:2430728kB, anon-rss:2040104kB, file-rss:0kB
    Aug 31 14:55:00 hostname kernel: Killed process 35892 (keystone-all) total-vm:2866124kB, anon-rss:2473448kB, file-rss:68kB
    Aug 31 15:07:52 hostname kernel: Killed process 75726 (keystone-all) total-vm:3605284kB, anon-rss:3211852kB, file-rss:68kB
    Aug 31 15:31:16 hostname kernel: Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
    Aug 31 15:50:59 hostname kernel: Killed process 88196 (keystone-all) total-vm:8600652kB, anon-rss:8224512kB, file-rss:0kB
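
    To summarize those sizes, extract the anon-rss values and convert them to gigabytes. A small sketch, again assuming the kernel message format shown above:

    # Print the anonymous RSS of each killed process in GB, smallest first
    $ grep -o 'anon-rss:[0-9]*' /var/log/messages | \
          awk -F: '{ printf "%.1f GB\n", $2 / 1024 / 1024 }' | sort -n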
    
  • For further analysis, check the system activity reporter (SAR) data to see what it has captured about the OS. In this example the SAR data is consistent with the memory pressure seen above: sar shows low CPU utilization (good) and low I/O wait (good), but very high memory utilization and commit percentages (a concern). During busy periods `%commit` rises above 100, indicating the system is overcommitting memory and is memory starved:

    - Check swap statistics with the `-S` flag; a high `%swpused` value indicates swapping and a memory shortage (these columns may be absent if there is no swap partition or file). A complementary paging check is sketched after the output below:
    
    
      ```
      $ sar -S -f /var/log/sa/sa23 
      Linux 3.10.0-327.22.2.el7.x86_64 (hostname) 	23/08/16 	_x86_64_	(8 CPU)
      12:00:01 AM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
      12:10:01 AM  58883728   8225132     12.26   1998956     24.30
      12:20:04 AM  58864012   8244848     12.29   1249296     15.15
      12:30:01 AM  58806528   8302332     12.37   1205484     14.52
      12:40:02 AM  58637300   8471560     12.62   1111648     13.12
      12:50:01 AM  58589064   8519796     12.70   1095492     12.86
      01:00:02 AM  58540448   8568412     12.77   1097672     12.81
      01:10:09 AM  58543808   8565052     12.76   1108248     12.94
      01:20:01 AM  58531484   8577376     12.78   1150284     13.41
      ```
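
      If swap is being used, the paging statistics (`-B` flag) show how hard the kernel is working to reclaim memory; sustained high `majflt/s`, `pgscank/s`, or `pgscand/s` values indicate heavy page scanning and major faults under memory pressure. A small sketch against one of the recorded days:

      ```
      # Report paging activity (page scans, major faults) for a recorded day
      $ sar -B -f /var/log/sa/sa23
      ```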
    
    
    - Additionally, `vmstat` can be used to check for active swapping in real time; non-zero values in the `si` and `so` columns show memory being swapped in and out (a historical view of the same activity is sketched after the output below):
    
    
      ```
      $ vmstat 1 10
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0 12370668 241988      0 636080   39   27    63   192    6    8 23  6 71  0  0
       3  0 12369380 239904      0 635984 1588    0  1588  1444 14312 13248 25 11 64  0  0
       2  0 12369376 238704      0 636264    0    0     0    48 12465 9823 24  8 68  0  0
       2  0 12369376 238936      0 636236   24    0    24  1515 13101 10377 24  8 68  0  0
       5  0 12354472 240280      0 634740    0    0     0   104 13935 11042 27  8 65  0  0
       5  0 12353792 241768      0 634144 1308    0  1404    40 20061 9469 42  8 50  0  0
       4  0 12353704 239588      0 634728  112    0   112    56 14098 10500 26  8 65  0  0
       4  0 12349164 228920      0 634624 4860    0  5308  4107 15228 14442 31 10 59  1  0
       5  0 12369364 261104      0 636192   68    0    68    64 13456 9894 24  9 67  0  0
       2  0 12369356 262804      0 636228   32    0    96  1487 13805 10328 27  8 65  0  0
      ```
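
      A historical view of the same swapping activity is available from the SAR data with the `-W` flag; sustained non-zero `pswpin/s` and `pswpout/s` rates confirm the system was actively swapping rather than merely holding stale pages in swap. A small sketch:

      ```
      # Report pages swapped in/out per second for a recorded day
      $ sar -W -f /var/log/sa/sa23
      ```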
    
    
    - Check CPU and I/O wait statistics. High `%user` or `%system` values indicate a busy system, while a high `%iowait` indicates the system is spending significant time waiting on the underlying storage:
    
    
      ```
       $ sar -f /var/log/sa/sa31 
       Linux 3.10.0-327.13.1.el7.x86_64 (hostname) 	30/08/16 	_x86_64_	(8 CPU)
       11:00:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
       11:10:01 PM     all     18.04      0.00      1.11      0.00      0.00     80.85
       11:20:01 PM     all     18.60      0.00      1.10      0.00      0.00     80.29
       11:30:01 PM     all     18.61      0.00      1.20      0.00      0.00     80.18
       11:40:01 PM     all     12.40      0.00      1.02      0.00      0.00     86.58
       11:50:01 PM     all     13.22      0.00      1.07      0.00      0.00     85.71
       12:00:01 AM     all     15.40      0.00      1.09      0.00      0.00     83.51
       12:10:01 AM     all     13.10      0.00      1.01      0.00      0.00     85.89
       12:20:01 AM     all     12.16      0.00      0.90      0.00      0.00     86.94
       12:30:01 AM     all     11.23      0.00      0.88      0.00      0.00     87.89
       12:40:01 AM     all     10.69      0.00      0.89      0.00      0.00     88.42
       12:50:01 AM     all     10.05      0.00      0.87      0.00      0.00     89.08
       01:00:01 AM     all     11.08      0.00      0.88      0.00      0.00     88.04
       01:10:01 AM     all     10.45      0.00      0.88      0.00      0.00     88.67
       01:20:01 AM     all     11.22      0.00      0.89      0.00      0.00     87.88
      ```
    
    
    - Check memory statistics. High `%memused` and `%commit` values tell us the system is using nearly all of its physical memory and has committed large amounts of memory to processes; the high `%commit` is the bigger concern, since values above 100 mean more memory has been promised to processes than the system's RAM and swap can back (a live check of the same figures is sketched after the output below):
    
    
      ```
      $ sar -r -f /var/log/sa/sa31
      Linux 3.10.0-327.13.1.el7.x86_64 (hostname)    30/08/16    _x86_64_    (8 CPU)
      09:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
      09:10:01 PM   1023648  15229604     93.70       288   4610776  13528944     83.24  13580232   1120648       544
      09:20:01 PM    864644  15388608     94.68       288   4584948  13720124     84.41  13617100   1240576       960
      09:30:01 PM   1409040  14844212     91.33       288   4525700  13388272     82.37  13126048   1185404       776
      09:40:01 PM   1059396  15193856     93.48       288   4548180  13519208     83.18  13461268   1195824      2140
      09:50:01 PM    859520  15393732     94.71       288   4560488  13560828     83.43  13626728   1231500      1412
      10:00:01 PM    841528  15411724     94.82       288   4581020  13569344     83.49  13600780   1274472      1212
      10:10:01 PM    833488  15419764     94.87       288   4589920  13576600     83.53  13576756   1306736      1204
      10:20:01 PM    835664  15417588     94.86       288   4588580  13571876     83.50  13554340   1329356      2148
      10:30:01 PM    837548  15415704     94.85       288   4583996  13574336     83.52  13533756   1348220      1140
      10:40:01 PM    882460  15370792     94.57       288   4546476  13576156     83.53  13416860   1421244       968
      10:50:01 PM    886972  15366280     94.54       288   4533904  13592444     83.63  13372756   1461168      1672
      11:00:01 PM    896632  15356620     94.48       288   4516256  13599060     83.67  13337820   1484964      1164
      11:10:01 PM    902312  15350940     94.45       288   4511796  13598524     83.67  13322988   1494940      1288
      11:20:01 PM    893168  15360084     94.50       288   4521292  13605668     83.71  13313952   1512680      2328
      11:30:01 PM    891516  15361736     94.51       288   4520544  13607328     83.72  13306940   1520700       780
      11:40:01 PM    968532  15284720     94.04       288   4452364  13525196     83.22  13293896   1457212       740
      11:50:02 PM    958824  15294428     94.10       288   4461344  13527452     83.23  13301956   1459784      2316
      12:00:01 AM    941672  15311580     94.21       288   4478684  13534156     83.27  13307012   1471112      2264
      12:10:01 AM    923252  15330000     94.32       288   4487852  13542452     83.32  13323488   1472880      1180
      12:20:01 AM    906868  15346384     94.42       288   4496468  13552036     83.38  13337404   1475744       824
      12:30:01 AM    903196  15350056     94.44       288   4503456  13549280     83.36  13340248   1476536      1020
      12:40:01 AM    921376  15331876     94.33       288   4507356  13526940     83.23  13317556   1481664      1104
      12:50:01 AM   1959584  14293668     87.94       288   3465108  13542232     83.32  12132192   1632860      1100
      01:00:01 AM   1944440  14308812     88.04       288   3473864  13545152     83.34  12148120   1634876       976
      01:10:01 AM   1940304  14312948     88.06       288   3481492  13542556     83.32  12151340   1635924       536
      ...
      08:00:02 AM   1828020  14425232     88.75       288    932732  16350408    100.60  13643748    359984       704
      08:10:02 AM   1774108  14479144     89.08       288    931820  16385972    100.82  13695720    361744       856
      08:20:01 AM   1753800  14499452     89.21       288    945932  16373020    100.74  13695572    369216        52
      08:30:01 AM   1318520  14934732     91.89       288    979276  17004900    104.62  14131260    370928      4712
      08:40:01 AM    736528  15516724     95.47       288    991848  17565704    108.08  14709300    373320       104
      ...
      11:00:01 AM   1369100  14884152     91.58       288   1186288  17055048    104.93  14033856    415840       104
      11:10:01 AM   4155456  12097796     74.43       288    933668  14443288     88.86  11329612    361612       152
      11:20:01 AM    545004  15708248     96.65       288    955972  18021136    110.88  14916812    374708       144
      11:30:01 AM   3329732  12923520     79.51       288    952264  15307972     94.18  12133280    381240       120
      11:40:02 AM    194940  16058312     98.80       288    934724  18386596    113.13  15295636    347148       388
      11:50:01 AM   1423420  14829832     91.24       288    948176  17187260    105.75  14040616    375852       144
      12:00:01 PM   1532760  14720492     90.57       288   1008584  16963440    104.37  13894756    408048       140
      12:10:01 PM   4232940  12020312     73.96       288   1060424  13882564     85.41  11140184    478132       228
      12:20:01 PM   1541036  14712216     90.52       288   1107108  16787496    103.29  13798968    495720       372
      12:30:01 PM    198504  16054748     98.78       288   1129656  18081600    111.25  15172360    463948       168
      ```
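
      The same commit figures can be checked live from `/proc/meminfo` and the overcommit sysctls; `Committed_AS` climbing past `MemTotal` on a swapless system corresponds to the `%commit` values above 100 in the sar output. A minimal sketch:

      ```
      # Committed_AS is the memory currently promised to processes; CommitLimit
      # is only enforced when strict overcommit (vm.overcommit_memory=2) is set
      $ grep -E 'MemTotal|CommitLimit|Committed_AS' /proc/meminfo

      # Current overcommit policy (0 = heuristic, 1 = always, 2 = strict) and ratio
      $ sysctl vm.overcommit_memory vm.overcommit_ratio
      ```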
    
    
    - It is also helpful to check the trend in the SAR data. All of the `sar` commands above are run against data from the 31st of the month; going back to the 1st shows what system usage looked like on that day (a loop for comparing daily averages is sketched after the output below):
    
    
      ```
      $ sar -r -f /var/log/sa/sa01
      06:20:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
      06:30:01 PM   5341184  10912068     67.14       288   2529356  11486444     70.67   9601592    861944       228
      06:40:01 PM   5333916  10919336     67.18       288   2536660  11486548     70.67   9607072    863576       384
      06:50:01 PM   5326768  10926484     67.23       288   2544192  11486400     70.67   9612740    865216       364
      07:00:01 PM   5299832  10953420     67.39       288   2559332  11508340     70.81   9632088    874968       524
      07:10:01 PM   5302428  10950824     67.38       288   2566568  11495836     70.73   9624992    876528       356
      07:20:01 PM   5297108  10956144     67.41       288   2573488  11495436     70.73   9629096    878168       352
      ```
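
      To compare the whole retention period at a glance, the "Average:" line that `sar -r` prints for each day can be collected in a loop. A small sketch over whatever sa files are still present:

      ```
      # Print the average memory statistics for each recorded day
      $ for f in /var/log/sa/sa??; do echo "== $f"; sar -r -f "$f" | grep '^Average:'; done
      ```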
    
  • Lastly, check the amount of memory on the system, and how much is free/available:

    $ grep MemTotal /proc/meminfo 
    MemTotal:       16199236 kB
    
    $ free -m
                  total        used        free      shared  buff/cache   available
    Mem:          15819       15026         174          92         619         134
    Swap:             0           0           0
    
  • To diagnose a hardware problem with memory, the `dmidecode -t memory` command can be used to show the memory devices reported by the system firmware (a check of the EDAC error counters is sketched after the output below):

#  dmidecode -t memory
Handle 0x01A2, DMI type 16, 15 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: None
        Maximum Capacity: 1 TB
        Error Information Handle: Not Provided
        Number Of Devices: 64

Handle 0x01A3, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x01A2
        Error Information Handle: No Error
        Total Width: 32 bits
        Data Width: 32 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: RAM slot #0
        Bank Locator: RAM slot #0
        Type: DRAM
        Type Detail: EDO
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified

Handle 0x01A4, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x01A2
        Error Information Handle: No Error
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: RAM slot #1
        Bank Locator: RAM slot #1
        Type: DRAM
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
...
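
  • If the platform's EDAC (Error Detection and Correction) driver is loaded, corrected and uncorrected memory error counters can also be checked; non-zero counts point toward a failing DIMM (reason 3 in the table above). A minimal sketch, assuming the standard EDAC sysfs layout:

    # Per-memory-controller corrected (ce) and uncorrected (ue) error counts;
    # these files exist only when an EDAC driver is loaded for the platform
    $ grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
                /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null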

Summary

  • To summarize the data above: memory utilization and commit percentages are consistently high, and the kernel is killing processes very frequently. Normally, when OOM killer messages appear, it is prudent to recommend alternatives rather than simply "add more RAM", but in this case the system appears undersized, and the performance gained by tuning would not offset the memory starvation. The sar data from a month earlier already shows a fairly high memory commit percentage (around 70%, versus the 95-113% seen now), so the current memory pressure looks like the result of gradually increasing usage rather than a memory leak or a newly added service.
