How to troubleshoot Out of memory (OOM) killer in Red Hat Enterprise Linux
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 6
- Red Hat OpenStack Platform 9.0
- Red Hat OpenStack Platform 8.0
- Red Hat OpenStack Platform 7.0
Issue
- The Out of memory (OOM) killer is terminating active processes.
- Services on Red Hat OpenStack Platform nodes are randomly dying.
- Memory usage seems to be high on Red Hat OpenStack Platform nodes.
- Available memory on Red Hat OpenStack Platform nodes seems to be low.
- Messages such as `Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child` are noticed in the logs:
```
[Wed Aug 31 16:36:42 2016] Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
[Wed Aug 31 16:36:42 2016] Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
[Wed Aug 31 16:56:25 2016] keystone-all invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Wed Aug 31 16:56:25 2016] keystone-all cpuset=/ mems_allowed=0
[Wed Aug 31 16:56:25 2016] CPU: 2 PID: 88196 Comm: keystone-all Not tainted 3.10.0-327.13.1.el7.x86_64 #1
```
Resolution
- When troubleshooting why the Out of memory (OOM) killer started, one must look at a few factors on the system. Generally, the OOM killer starts for a handful of reasons:
Reason | Probable Cause
-|-
1 | Spike in memory usage based on a load event (additional processes are needed for increased load).
2 | Spike in memory usage based on additional services being added or migrated to the system. (Added another app or started a new service on the system)
3 | Spike in memory usage due to failed hardware such as a DIMM memory module.
4 | Spike in memory usage due to undersizing of hardware resources for the running application(s).
5 | There's a memory leak in a running application.
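As a first check against these causes, it helps to see which processes currently hold the most resident memory. A minimal sketch using standard `ps` options:

```shell
# Show the five processes with the largest resident set size (RSS, in kB),
# largest first -- a quick way to spot the likely memory consumers
ps -eo pid,comm,rss --sort=-rss | head -5
```

Re-running this over time (or comparing against SAR data below) helps distinguish a steady-state heavy consumer from a process whose memory grows without bound.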
- To learn more about which processes were consuming the most memory, along with other memory allocations at the time the server experienced the Out of Memory condition, consider using the Red Hat Out Of Memory Analyzer lab.
Mitigation
- Alleviate the memory constraint by making additional swap space available. This can be done by adding a swap partition or a swap file to the system. A swap partition is preferable because it performs better than a swap file.
- How to create a swap partition in Red Hat Enterprise Linux?
- How to add a swap file to my Red Hat Enterprise Linux system?
- How to prioritize a swap file or partition?
NOTE: swap memory is considerably less performant than actual RAM. These steps are not intended to be permanent, and using swap as a substitute for memory is not recommended. Adding swap serves to buy time to troubleshoot the issue further, and to either free up or add additional memory to the system.
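The swap-file approach from the linked articles can be sketched as follows. This is a minimal illustration only: the path `/tmp/swapfile.demo` and the 64 MiB size are placeholders for this example, and activation requires root:

```shell
# Create a 64 MiB file to use as swap (path and size are examples only;
# a real swap file would live somewhere like /swapfile and be sized to need)
dd if=/dev/zero of=/tmp/swapfile.demo bs=1M count=64
chmod 600 /tmp/swapfile.demo          # swap files must not be world-readable
# Write the swap signature (guarded so the sketch degrades gracefully)
command -v mkswap >/dev/null && mkswap /tmp/swapfile.demo
# As root, activate it now and persist it across reboots:
#   swapon /tmp/swapfile.demo
#   echo '/tmp/swapfile.demo none swap defaults 0 0' >> /etc/fstab
```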
- Additionally, one can increase the frequency of the SAR data recorded on the system. By default, it gathers data every 10 minutes; this can be increased to every minute if desired, which provides more granular performance statistics to aid troubleshooting and trend analysis.
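On RHEL, the collection interval is driven by the `sa1` cron entry; a sketch of the change, assuming the stock `/etc/cron.d/sysstat` layout (paths may vary by release):

```shell
# Default /etc/cron.d/sysstat entry -- collect activity data every 10 minutes:
#   */10 * * * * root /usr/lib64/sa/sa1 1 1
# To collect every minute instead, change the schedule field:
#   * * * * * root /usr/lib64/sa/sa1 1 1
```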
Troubleshooting
- Check to see how often the Out of memory (OOM) killer is running. This gives you a gauge of how frequently the memory constraint is being hit. In this example, the OOM killer is hitting a process as often as every 10 minutes, which is quite frequent:
```
$ egrep 'Out of memory:' /var/log/messages
Aug 31 10:57:27 hostname kernel: Out of memory: Kill process 121952 (glance-api) score 72 or sacrifice child
Aug 31 10:57:29 hostname kernel: Out of memory: Kill process 55333 (keystone-all) score 79 or sacrifice child
Aug 31 10:57:31 hostname kernel: Out of memory: Kill process 4106 (keystone-all) score 98 or sacrifice child
Aug 31 10:57:34 hostname kernel: Out of memory: Kill process 54961 (keystone-all) score 130 or sacrifice child
Aug 31 10:57:41 hostname kernel: Out of memory: Kill process 86345 (keystone-all) score 196 or sacrifice child
Aug 31 10:58:54 hostname kernel: Out of memory: Kill process 121731 (keystone-all) score 260 or sacrifice child
Aug 31 11:09:44 hostname kernel: Out of memory: Kill process 9495 (keystone-all) score 261 or sacrifice child
Aug 31 11:12:15 hostname kernel: Out of memory: Kill process 23113 (keystone-all) score 261 or sacrifice child
Aug 31 14:09:27 hostname kernel: Out of memory: Kill process 43547 (keystone-all) score 261 or sacrifice child
Aug 31 14:21:15 hostname kernel: Out of memory: Kill process 43743 (keystone-all) score 181 or sacrifice child
Aug 31 14:29:45 hostname kernel: Out of memory: Kill process 58542 (keystone-all) score 209 or sacrifice child
Aug 31 14:40:27 hostname kernel: Out of memory: Kill process 43545 (keystone-all) score 105 or sacrifice child
Aug 31 14:47:30 hostname kernel: Out of memory: Kill process 56050 (keystone-all) score 125 or sacrifice child
Aug 31 14:55:00 hostname kernel: Out of memory: Kill process 35892 (keystone-all) score 152 or sacrifice child
Aug 31 15:07:52 hostname kernel: Out of memory: Kill process 75726 (keystone-all) score 198 or sacrifice child
Aug 31 15:31:16 hostname kernel: Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
Aug 31 15:50:59 hostname kernel: Out of memory: Kill process 88196 (keystone-all) score 507 or sacrifice child
Sep 2 12:35:54 hostname kernel: Out of memory: Kill process 52446 (keystone-all) score 473 or sacrifice child
Sep 2 18:43:33 hostname kernel: Out of memory: Kill process 97213 (keystone-all) score 166 or sacrifice child
Sep 2 18:54:48 hostname kernel: Out of memory: Kill process 31716 (keystone-all) score 178 or sacrifice child
Sep 2 19:09:12 hostname kernel: Out of memory: Kill process 45010 (keystone-all) score 148 or sacrifice child
Sep 2 19:21:41 hostname kernel: Out of memory: Kill process 7073 (keystone-all) score 191 or sacrifice child
Sep 2 19:47:21 hostname kernel: Out of memory: Kill process 116139 (keystone-all) score 279 or sacrifice child
Sep 2 20:33:12 hostname kernel: Out of memory: Kill process 108621 (keystone-all) score 460 or sacrifice child
```
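To quantify the frequency rather than eyeball it, the kill events can be tallied per hour. A sketch, shown against a small inline sample of such messages so it is self-contained; point it at the real `/var/log/messages` in practice:

```shell
# Inline sample of OOM kill messages (for illustration only)
cat > /tmp/oom_sample.log <<'EOF'
Aug 31 10:57:27 hostname kernel: Out of memory: Kill process 121952 (glance-api) score 72 or sacrifice child
Aug 31 10:57:29 hostname kernel: Out of memory: Kill process 55333 (keystone-all) score 79 or sacrifice child
Aug 31 11:09:44 hostname kernel: Out of memory: Kill process 9495 (keystone-all) score 261 or sacrifice child
EOF

# Tally kills per hour: take "Mon DD HH" from each matching line and count
awk '/Out of memory:/ {split($3, t, ":"); print $1, $2, t[1]":00"}' /tmp/oom_sample.log \
  | sort | uniq -c
```

Here the sample yields two kills in the 10:00 hour and one in the 11:00 hour; spikes in this tally line up with load events in the SAR data below.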
- Check how much memory the killed processes are consuming. Here, the killed processes are consuming anywhere from 1-8 GB of memory:
```
$ egrep 'total-vm' /var/log/messages
Aug 31 10:57:27 hostname kernel: Killed process 121952 (glance-api) total-vm:5698360kB, anon-rss:1174988kB, file-rss:0kB
Aug 31 10:57:29 hostname kernel: Killed process 55333 (keystone-all) total-vm:1681340kB, anon-rss:1292656kB, file-rss:564kB
Aug 31 10:57:31 hostname kernel: Killed process 4106 (keystone-all) total-vm:1979684kB, anon-rss:1591104kB, file-rss:4kB
Aug 31 10:57:34 hostname kernel: Killed process 54961 (keystone-all) total-vm:2496376kB, anon-rss:2108056kB, file-rss:568kB
Aug 31 10:57:41 hostname kernel: Killed process 86345 (keystone-all) total-vm:3580248kB, anon-rss:3191708kB, file-rss:348kB
Aug 31 10:58:54 hostname kernel: Killed process 121731 (keystone-all) total-vm:4620304kB, anon-rss:4231956kB, file-rss:0kB
Aug 31 11:09:44 hostname kernel: Killed process 9495 (keystone-all) total-vm:4626764kB, anon-rss:4238416kB, file-rss:496kB
Aug 31 11:12:15 hostname kernel: Killed process 23113 (keystone-all) total-vm:4631276kB, anon-rss:4242928kB, file-rss:0kB
Aug 31 14:09:27 hostname kernel: Killed process 43547 (keystone-all) total-vm:4631904kB, anon-rss:4243560kB, file-rss:0kB
Aug 31 14:21:15 hostname kernel: Killed process 43743 (keystone-all) total-vm:3336616kB, anon-rss:2943280kB, file-rss:72kB
Aug 31 14:29:45 hostname kernel: Killed process 58542 (keystone-all) total-vm:3788528kB, anon-rss:3395048kB, file-rss:60kB
Aug 31 14:40:27 hostname kernel: Killed process 43545 (keystone-all) total-vm:2096368kB, anon-rss:1705116kB, file-rss:0kB
Aug 31 14:47:30 hostname kernel: Killed process 56050 (keystone-all) total-vm:2430728kB, anon-rss:2040104kB, file-rss:0kB
Aug 31 14:55:00 hostname kernel: Killed process 35892 (keystone-all) total-vm:2866124kB, anon-rss:2473448kB, file-rss:68kB
Aug 31 15:07:52 hostname kernel: Killed process 75726 (keystone-all) total-vm:3605284kB, anon-rss:3211852kB, file-rss:68kB
Aug 31 15:31:16 hostname kernel: Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
Aug 31 15:50:59 hostname kernel: Killed process 88196 (keystone-all) total-vm:8600652kB, anon-rss:8224512kB, file-rss:0kB
```
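The per-process sizes can likewise be pulled out programmatically, converting `anon-rss` from kB to MB. A sketch against an inline sample of such `Killed process` lines:

```shell
cat > /tmp/oom_killed_sample.log <<'EOF'
Aug 31 10:57:29 hostname kernel: Killed process 55333 (keystone-all) total-vm:1681340kB, anon-rss:1292656kB, file-rss:564kB
Aug 31 15:50:59 hostname kernel: Killed process 88196 (keystone-all) total-vm:8600652kB, anon-rss:8224512kB, file-rss:0kB
EOF

# Print each killed process and its anonymous resident memory in MB
awk 'match($0, /anon-rss:[0-9]+/) {
       kb = substr($0, RSTART + 9, RLENGTH - 9)   # digits after "anon-rss:"
       printf "%s %d MB\n", $9, kb / 1024         # $9 is the (name) field
     }' /tmp/oom_killed_sample.log
```

Sorting this output numerically quickly shows whether one service dominates or the pressure is spread across many processes.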
- For further analysis, check the system activity reporter (SAR) data to see what it has captured about the OS. In this example, the memory constraint seen above is consistent with the SAR data: low CPU utilization (good), low iowait (good), but very high memory utilization and commit percentages (a concern). During high-usage periods, `%commit` exceeds 100, indicating the system is overcommitting memory and is memory starved:
- Check swap statistics with the `-S` flag. A high `%swpused` indicates swapping and a memory shortage (these columns are not present if there is no swap partition or file):

```
$ sar -S -f /var/log/sa/sa23
Linux 3.10.0-327.22.2.el7.x86_64 (hostname)  23/08/16  _x86_64_  (8 CPU)

12:00:01 AM kbswpfree kbswpused %swpused kbswpcad %swpcad
12:10:01 AM  58883728   8225132    12.26  1998956   24.30
12:20:04 AM  58864012   8244848    12.29  1249296   15.15
12:30:01 AM  58806528   8302332    12.37  1205484   14.52
12:40:02 AM  58637300   8471560    12.62  1111648   13.12
12:50:01 AM  58589064   8519796    12.70  1095492   12.86
01:00:02 AM  58540448   8568412    12.77  1097672   12.81
01:10:09 AM  58543808   8565052    12.76  1108248   12.94
01:20:01 AM  58531484   8577376    12.78  1150284   13.41
```

- Additionally, `vmstat` can be used to check active swapping live. Nonzero values in the `si` and `so` columns indicate memory being swapped in and out:

```
$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd   free buff  cache   si   so    bi    bo    in    cs us sy id wa st
 2  0 12370668 241988    0 636080   39   27    63   192     6     8 23  6 71  0  0
 3  0 12369380 239904    0 635984 1588    0  1588  1444 14312 13248 25 11 64  0  0
 2  0 12369376 238704    0 636264    0    0     0    48 12465  9823 24  8 68  0  0
 2  0 12369376 238936    0 636236   24    0    24  1515 13101 10377 24  8 68  0  0
 5  0 12354472 240280    0 634740    0    0     0   104 13935 11042 27  8 65  0  0
 5  0 12353792 241768    0 634144 1308    0  1404    40 20061  9469 42  8 50  0  0
 4  0 12353704 239588    0 634728  112    0   112    56 14098 10500 26  8 65  0  0
 4  0 12349164 228920    0 634624 4860    0  5308  4107 15228 14442 31 10 59  1  0
 5  0 12369364 261104    0 636192   68    0    68    64 13456  9894 24  9 67  0  0
 2  0 12369356 262804    0 636228   32    0    96  1487 13805 10328 27  8 65  0  0
```

- Check CPU and iowait statistics. High `%user` or `%system` indicates a busy system; high `%iowait` indicates the system is spending significant time waiting on underlying storage:

```
$ sar -f /var/log/sa/sa31
Linux 3.10.0-327.13.1.el7.x86_64 (hostname)  30/08/16  _x86_64_  (8 CPU)

11:00:01 PM CPU %user %nice %system %iowait %steal %idle
11:10:01 PM all 18.04  0.00    1.11    0.00   0.00 80.85
11:20:01 PM all 18.60  0.00    1.10    0.00   0.00 80.29
11:30:01 PM all 18.61  0.00    1.20    0.00   0.00 80.18
11:40:01 PM all 12.40  0.00    1.02    0.00   0.00 86.58
11:50:01 PM all 13.22  0.00    1.07    0.00   0.00 85.71
12:00:01 AM all 15.40  0.00    1.09    0.00   0.00 83.51
12:10:01 AM all 13.10  0.00    1.01    0.00   0.00 85.89
12:20:01 AM all 12.16  0.00    0.90    0.00   0.00 86.94
12:30:01 AM all 11.23  0.00    0.88    0.00   0.00 87.89
12:40:01 AM all 10.69  0.00    0.89    0.00   0.00 88.42
12:50:01 AM all 10.05  0.00    0.87    0.00   0.00 89.08
01:00:01 AM all 11.08  0.00    0.88    0.00   0.00 88.04
01:10:01 AM all 10.45  0.00    0.88    0.00   0.00 88.67
01:20:01 AM all 11.22  0.00    0.89    0.00   0.00 87.88
```

- Check memory statistics. High `%memused` and `%commit` values show the system is using nearly all of its memory and has committed nearly all of it to processes; a high `%commit` is the bigger concern:

```
$ sar -r -f /var/log/sa/sa31
Linux 3.10.0-327.13.1.el7.x86_64 (hostname)  30/08/16  _x86_64_  (8 CPU)

09:00:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
09:10:01 PM   1023648  15229604    93.70       288  4610776 13528944   83.24 13580232 1120648     544
09:20:01 PM    864644  15388608    94.68       288  4584948 13720124   84.41 13617100 1240576     960
09:30:01 PM   1409040  14844212    91.33       288  4525700 13388272   82.37 13126048 1185404     776
09:40:01 PM   1059396  15193856    93.48       288  4548180 13519208   83.18 13461268 1195824    2140
09:50:01 PM    859520  15393732    94.71       288  4560488 13560828   83.43 13626728 1231500    1412
10:00:01 PM    841528  15411724    94.82       288  4581020 13569344   83.49 13600780 1274472    1212
10:10:01 PM    833488  15419764    94.87       288  4589920 13576600   83.53 13576756 1306736    1204
10:20:01 PM    835664  15417588    94.86       288  4588580 13571876   83.50 13554340 1329356    2148
10:30:01 PM    837548  15415704    94.85       288  4583996 13574336   83.52 13533756 1348220    1140
10:40:01 PM    882460  15370792    94.57       288  4546476 13576156   83.53 13416860 1421244     968
10:50:01 PM    886972  15366280    94.54       288  4533904 13592444   83.63 13372756 1461168    1672
11:00:01 PM    896632  15356620    94.48       288  4516256 13599060   83.67 13337820 1484964    1164
11:10:01 PM    902312  15350940    94.45       288  4511796 13598524   83.67 13322988 1494940    1288
11:20:01 PM    893168  15360084    94.50       288  4521292 13605668   83.71 13313952 1512680    2328
11:30:01 PM    891516  15361736    94.51       288  4520544 13607328   83.72 13306940 1520700     780
11:40:01 PM    968532  15284720    94.04       288  4452364 13525196   83.22 13293896 1457212     740
11:50:02 PM    958824  15294428    94.10       288  4461344 13527452   83.23 13301956 1459784    2316
12:00:01 AM    941672  15311580    94.21       288  4478684 13534156   83.27 13307012 1471112    2264
12:10:01 AM    923252  15330000    94.32       288  4487852 13542452   83.32 13323488 1472880    1180
12:20:01 AM    906868  15346384    94.42       288  4496468 13552036   83.38 13337404 1475744     824
12:30:01 AM    903196  15350056    94.44       288  4503456 13549280   83.36 13340248 1476536    1020
12:40:01 AM    921376  15331876    94.33       288  4507356 13526940   83.23 13317556 1481664    1104
12:50:01 AM   1959584  14293668    87.94       288  3465108 13542232   83.32 12132192 1632860    1100
01:00:01 AM   1944440  14308812    88.04       288  3473864 13545152   83.34 12148120 1634876     976
01:10:01 AM   1940304  14312948    88.06       288  3481492 13542556   83.32 12151340 1635924     536
...
08:00:02 AM   1828020  14425232    88.75       288   932732 16350408  100.60 13643748  359984     704
08:10:02 AM   1774108  14479144    89.08       288   931820 16385972  100.82 13695720  361744     856
08:20:01 AM   1753800  14499452    89.21       288   945932 16373020  100.74 13695572  369216      52
08:30:01 AM   1318520  14934732    91.89       288   979276 17004900  104.62 14131260  370928    4712
08:40:01 AM    736528  15516724    95.47       288   991848 17565704  108.08 14709300  373320     104
...
11:00:01 AM   1369100  14884152    91.58       288  1186288 17055048  104.93 14033856  415840     104
11:10:01 AM   4155456  12097796    74.43       288   933668 14443288   88.86 11329612  361612     152
11:20:01 AM    545004  15708248    96.65       288   955972 18021136  110.88 14916812  374708     144
11:30:01 AM   3329732  12923520    79.51       288   952264 15307972   94.18 12133280  381240     120
11:40:02 AM    194940  16058312    98.80       288   934724 18386596  113.13 15295636  347148     388
11:50:01 AM   1423420  14829832    91.24       288   948176 17187260  105.75 14040616  375852     144
12:00:01 PM   1532760  14720492    90.57       288  1008584 16963440  104.37 13894756  408048     140
12:10:01 PM   4232940  12020312    73.96       288  1060424 13882564   85.41 11140184  478132     228
12:20:01 PM   1541036  14712216    90.52       288  1107108 16787496  103.29 13798968  495720     372
12:30:01 PM    198504  16054748    98.78       288  1129656 18081600  111.25 15172360  463948     168
```

- It is also helpful to check the trend in SAR activity. All of the `sar` commands above read data from the 31st of the month; going back to the 1st shows what system usage looked like on that day:

```
$ sar -r -f /var/log/sa/sa01

06:20:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
06:30:01 PM   5341184  10912068    67.14       288  2529356 11486444   70.67  9601592  861944     228
06:40:01 PM   5333916  10919336    67.18       288  2536660 11486548   70.67  9607072  863576     384
06:50:01 PM   5326768  10926484    67.23       288  2544192 11486400   70.67  9612740  865216     364
07:00:01 PM   5299832  10953420    67.39       288  2559332 11508340   70.81  9632088  874968     524
07:10:01 PM   5302428  10950824    67.38       288  2566568 11495836   70.73  9624992  876528     356
07:20:01 PM   5297108  10956144    67.41       288  2573488 11495436   70.73  9629096  878168     352
```
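A live approximation of sar's `%commit` (committed memory relative to RAM plus swap) can be computed from `/proc/meminfo`; a sketch:

```shell
# Committed_AS as a percentage of MemTotal + SwapTotal -- the same ratio
# sar reports as %commit; values near or above 100 indicate overcommit
awk '/^MemTotal/ {m=$2} /^SwapTotal/ {s=$2} /^Committed_AS/ {c=$2}
     END {printf "%.2f%%\n", c * 100 / (m + s)}' /proc/meminfo
```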
- Lastly, check the amount of memory on the system, and how much is free/available:

```
$ grep MemTotal /proc/meminfo
MemTotal:       16199236 kB

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15819       15026         174          92         619         134
Swap:             0           0           0
```
- To diagnose a hardware problem with memory, the `dmidecode -t memory` command can be used to show the memory information of the system:

```
# dmidecode -t memory
Handle 0x01A2, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: None
	Maximum Capacity: 1 TB
	Error Information Handle: Not Provided
	Number Of Devices: 64

Handle 0x01A3, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x01A2
	Error Information Handle: No Error
	Total Width: 32 bits
	Data Width: 32 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: RAM slot #0
	Bank Locator: RAM slot #0
	Type: DRAM
	Type Detail: EDO
	Speed: Unknown
	Manufacturer: Not Specified
	Serial Number: Not Specified
	Asset Tag: Not Specified
	Part Number: Not Specified

Handle 0x01A4, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x01A2
	Error Information Handle: No Error
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: RAM slot #1
	Bank Locator: RAM slot #1
	Type: DRAM
	Type Detail: Unknown
	Speed: Unknown
	Manufacturer: Not Specified
	Serial Number: Not Specified
	Asset Tag: Not Specified
	Part Number: Not Specified
...
```
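To summarize which slots are populated, the `dmidecode` output can be filtered on its `Size:` lines. A sketch, run here against a small saved sample (real `dmidecode` requires root, and indents these fields):

```shell
# Saved sample of dmidecode -t memory output (two slots, one populated);
# the real output would be captured with: dmidecode -t memory > file
cat > /tmp/dmidecode_sample.txt <<'EOF'
Memory Device
Size: 16384 MB
Memory Device
Size: No Module Installed
EOF

# Count installed modules vs. empty slots
echo "installed: $(grep -c '^Size: [0-9]' /tmp/dmidecode_sample.txt)"
echo "empty:     $(grep -c '^Size: No Module' /tmp/dmidecode_sample.txt)"
```

Comparing the installed count and sizes against the vendor's spec for the machine can reveal a DIMM that has dropped out, which would show up as an unexpected "No Module Installed" slot.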
Summary
- To summarize the data above: memory utilization and commit percentages are consistently high, and the kernel is killing processes very frequently. Normally, when seeing OOM killer messages, it is prudent to recommend alternatives rather than simply "add more RAM", but in this case the system appears undersized, and the performance gained by tuning would not offset the memory starvation. The SAR data from a month earlier already shows a high memory commit percentage (around 70%, versus 95-100% now), so the high memory use appears to be a trend of steadily increasing usage rather than a memory leak or a newly added service.
Further tuning
- Further memory tuning can be investigated if desired.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.