OpenShift 4 node with cgroup out of memory and oom-kill errors


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • cgroup
  • OOM

Issue

  • An OpenShift 4 node goes into NotReady state and becomes unresponsive.
  • "Memory cgroup out of memory" and "oom-kill" messages appear in the node's logs.

Resolution

Increase the memory resource limit for the failing pod(s). See the Diagnostic Steps section below to identify the affected pods.
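As a sketch, the limit is raised in the container spec of the affected workload; the sizes below are placeholder values, not a recommendation (pick a limit above the container's observed peak usage):

```yaml
# Fragment of a Deployment/Pod container spec; the sizes are examples only.
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # raise this if the container is OOM-killed at the current limit
```

The same change can also be applied from the command line with oc set resources, e.g. oc set resources deployment/<name> --limits=memory=512Mi.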

Root Cause

Memory cgroup OOM messages are logged when a container reaches its configured memory limit. When this happens, other processes in the pod can end up in uninterruptible sleep (D state).

Diagnostic Steps

  • Messages like the following can be seen in the journal:

    $ journalctl --no-pager
    [...]
    Nov 07 17:01:42 a-b-c.dmz kernel: js invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-998
    Nov 07 17:01:42 a-b-c.dmz kernel: js cpuset=crio-[container_id].scope mems_allowed=0-1
    Nov 07 17:01:42 a-b-c.dmz kernel: CPU: 25 PID: 1117256 Comm: js Tainted: G        W        --------- -  - 4.18.0-193.23.1.el8_2.x86_64 #1
    Nov 07 17:01:42 a-b-c.dmz kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/09/2020
    Nov 07 17:01:42 a-b-c.dmz kernel: Call Trace:
    Nov 07 17:01:42 a-b-c.dmz kernel:  dump_stack+0x5c/0x80
    Nov 07 17:01:42 a-b-c.dmz kernel:  dump_header+0x6e/0x27a
    Nov 07 17:01:42 a-b-c.dmz kernel:  oom_kill_process.cold.28+0xb/0x10
    Nov 07 17:01:42 a-b-c.dmz kernel:  out_of_memory+0x1ba/0x490
    Nov 07 17:01:42 a-b-c.dmz kernel:  mem_cgroup_out_of_memory+0x49/0x80
    Nov 07 17:01:42 a-b-c.dmz kernel:  try_charge+0x6fa/0x780
    Nov 07 17:01:42 a-b-c.dmz kernel:  ? __alloc_pages_nodemask+0xef/0x280
    Nov 07 17:01:42 a-b-c.dmz kernel:  mem_cgroup_try_charge+0x8b/0x1a0
    Nov 07 17:01:42 a-b-c.dmz kernel:  mem_cgroup_try_charge_delay+0x1c/0x40
    Nov 07 17:01:42 a-b-c.dmz kernel:  do_anonymous_page+0xb5/0x360
    Nov 07 17:01:42 a-b-c.dmz kernel:  __handle_mm_fault+0x662/0x6a0
    Nov 07 17:01:42 a-b-c.dmz kernel:  handle_mm_fault+0xda/0x200
    Nov 07 17:01:42 a-b-c.dmz kernel:  __do_page_fault+0x22d/0x4e0
    Nov 07 17:01:42 a-b-c.dmz kernel:  do_page_fault+0x32/0x110
    Nov 07 17:01:42 a-b-c.dmz kernel:  ? page_fault+0x8/0x30
    Nov 07 17:01:42 a-b-c.dmz kernel:  page_fault+0x1e/0x30
    Nov 07 17:01:42 a-b-c.dmz kernel: RIP: 0033:0x5647d94534ed
    Nov 07 17:01:42 a-b-c.dmz kernel: Code: 89 de ba 01 00 00 00 48 c1 e3 0c 48 c1 ee 05 d3 e2 48 01 fb f7 d2 21 94 b7 a0 ff 0f 00 48 89 df be 00 10 00 00 e8 53 ac 03 00 <c7> 03 00 00 00 00 48 89 d8 48 c7 43 08 00 00 00 00 48 c7 43 18 1d
    Nov 07 17:01:42 a-b-c.dmz kernel: RSP: 002b:00007f48f2b77020 EFLAGS: 00010282
    Nov 07 17:01:42 a-b-c.dmz kernel: RAX: 0000000000000026 RBX: 00007f48f2026000 RCX: 0000000000000026
    Nov 07 17:01:42 a-b-c.dmz kernel: RDX: 00000000ffffffbf RSI: 0000000000001000 RDI: 00007f48f2026000
    Nov 07 17:01:42 a-b-c.dmz kernel: RBP: 00007f48f2b77030 R08: 00007f48f2b770e0 R09: 0000000000000026
    Nov 07 17:01:42 a-b-c.dmz kernel: R10: 0000000000000000 R11: 00005647dcef9200 R12: 0000000000000010
    Nov 07 17:01:42 a-b-c.dmz kernel: R13: 00005647dcb80000 R14: 00007f48f2b770e0 R15: 0000000000000010
    Nov 07 17:01:42 a-b-c.dmz kernel: Task in /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/crio-[container_id].scope killed as a result of limit of /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice
    Nov 07 17:01:42 a-b-c.dmz kernel: memory: usage 131072kB, limit 131072kB, failcnt 19155420
    Nov 07 17:01:42 a-b-c.dmz kernel: memory+swap: usage 131072kB, limit 9007199254740988kB, failcnt 0
    Nov 07 17:01:42 a-b-c.dmz kernel: kmem: usage 34884kB, limit 9007199254740988kB, failcnt 0
    [...]
    Nov 07 17:11:21 a-b-c.dmz kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Nov 07 17:11:21 a-b-c.dmz kernel: mongod          D    0 1135679 1135509 0x00004084
    Nov 07 17:11:21 a-b-c.dmz kernel: Call Trace:
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? __schedule+0x24f/0x650
    Nov 07 17:11:21 a-b-c.dmz kernel:  schedule+0x2f/0xa0
    Nov 07 17:11:21 a-b-c.dmz kernel:  io_schedule+0x12/0x40
    Nov 07 17:11:21 a-b-c.dmz kernel:  wait_on_page_bit_common+0x13d/0x340
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? page_cache_tree_insert+0xe0/0xe0
    Nov 07 17:11:21 a-b-c.dmz kernel:  __filemap_fdatawait_range+0xe4/0x140
    Nov 07 17:11:21 a-b-c.dmz kernel:  filemap_write_and_wait_range+0x45/0x80
    Nov 07 17:11:21 a-b-c.dmz kernel:  nfs_file_fsync+0x44/0x1e0 [nfs]
    Nov 07 17:11:21 a-b-c.dmz kernel:  do_fsync+0x38/0x70
    Nov 07 17:11:21 a-b-c.dmz kernel:  __x64_sys_fdatasync+0x13/0x20
    Nov 07 17:11:21 a-b-c.dmz kernel:  do_syscall_64+0x5b/0x1a0
    Nov 07 17:11:21 a-b-c.dmz kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
    Nov 07 17:11:21 a-b-c.dmz kernel: RIP: 0033:0x7f3018bf02e7
    Nov 07 17:11:21 a-b-c.dmz kernel: Code: Bad RIP value.
    Nov 07 17:11:21 a-b-c.dmz kernel: RSP: 002b:00007ffd871aab00 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
    Nov 07 17:11:21 a-b-c.dmz kernel: RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007f3018bf02e7
    Nov 07 17:11:21 a-b-c.dmz kernel: RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000011
    Nov 07 17:11:21 a-b-c.dmz kernel: RBP: 00007ffd871aab40 R08: 0000000000000000 R09: 0000000000000000
    Nov 07 17:11:21 a-b-c.dmz kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000555c46ecae80
    Nov 07 17:11:21 a-b-c.dmz kernel: R13: 0000555c43e7124f R14: 0000555c4706c000 R15: 0000000000000000
    [...]
    Nov 07 17:11:21 a-b-c.dmz kernel: INFO: task js:1172585 blocked for more than 120 seconds.
    Nov 07 17:11:21 a-b-c.dmz kernel:       Tainted: G        W        --------- -  - 4.18.0-193.23.1.el8_2.x86_64 #1
    Nov 07 17:11:21 a-b-c.dmz kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Nov 07 17:11:21 a-b-c.dmz kernel: js              D    0 1172585      1 0x00000084
    Nov 07 17:11:21 a-b-c.dmz kernel: Call Trace:
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? __schedule+0x24f/0x650
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? __switch_to_asm+0x35/0x70
    Nov 07 17:11:21 a-b-c.dmz kernel:  schedule+0x2f/0xa0
    Nov 07 17:11:21 a-b-c.dmz kernel:  io_schedule+0x12/0x40
    Nov 07 17:11:21 a-b-c.dmz kernel:  wait_on_page_bit+0x137/0x230
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? page_cache_tree_insert+0xe0/0xe0
    Nov 07 17:11:21 a-b-c.dmz kernel:  shrink_page_list+0xc03/0xc60
    Nov 07 17:11:21 a-b-c.dmz kernel:  shrink_inactive_list+0x207/0x590
    Nov 07 17:11:21 a-b-c.dmz kernel:  shrink_node_memcg+0x1fd/0x7a0
    Nov 07 17:11:21 a-b-c.dmz kernel:  shrink_node+0xce/0x3f0
    Nov 07 17:11:21 a-b-c.dmz kernel:  do_try_to_free_pages+0xc3/0x360
    Nov 07 17:11:21 a-b-c.dmz kernel:  try_to_free_mem_cgroup_pages+0xf9/0x210
    Nov 07 17:11:21 a-b-c.dmz kernel:  try_charge+0x192/0x780
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? __alloc_pages_nodemask+0xef/0x280
    Nov 07 17:11:21 a-b-c.dmz kernel:  mem_cgroup_try_charge+0x8b/0x1a0
    Nov 07 17:11:21 a-b-c.dmz kernel:  mem_cgroup_try_charge_delay+0x1c/0x40
    Nov 07 17:11:21 a-b-c.dmz kernel:  do_anonymous_page+0xb5/0x360
    Nov 07 17:11:21 a-b-c.dmz kernel:  __handle_mm_fault+0x662/0x6a0
    Nov 07 17:11:21 a-b-c.dmz kernel:  handle_mm_fault+0xda/0x200
    Nov 07 17:11:21 a-b-c.dmz kernel:  __do_page_fault+0x22d/0x4e0
    Nov 07 17:11:21 a-b-c.dmz kernel:  do_page_fault+0x32/0x110
    Nov 07 17:11:21 a-b-c.dmz kernel:  ? page_fault+0x8/0x30
    Nov 07 17:11:21 a-b-c.dmz kernel:  page_fault+0x1e/0x30
    Nov 07 17:11:22 a-b-c.dmz kernel: RIP: 0033:0x55f3831f6c04
    Nov 07 17:11:22 a-b-c.dmz kernel: Code: Bad RIP value.
    Nov 07 17:11:22 a-b-c.dmz kernel: RSP: 002b:00007feae3369220 EFLAGS: 00010202
    Nov 07 17:11:22 a-b-c.dmz kernel: RAX: 000000000000006e RBX: 000055f387599000 RCX: 0000000000009801
    Nov 07 17:11:22 a-b-c.dmz kernel: RDX: 000000000001f37c RSI: 000055f38714b190 RDI: 0000000000009800
    Nov 07 17:11:22 a-b-c.dmz kernel: RBP: 00007feae3369270 R08: 000055f3867e6100 R09: 000055f383ecb840
    Nov 07 17:11:22 a-b-c.dmz kernel: R10: 000000000001f37c R11: 000000000000f9bd R12: 000000000000f9bd
    Nov 07 17:11:22 a-b-c.dmz kernel: R13: 000055f383ecb840 R14: 000055f387149000 R15: 00007feae3369288
    [...]
    Nov 07 17:13:24 a-b-c.dmz kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Nov 07 17:13:24 a-b-c.dmz kernel: postgres        D    0 1131146 1130374 0x00004084
    Nov 07 17:13:24 a-b-c.dmz kernel: Call Trace:
    Nov 07 17:13:24 a-b-c.dmz kernel:  ? __schedule+0x24f/0x650
    Nov 07 17:13:24 a-b-c.dmz kernel:  schedule+0x2f/0xa0
    Nov 07 17:13:24 a-b-c.dmz kernel:  io_schedule+0x12/0x40
    Nov 07 17:13:24 a-b-c.dmz kernel:  wait_on_page_bit_common+0x13d/0x340
    Nov 07 17:13:24 a-b-c.dmz kernel:  ? page_cache_tree_insert+0xe0/0xe0
    Nov 07 17:13:24 a-b-c.dmz kernel:  __filemap_fdatawait_range+0xe4/0x140
    Nov 07 17:13:24 a-b-c.dmz kernel:  filemap_write_and_wait_range+0x45/0x80
    Nov 07 17:13:24 a-b-c.dmz kernel:  nfs_file_fsync+0x44/0x1e0 [nfs]
    Nov 07 17:13:24 a-b-c.dmz kernel:  do_fsync+0x38/0x70
    Nov 07 17:13:24 a-b-c.dmz kernel:  __x64_sys_fsync+0x10/0x20
    Nov 07 17:13:24 a-b-c.dmz kernel:  do_syscall_64+0x5b/0x1a0
    Nov 07 17:13:24 a-b-c.dmz kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
    Nov 07 17:13:24 a-b-c.dmz kernel: RIP: 0033:0x7fde2cf57a71
    Nov 07 17:13:24 a-b-c.dmz kernel: Code: Bad RIP value.
    Nov 07 17:13:24 a-b-c.dmz kernel: RSP: 002b:00007ffdcced6398 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    Nov 07 17:13:24 a-b-c.dmz kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fde2cf57a71
    Nov 07 17:13:24 a-b-c.dmz kernel: RDX: 00007fde292c6f10 RSI: 00007ffdcced63d0 RDI: 0000000000000004
    Nov 07 17:13:24 a-b-c.dmz kernel: RBP: 00007ffdcced8c10 R08: 0000000000000000 R09: 0000000000000000
    Nov 07 17:13:24 a-b-c.dmz kernel: R10: 0000000000000180 R11: 0000000000000246 R12: 00007ffdcced63d0
    Nov 07 17:13:24 a-b-c.dmz kernel: R13: 00007fde298525b8 R14: 00007ffdcced87d0 R15: 0000000001000000
    [...]
    

    The messages above indicate that /kubepods.slice/kubepods-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice reached its cgroup memory limit of 128MB (131072kB). As a result, some tasks in that pod got stuck in D state (TASK_UNINTERRUPTIBLE), producing task <process>:<pid> blocked for more than 120 seconds messages.
    Refer to NotReady node with high load and many D state processes in OpenShift 4 for additional troubleshooting of D state processes.
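To see which processes are currently in D state on the node, a minimal sketch (run from a shell on the affected node, e.g. via oc debug node/<node>; the command itself is standard procps):

```shell
# Print the PID and command name of every task in uninterruptible sleep (state D).
# The list is empty when no task is currently blocked.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'
```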

  • Check for memory cgroup OOM messages:

    $ dmesg | grep -B1 "out of memory\|memory+swap:"
    

    If the affected process belongs to a container, its memory usage and limit are shown in the output; they are not shown for processes outside of containers:

    [1000.0000] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-[container_id].scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxxxxxx_xxxx_xxxx_xxxx_xxxxxxxxxxxx.slice/crio-[container_id].scope,task=php,pid=321345,uid=0
    [1000.0000] Memory cgroup out of memory: Killed process 321345 (php) total-vm:10136868kB, anon-rss:9428580kB, file-rss:18880kB, shmem-rss:0kB, UID:0 pgtables:19608kB oom_score_adj:999
    --
    [1000.0000] memory: usage 10485760kB, limit 10485760kB, failcnt 199966222
    [1000.0000] memory+swap: usage 10485760kB, limit 9007199254740988kB, failcnt 0
    

    Note: If the affected process is not from a container, refer to Out of memory (OOM) killer in memory cgroup.

  • The following command shows the ID of the affected container and the number of occurrences:

    $ dmesg | grep "oom-kill:" | awk -F ',' '{split($1,const,"="); split($3,id,"[-.]"); print const[2] " " id[2]}' | sort | uniq -c | sort -nr
         63 CONSTRAINT_MEMCG [container_id]
    

Important: When searching for the [container_id], use only the first 10-15 characters rather than the full string, as the ID is trimmed in some messages.
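The parsing in the command above can be exercised on a synthetic oom-kill line; the timestamp and the container ID abcdef123456 below are fabricated for illustration:

```shell
# Two copies of a fabricated oom-kill message, run through the same awk pipeline:
line='[1000.0000] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-abcdef123456.scope,mems_allowed=0-1'
printf '%s\n' "$line" "$line" \
  | awk -F ',' '{split($1,const,"="); split($3,id,"[-.]"); print const[2] " " id[2]}' \
  | sort | uniq -c | sort -nr
# → 2 CONSTRAINT_MEMCG abcdef123456   (uniq -c pads the count with leading spaces)
```

The first split isolates the constraint after the "=" in field 1; the second split on "-" and "." extracts the container ID from the crio-<id>.scope cpuset in field 3.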

  • Check the pod(s) related to the above container(s):

    # crictl ps -va | grep -A6 [container_id]
    ID: [container_id]
    PodID: [pod_id]
    Name: [pod_name]
    Attempt: 0
    State: Running
    Image: image-registry.openshift-image-registry.svc:5000/[registry_namespace]/[image]
    Created: x weeks ago
    
  • Check for information about the [container_id] in the journal:

    $ journalctl --no-pager | grep "[container_id]" | grep "Created container\|Starting container\|Started container\| OOM "
    

Note: Information about containers that were OOM-killed a long time ago may no longer be present in the journal, so search for the most recent occurrences.

Category

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.