SystemMemoryExceedsReservation alert received in OCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
- The monitoring stack is firing the SystemMemoryExceedsReservation alert; however, the affected nodes do not show memory pressure.
Resolution
The default memory reservation (1GB of memory) is expected to be sufficient for most configurations, and a fix shipped in older releases to prevent false positives during startup. However, on nodes with a large amount of memory, or if the alert is still firing in recent versions, the reservations need to be increased.
Increasing the reservations
To increase the reservations, first check the resources used on the nodes as explained in the "Diagnostic Steps" section, to determine whether the alert is a false positive. After that, refer to "Which amount of CPU and memory are recommended to reserve for the system in OCP 4 nodes?" to calculate the recommended systemReserved values and configure them.
Note: there is a bug, described in "high system-reserved cpu usage when autoSizingReserved: true is enabled", that reserves too little CPU when the automatic reservation is used. If running an affected version, upgrade to a version that includes the fix, or configure the reservation manually.
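Where a manual reservation is needed, the systemReserved values can be set through a KubeletConfig. The sketch below is illustrative only: the name set-system-reserved and the cpu/memory values are assumptions to be replaced with the values calculated for your cluster; the selector targets the worker machine config pool.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-system-reserved        # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m    # example value; use the calculated recommendation
      memory: 3Gi  # example value; use the calculated recommendation
```

Note that applying a KubeletConfig triggers a rolling reboot of the nodes in the selected pool.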
Root Cause
The SystemMemoryExceedsReservation alert (introduced in OpenShift 4.6) is a warning triggered when the memory usage of the system processes exceeds 95% of the reservation, not 95% of the total memory in the node.
Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default memory reservation (which is 1GB memory) is expected to be sufficient for most configurations and should be increased when running nodes with high numbers of pods (either due to rate of change or at steady state), or in nodes with high memory.
Understanding the alert trigger
The Prometheus query used by the alert looks like this:
sum by (node) (container_memory_rss{id="/system.slice"})
>
((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.95)
-
The right side of the check:
"((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.95)" is a static value, indicating the memory that remains reserved for your system processes.
-
The left side of the check:
"sum by (node) (container_memory_rss{id="/system.slice"})" shows how much resident memory the system processes on the node (everything under the /system.slice cgroup, such as kubelet and CRI-O) are currently using.
Based on the above, the alert warns that, if the node fills up all of its allocatable memory, there will not be enough memory left to satisfy the system processes that support the containers; it does not mean that the node is resource-exhausted right now.
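As a rough sketch of the arithmetic behind the alert, using hypothetical example values (16Gi capacity, 15Gi allocatable, so a 1Gi reservation), the threshold is 95% of capacity minus allocatable:

```shell
# Hypothetical example: 16Gi capacity, 15Gi allocatable => 1Gi reserved.
capacity=$((16 * 1024 * 1024 * 1024))
allocatable=$((15 * 1024 * 1024 * 1024))
reserved=$((capacity - allocatable))
# The alert fires when the system.slice RSS exceeds 95% of the reservation.
threshold=$(awk -v r="$reserved" 'BEGIN { printf "%d", r * 0.95 }')
echo "reserved=${reserved} threshold=${threshold}"
```

With these example numbers, any system.slice RSS above roughly 0.95GiB would fire the alert even though the node itself still has plenty of free memory.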
Diagnostic Steps
-
Check the list of processes included in the resource reservation:
$ oc debug node/[node_name]
[...]
sh-4.4# chroot /host bash
# systemd-cgls /system.slice/
-
Check the reserved values configured in the nodes by checking the
/etc/node-sizing.env file within the nodes:
$ for node in $(oc get node -o name); do echo "--- ${node} ---"; oc debug -q ${node} -- cat /host/etc/node-sizing.env; done
-
Check the resources used in the nodes (check the values used by
kubelet and runtime if you want to compare them with the configured systemReserved values):
$ oc get --raw /api/v1/nodes/<node>/proxy/stats/summary
[...]
{
  "node": {
    "nodeName": "cluster.node22",
    "systemContainers": [
      {
        "cpu": {
          "usageCoreNanoSeconds": 929684480915,
          "usageNanoCores": 190998084
        },
        "memory": {
          "rssBytes": 176726016,
          "usageBytes": 1397895168,
          "workingSetBytes": 1050509312
        },
        "name": "kubelet"
      },
      {
        "cpu": {
          "usageCoreNanoSeconds": 128521955903,
          "usageNanoCores": 5928600
        },
        "memory": {
          "rssBytes": 35958784,
          "usageBytes": 129671168,
          "workingSetBytes": 102416384
        },
        "name": "runtime"
      }
      [...]
    ]
  }
}
[...]
-
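To compare those per-container numbers against the reservation, the rssBytes values can be summed. The sketch below is a minimal, dependency-free example: summary.json is a hypothetical file holding an excerpt of the stats summary output (the sample numbers above), and plain awk is used for extraction — a JSON-aware tool such as jq would be more robust if available.

```shell
# summary.json: hypothetical saved excerpt of
#   oc get --raw /api/v1/nodes/<node>/proxy/stats/summary
cat > summary.json <<'EOF'
{ "node": { "systemContainers": [
  { "memory": { "rssBytes": 176726016 }, "name": "kubelet" },
  { "memory": { "rssBytes": 35958784 }, "name": "runtime" }
] } }
EOF

# Extract every "rssBytes" value and add them up.
awk 'match($0, /"rssBytes": *[0-9]+/) {
       v = substr($0, RSTART, RLENGTH)
       gsub(/[^0-9]/, "", v)
       sum += v
     }
     END { print sum }' summary.json
```

With the sample values, the combined kubelet and runtime RSS (about 203MiB) sits comfortably below the default 1GB reservation, which would indicate the alert is being driven by other system.slice processes or a reservation that is too small.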
Check the utilization via the Prometheus query used by the alert:
sum by(node) (container_memory_rss{id="/system.slice"}) > ((sum by(node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.95)
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.