Java's memory consumption inside an OpenShift 4 container


The article Why does the JVM consume more memory than the amount given to -Xmx? covers a core detail of how Red Hat's OpenJDK (and Java 8+ in general) behaves in terms of memory usage: the heap is only part of the total JVM memory consumption, and the remainder is off-heap (so-called native) usage.
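To make the heap/off-heap split concrete, the sketch below uses the standard java.lang.management API to show the JVM's own view: the heap limit bounded by -Xmx, and the non-heap pools (metaspace, code cache) that come on top of it. This is an illustrative snippet, not from the referenced article; note that it still does not see all native allocations.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapVsOffHeap {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        // The heap is bounded by -Xmx (or the container-derived default).
        long heapMax = mem.getHeapMemoryUsage().getMax();
        // Non-heap pools (metaspace, code cache) come on top of the heap.
        long nonHeapUsed = mem.getNonHeapMemoryUsage().getUsed();
        System.out.println("heapMax      = " + heapMax + " bytes");
        System.out.println("nonHeapUsed  = " + nonHeapUsed + " bytes");
        // Thread stacks, direct buffers, and allocator overhead are not
        // visible here; start the JVM with -XX:NativeMemoryTracking=summary
        // and run `jcmd <pid> VM.native_memory summary` for a fuller picture.
    }
}
```

Running this inside the container shows why the process RSS is always larger than the heap alone.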

Specifically, in terms of JVM usage inside a container in OpenShift 4, a few topics are worth noting:

Reported usage versus actual usage in OpenShift 4

In terms of reported usage for containers in OCP 4, some metrics are scraped by the nodes' cAdvisor from cgroups files, so a discrepancy between the data reported by the JVM and the cAdvisor information is expected.

In other words, a large discrepancy is not caused by the JVM violating Xmx or MaxRAM, otherwise the OOM kill would already have happened (cgroups enforce the container limits). The difference is usually explained by the fact that those cgroups metrics include virtual memory (VSS) and cache usage. For example: container_memory_rss is equal to the value of total_rss from the /sys/fs/cgroup/memory/memory.stat file, which also reports cache data; the reported value may be twice the container size (or more), in other words it can exceed the total amount of physical memory depending on the configuration. This means the file cache and any other cached memory inflate the container size reported by cAdvisor.

Again, this does not mean the JVM is using more memory than the container limit; otherwise, this would directly trigger a cgroups OOM kill, as for any process inside a container. The reported usage may differ from the actual usage.
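The limit the kernel will actually enforce can be read from inside the container. The sketch below checks the standard cgroup v2 and v1 file locations; it is an illustrative helper (the class and method names are not from any Red Hat tooling), and it simply reports "no limit visible" when run outside a limited cgroup.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupLimit {
    /** Returns the enforced cgroup memory limit in bytes, or -1 if none is visible. */
    static long readLimit() {
        // Standard locations: cgroup v2 first, then cgroup v1.
        Path[] candidates = {
            Path.of("/sys/fs/cgroup/memory.max"),
            Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes")
        };
        for (Path p : candidates) {
            try {
                String s = Files.readString(p).trim();
                if (s.equals("max")) return -1L; // cgroup v2: unlimited
                return Long.parseLong(s);
            } catch (Exception e) {
                // path absent or unreadable; try the next candidate
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long limit = readLimit();
        System.out.println(limit > 0
                ? "cgroup memory limit: " + limit + " bytes"
                : "no cgroup memory limit visible");
        // -Xmx only bounds the heap; this limit applies to the whole process,
        // so the kernel OOM-kills the container when total usage exceeds it.
    }
}
```

Comparing this value with the JVM-reported heap limit makes the "reported vs enforced" distinction explicit.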

Usage of JVM vs usage of other processes (or mechanisms) inside the container

In terms of actual usage inside the container (and usually the pod), there would typically be only one process, the main Java process (as in the process started by run-java.sh). However, other mechanisms might impact the memory usage (directly and indirectly) and can therefore cause OOM kills. A few scenarios are listed below:

  • For pods with multiple containers: sidecars can play a role in the memory consumption. Make sure to track the usage of those containers as well.
  • Accessing a container via an ssh client adds overhead to the memory consumption of the container. Make sure the clients close their connections as well.
  • If a hypervisor is used with overcommitting enabled, it can cause a ballooning side effect inside the container. For instance, the VMware balloon driver can be responsible for memory consumption inside the container. Confirming this requires a vmcore or SystemTap. See the solutions How to find out if Vmware's Ballooning drivers are consuming memory? and How to determine the memory allocated to the VMware Balloon driver [vmw_balloon] (shipped with RHEL) from vmcore?.
  • Any mechanism or process that anonymously allocates memory in the respective container and is not limited by the JVM boundaries: MaxRAM (or Xmx, the heap upper limit, which takes precedence).
  • Swap was disabled in OCP 3.11+; see details in Java Pod OOME Killer after migration from OCP 3.9 to OCP 3.11. For details on enabling it in OCP 4.x, see Enable swap memory of the nodes in RHOCP4.
  • Neither glibc's malloc nor jemalloc (an alternative C allocator) is container-aware; both sit below the JVM level.
  • In JDK 17, the ZGC heap is still triple-mapped, so the heap size is counted three times in the RSS display. If the heap is fully touched and its size is X, the displayed RSS can be as high as 3X; the actual physical usage is only one third of what is displayed.
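The gap between the JVM's own accounting and the kernel's view can be observed directly. The Linux-only sketch below (an illustrative helper, not a Red Hat tool) compares the heap usage reported by the JVM with the VmRSS value the kernel reports in /proc/self/status; with ZGC's multi-mapped heap the second number can be a multiple of the touched heap.

```java
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class RssVsHeap {
    /** Returns this process's VmRSS in kB, or -1 if /proc is unavailable. */
    static long vmRssKb() {
        try {
            for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
                if (line.startsWith("VmRSS:")) {
                    // Line looks like "VmRSS:   123456 kB"; keep only digits.
                    return Long.parseLong(line.replaceAll("[^0-9]", ""));
                }
            }
        } catch (Exception e) {
            // not Linux, or /proc is not mounted
        }
        return -1;
    }

    public static void main(String[] args) {
        long heapUsedKb = ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed() / 1024;
        System.out.println("JVM heap used : " + heapUsedKb + " kB");
        System.out.println("Kernel VmRSS  : " + vmRssKb() + " kB");
        // VmRSS is normally larger (off-heap, JIT code, thread stacks), and
        // with ZGC's triple-mapped heap it can overstate physical usage.
    }
}
```
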

Kernel RSS reporting

The kernel's RSS accounting is not exact and varies depending on the pages used (it can also vary depending on which kernel version is used).

On Linux/x86_64, the JVM maps the heap in three different locations. When using small pages, the same physical page will typically be accounted for three times instead of once. On the other hand, when using large pages, the behavior differs: the memory is accounted to the hugetlbfs inode and not to the process.

Troubleshooting

For OOM kills, the solution How to interpret an OOM killer message can be very useful. In active troubleshooting scenarios, collecting the VM.info output and the GC logs provides the direct memory usage by Java. VM.info is particularly relevant given the memory dump details at the end of the file. Example:

### Memory hard limit, swap (memory included), and soft limit:
$ podman run --memory-reservation=700m --memory=500m --memory-swap=1000m --rm -it localhost/openjdk:app
memory_limit_in_bytes: 512000 k <------------------- memory 500mb
memory_and_swap_limit_in_bytes: 1024000 k <--------- swap and memory 1000mb
memory_soft_limit_in_bytes: 716800 k <-------------- soft limit 700mb
memory_usage_in_bytes: 120324 k
memory_max_usage_in_bytes: not supported
memory_swap_current_in_bytes: unlimited
memory_swap_max_limit_in_bytes: 512000 k
...
...
$ podman stats
ID            NAME              CPU %       MEM USAGE / LIMIT  MEM %       NET IO       BLOCK IO      PIDS        CPU TIME    AVG CPU %
e6d5ec05dd7e  vigorous_wescoff  0.20%       97.26MB / 524.3MB  18.55%      330B / 430B  0B / 720.9kB  42          6.045894s   0.69%

Details can be found in the solution Interpreting VM.info file in OpenJDK/OracleJDK.

Control groups differences: Cgroups v1 (cgv1) vs Cgroups v2 (cgv2):

OCP 4.14+ brings cgroups v2 by default (cgv1 remains in clusters migrated from earlier versions). However, OCP 4.19 deprecates cgv1 and OCP 4.20+ removes support for cgv1 altogether:

Cgroups version | OCP version                                  | Output
cgv1            | Deprecated in OCP 4.19; removed in OCP 4.20  | In cgroups v1 the metric reported was active; there is also a file counter.
cgv2            | Introduced in OCP 4.14+; default in OCP 4.20 | In cgroups v2 it reports active and cache; this is more detailed, with file_mapped.

File path differences:

Cgroup v1 example file path:

From /sys/fs/cgroup/memory/memory.usage_in_bytes 

Cgroups v2 example file path (equivalent to the above):

From /sys/fs/cgroup/memory.stat:
  anon → Memory actively used by the JVM process (Java heap, metaspace, threads, native buffers).
  file → Page cache (file system cache, JARs, class files, logs). The kernel can reclaim this if needed.
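Since memory.stat (in both cgroup versions) is a flat list of "key value" lines, it is straightforward to split process memory from reclaimable page cache programmatically. The sketch below is illustrative; the sample input is made up, and it accepts both the v2 field names (anon/file) and the v1 ones (rss/cache).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MemoryStat {
    /** Parses memory.stat-style content ("key value" per line) into a map. */
    static Map<String, Long> parse(String stat) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : stat.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) out.put(kv[0], Long.parseLong(kv[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative sample, not taken from a real container.
        String sample = "anon 104857600\nfile 52428800\nslab 1048576\n";
        Map<String, Long> stat = parse(sample);
        // cgroup v2 uses anon/file; cgroup v1 uses rss/cache.
        long anon = stat.getOrDefault("anon", stat.getOrDefault("rss", 0L));
        long file = stat.getOrDefault("file", stat.getOrDefault("cache", 0L));
        System.out.println("process memory (anon/rss)      : " + anon + " bytes");
        System.out.println("reclaimable cache (file/cache) : " + file + " bytes");
    }
}
```

Only the anon/rss portion counts against the JVM process itself; a large file/cache value is normally reclaimable and not a leak.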

If file-backed memory is increasing, let the kernel reclaim it. If an OOM eventually occurs, it is then possible to check whether file-backed memory is actually the issue, or whether anonymous memory (or something else) is; that is usually not the case.
The slab can be tracked as well, but doing so involves an enormous overhead.
Therefore, the recommendation from Red Hat is to avoid tracking it unless an actual problem is found.

For a full comparison of cgv1 vs cgv2, see Cgroups v2 in OpenJDK container in Openshift 4.
