Improving OpenJDK and Oracle JDK Garbage Collection Performance
OVERVIEW
Application performance is determined by measuring the time it takes to complete a given task, which is the sum of the processing and blocking times for all systems and networks involved in the transaction. With Java applications, one key component of the total response time is the time the code spends executing in the Java Virtual Machine (JVM). This is often the largest component of total response time due to Java garbage collection (GC) blocking. Therefore, improving Java GC performance is often a good way to improve application performance. This document describes how to set JVM performance goals, measure JVM performance, and address two of the most common GC issues: long pauses and bad throughput.
SETTING JVM PERFORMANCE GOALS
Before you can correct a performance issue, you must diagnose the root of the problem. The most useful way to start is by setting performance goals. Once you define the most important performance characteristics for your system, you can figure out which parameters to change and how to change them. One important question to answer is whether you want to focus on minimizing application response times or maximizing throughput. Max pause and throughput are trade-offs, so you have to decide which is the higher priority. Application requirements largely determine whether it is preferable to have more frequent collections of a shorter duration or less frequent collections that last longer.
MEASURING PERFORMANCE
Java GC is accomplished by garbage collectors, algorithms that sort through the objects in the JVM and determine which ones are live and which ones are no longer referenced so the memory can be reclaimed. OpenJDK and Oracle JDK both use generational collectors. This means there are two distinct regions within the heap: one for new objects (young generation space) and one for older objects (old generation space). In addition, there is a separate memory space called the permanent generation where class definitions, static member variables, and interned strings are stored. When a generation fills up, the JVM will do a garbage collection in an attempt to free space.
The JVM uses different collectors, some with multiple phases. Some collector phases are stop-the-world: all application threads in the JVM are stopped while the collection is performed. However, not every phase of a collector is blocking. Some phases are concurrent, meaning the application threads keep running while the collector does its work.
GC can cause two different types of issues: long pauses and bad throughput. Acceptable maximum pause (max pause) depends on system requirements, but 5-10 seconds seems to be a common range for this setting. Throughput is the percentage of time spent on tasks other than GC versus total time (100% means there was no GC and all time was spent running application threads; 0% means all time was spent doing GC). Throughput of 95% and above is generally considered good.
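The throughput arithmetic above can be sketched in a few lines; the class and method names here are our own, and the example numbers are illustrative rather than taken from any real GC log:

```java
// Sketch: computing GC throughput as defined above.
// Throughput = (total time - time spent in GC) / total time * 100.
public class GcThroughput {
    static double throughputPercent(double gcSeconds, double totalSeconds) {
        return (totalSeconds - gcSeconds) / totalSeconds * 100.0;
    }

    public static void main(String[] args) {
        // Example: 30 seconds of GC over a 10-minute (600-second) window.
        double t = throughputPercent(30.0, 600.0);
        System.out.println(t); // 95.0 -- right at the "generally good" line
    }
}
```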
Garbage Collection Analysis
The garbagecat tool can be used to analyze GC logs for OpenJDK and Oracle JDK (JDK 1.5 and later). It differs from other tools in that it is able to handle complex collector events, such as concurrent mode failures, and is able to parse GC data from mixed logging. In addition, it goes beyond the simple math of calculating statistics such as max pause and throughput. It adds context to these numbers by identifying the associated collector or collector phase, which allows for much deeper insight and analysis.
garbagecat gives the best results when the following standard GC options are used:
-XX:+PrintGC -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime
IMPROVING PERFORMANCE
Follow the recommendations below to address the problems of long pauses and bad throughput.
Long Pauses
There are a number of possible root causes for long pauses:
- There is memory paging/swapping: there is not enough physical memory for the JVM and other services.
- The collector being used is optimized for throughput, not low pauses.
- The number of parallel GC threads is too high for the hardware and any virtual operating system or services running on the shared hardware.
- The heap is larger than necessary. Pause time is directly related to heap size, so larger heaps will generally have larger pauses.
- A large heap (>2GB) is being used without large-page support.
- Explicit GC calls when the low pause collector option (-XX:+UseConcMarkSweepGC) is specified are being done with a serial collector that was not designed with multi-CPU systems in mind. This is typically quite slow.
- A tool such as pstack, which causes all threads to stop until it completes, was run. If it takes a long time for pstack to run, it will pause the JVM.
- The JVM initiates a full GC using the serial old collector in an attempt to free space, resulting in long GC pauses with the GC log showing multiple concurrent mode failures. To learn more about this issue, see the Red Hat Knowledgebase article at http://access.redhat.com/kb/docs/DOC-39687.
- The Java CMS remark collection "weak refs processing" phase takes a long time. To learn more about this issue, see the Red Hat Knowledgebase article at http://access.redhat.com/kb/docs/DOC-56188.
Below are several recommended resolutions, which differ based on the root cause of the long pause.
Set Options for JBoss Enterprise Application Platform (EAP) 5.0.0
For JBoss EAP 5.0.0, try adding the -XX:+AggressiveOpts and -XX:+DoEscapeAnalysis JVM options, which internal testing has shown to be effective at improving performance.
Change the Throughput Collector
If garbagecat indicates that the PARALLEL_SERIAL_OLD collector is being used and there is more than one CPU, add the -XX:+UseParallelOldGC JVM option so that full collections are done with the newer parallel-throughput collector (PARALLEL_OLD_COMPACTING).
NOTE: This is typically best on multiprocessor hardware. The default collector for OpenJDK and Oracle JDK 1.5 and 1.6 is the throughput collector, which will optimize throughput at the expense of long pause times. Check to see if your performance improves while using the CMS low-pause collector with -XX:+UseConcMarkSweepGC.
Enable Large-Page Support
If the heap is large (>2GB), see if performance is improved by enabling large-page support.
Enabling large-page support typically has one of the biggest impacts on improved performance with large heaps. In Red Hat Enterprise Linux, the mapping of physical memory frames and virtual memory pages is stored in the Page Table. Page Table lookups are very expensive, so the information is cached in the Translation Lookaside Buffer (TLB). A virtual memory address is first looked up in the TLB to find the corresponding physical memory address, and if it is not found (a TLB miss), then the expensive Page Table lookup is done. When page sizes are large, a single TLB entry can span a larger memory area, resulting in fewer TLB misses. Also, large-page shared memory is locked into memory and cannot be swapped to disk, so it guards against swapping, which is unacceptable for Java application performance.
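The TLB-reach point above can be made concrete with a back-of-the-envelope sketch. The 512-entry TLB size below is an assumed example for illustration, not a measured value for any particular CPU:

```java
// Sketch: how much memory a TLB of a given size can cover ("TLB reach")
// at 4KB versus 2MB page sizes. The 512-entry count is illustrative only.
public class TlbReach {
    static long reachBytes(int tlbEntries, long pageSizeBytes) {
        return tlbEntries * pageSizeBytes;
    }

    public static void main(String[] args) {
        int entries = 512; // assumed TLB size for illustration
        long mb = 1024L * 1024;
        // 4KB pages: 512 entries cover only 2MB of the heap without a miss.
        System.out.println(reachBytes(entries, 4L * 1024) / mb + " MB");
        // 2MB pages: the same 512 entries cover 1024MB (1GB).
        System.out.println(reachBytes(entries, 2L * mb) / mb + " MB");
    }
}
```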
However, there are trade-offs to consider:
- Large pages can sometimes negatively impact system performance. For example, when a large amount of memory is pinned by an application, it can create a shortage of physical memory and cause excessive paging in other applications, slowing the entire system.
- For a system that has been up for a long time, excessive fragmentation can make it impossible to reserve enough large-page memory, and the operating system might revert to using regular pages. This effect can be minimized by allocating the entire heap on startup (set -Xms = -Xmx, -XX:PermSize = -XX:MaxPermSize, and -XX:InitialCodeCacheSize = -XX:ReservedCodeCacheSize).
- The default sizes of the permanent generation or code cache might be larger as a result of using a large page if the large-page size is larger than the default sizes for these memory areas.
To enable large-page support, you must first enable support in your operating system (see hugetlbpage.txt in the Red Hat Enterprise Linux kernel-docs package). Once that is complete, add the -XX:+UseLargePages JVM option and set the large page size (for example, -XX:LargePageSizeInBytes=4m).
Decrease the Number of Parallel Garbage Collectors
Verify that the number of CPUs available to the JVM does not conflict with the default -XX:ParallelGCThreads setting. For -XX:+UseParallelGC, the default is to set the number of threads based on the total number of hardware threads (CPUs). For -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC, the default is computed as follows: (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8), where ncpus is the number of cores on the platform or in the processor set (pset) the JVM is bound to, if any. If -XX:ParallelGCThreads is set too high, the threads will compete with one another and hurt performance.
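The default thread-count formula quoted above can be sketched as a plain Java translation; this is not the actual HotSpot source, just the published expression:

```java
// Sketch of the default ParallelGCThreads calculation described above for
// -XX:+UseConcMarkSweepGC / -XX:+UseParNewGC:
//   (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8)
public class DefaultGcThreads {
    static int defaultParallelGcThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8);
    }

    public static void main(String[] args) {
        System.out.println(defaultParallelGcThreads(8));  // 8  (one per CPU)
        System.out.println(defaultParallelGcThreads(16)); // 13 (3 + 80/8)
        System.out.println(defaultParallelGcThreads(32)); // 23 (3 + 160/8)
    }
}
```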
Decrease the number of parallel garbage collectors with the -XX:ParallelGCThreads JVM option:
- If the JVM is running in a virtual environment, explicitly set -XX:ParallelGCThreads based on the number of CPUs assigned to the guest operating system.
- If the JVM is not running in a dedicated environment, -XX:ParallelGCThreads might have to be adjusted to account for the CPUs used by the collocated services.
Specify GC Calls to Run Concurrently
Check to see if the low pause collector is specified (-XX:+UseConcMarkSweepGC) and if the long pauses are of the following format:
80361.932: [Full GC (System) 80361.940: [CMS: 5612504K->231062K(7626048K), 24.5546875 secs] 5710661K->231062K(8312384K), [CMS Perm : 94248K->91371K(131072K)], 24.5555401 secs] [Times: user=24.54 sys=0.04, real=24.56 secs]
This indicates a serial collector, which was not designed with multiprocessor systems in mind and is thus typically quite slow. This can be the result of explicit GC calls, which by default invoke a serial collector. Test adding the -XX:+ExplicitGCInvokesConcurrent JVM option in combination with -XX:+UseConcMarkSweepGC so explicit GC calls from System.gc() will be done concurrently.
Alternatively, if the application does not import or export any remote objects (for example, Entity JavaBeans), disable explicit GC altogether with -XX:+DisableExplicitGC. When disabling explicit GC, also remove the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval options.
Check for Poor/Inverted Parallelism
Check for evidence of poor or inverted parallelism with the collections taking the longest time, such as the following:
[GC 239727.670: [ParNew: 2713920K->82240K(2713920K), 24.0534189 secs] 13734454K->11225032K(25083584K), 24.0541689 secs] [Times: user=3.27 sys=1.79, real=24.05 secs]
Note that the actual time for the collection is much larger than the user time: 24.05 seconds versus 3.27 seconds. This means that GC threads used 3.27 seconds of processor time in user space over an actual time period of 24.05 seconds. It should be just the opposite. For example, with -XX:ParallelGCThreads=6 and ideal parallelism, the user space time would be six times the actual time. In this case, not only is the parallelism poor, but it is very poor in the wrong direction.
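The user-versus-real comparison above can be checked mechanically. This sketch (the helper names are our own, not from any tool) computes the effective parallelism of a collection and flags the inverted case:

```java
// Sketch: detecting inverted parallelism from one GC log entry.
// With N GC threads working well, user time should approach N * real time;
// user < real means less than one CPU was effectively used during the pause.
public class GcParallelism {
    static double parallelism(double userSecs, double realSecs) {
        return userSecs / realSecs;
    }

    public static void main(String[] args) {
        // Values from the log entry above: user=3.27, real=24.05
        double p = parallelism(3.27, 24.05);
        System.out.printf("effective parallelism: %.2f%n", p); // 0.14
        // Anything below 1.0 is inverted; with ParallelGCThreads=6 and
        // ideal parallelism, this value would approach 6.0.
        System.out.println(p < 1.0 ? "inverted parallelism" : "ok");
    }
}
```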
This could be an indication of swapping. In virtual environments, there can sometimes be plenty of underlying physical memory and even memory provisioned to the virtual operating system, yet the memory that can be allocated is limited. To prevent swapping, you should increase the size of the memory allocated to the virtual machine (within the size of your physical RAM).
Inverted parallelism can also be caused by non-uniform memory access (NUMA) issues. If zone reclaim mode is enabled, the system might be trying to force-fit memory allocations within a single NUMA node. Check to see if NUMA zone reclaim mode is enabled (cat /proc/sys/vm/zone_reclaim_mode on Red Hat Enterprise Linux), and if it is set to a non-zero value, test with it disabled (set /proc/sys/vm/zone_reclaim_mode to 0). For more details, see the vm.txt file in the kernel-docs package.
Another cause of inverted parallelism is CPU saturation. Check whether the JVM is competing with itself for CPU by verifying that the number of GC threads is appropriate for the number of CPUs/cores and any processes sharing them; if it is not, lower the thread count with the -XX:ParallelGCThreads option.
Enable Compressed Object Pointers
Beginning with OpenJDK 6 and Oracle JDK 1.6 update 14, add the -XX:+UseCompressedOops JVM option with heap sizes less than 32GB to enable compressed object pointers. This helps prevent 64-bit object bloat by encoding object references as 32-bit values.
NOTE: This option is enabled by default starting in JDK 1.6 update 23. Using this option prior to JDK 1.6 update 21 is known to cause the JVM to crash in some cases.
Bad Throughput
If overall throughput is at 80% or less or there are stretches when almost all time is spent doing GC, one full GC after another, you have a throughput issue. You should be sure to resolve long GC pauses before troubleshooting low throughput as long pauses will negatively affect throughput. Bad throughput has three main causes:
- The permanent generation is full and triggering full collections in an attempt to free space.
- Ergonomics (for example, -XX:MaxGCPauseMillis) is being used to set an unrealistic goal for the hardware and/or load.
- The heap is full of long-lived objects and the JVM is trying repeatedly to free space.
Below are recommended resolutions, which differ based on the root cause of the bad throughput.
Decrease Permanent Generation Usage or Increase Permanent Generation Size
OpenJDK and Oracle JDK have a dedicated permanent generation space. You might encounter an error if the size of the data stored in that space exceeds the allocated size, either due to unintended retention or because the permanent generation space is too small.
If the permanent generation is too small, increase its size:
-XX:PermSize=256M -XX:MaxPermSize=256M
If the issue is related to hot deployment/redeployment, consider changing practices to avoid hot deployment/redeployment. If hot deployment/redeployment is unavoidable, schedule a subsequent restart of the application during off hours to release any leaked class loaders. If you are using the CMS low pause collector, enable class unloading and permanent generation collections by adding the following JVM options:
- JDK 1.5: -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
- JDK 1.6, 1.7: -XX:+CMSClassUnloadingEnabled
If the issue is dynamic proxies, either do not use a proxy (make direct JMX calls) or change the code to decrease proxy use by caching the proxy so that it is only created once and is reused for future calls. To learn more about this issue, see the Red Hat Knowledgebase article at http://access.redhat.com/kb/docs/DOC-38115.
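One way the create-once-and-reuse caching above can be sketched is with a lazily initialized java.lang.reflect.Proxy held in a static field; the interface, field, and handler here are illustrative examples, not names from the article:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Sketch: caching a dynamic proxy so repeated calls reuse one instance
// instead of generating a new proxy (and filling the permanent generation
// with generated proxy classes) on every call. Names are illustrative.
public class ProxyCache {
    interface Greeter { String greet(String name); }

    private static volatile Greeter cached; // created once, reused thereafter

    static Greeter greeter() {
        Greeter local = cached;
        if (local == null) {
            synchronized (ProxyCache.class) {
                local = cached;
                if (local == null) {
                    InvocationHandler h = (proxy, method, args) -> "hello, " + args[0];
                    local = (Greeter) Proxy.newProxyInstance(
                            ProxyCache.class.getClassLoader(),
                            new Class<?>[] { Greeter.class }, h);
                    cached = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Same instance both times: no per-call proxy creation.
        System.out.println(greeter() == greeter()); // true
        System.out.println(greeter().greet("world")); // hello, world
    }
}
```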
Remove Ergonomic Options
If ergonomics (for example, -XX:MaxGCPauseMillis) is being used, test removing the ergonomic options and running the JVM unconstrained.
Increase the Heap Size or Decrease Retention
A Java OutOfMemoryError can be thrown if the heap is too small for an application’s requirements or if there is unintended object retention. If the heap is too small to support the application and use case, increase the maximum heap size with the -Xmx JVM option.
If there is unintended object retention, update the code to eliminate or decrease the retention. To learn more about this issue, see the Red Hat Knowledgebase article at http://access.redhat.com/kb/docs/DOC-37057.