When using the Java low pause Concurrent Mark Sweep (CMS) garbage collector "concurrent mode failure" appears in GC logs
Environment
- OpenJDK
- Oracle JDK 1.5 and later
Issue
- Long Garbage Collection (GC) pauses, with the GC log showing many concurrent mode failures.
- Entries like the following are found in the GC logs:
2008-07-23-19:21:10 [GC 457890.408: [ParNew (promotion failed): 59008K->59008K(59008K), 4.7636750 secs]457895.171: [CMS457896.587: [CMS-concurrent-mark: 1.593/6.359 secs] [Times: user=6.36 sys=3.50, real=6.36 secs] (concurrent mode failure): 1376854K->894596K(1507328K), 8.3823060 secs] 1430906K->894596K(1566336K), [CMS Perm : 84066K->84063K(131072K)], 13.1464100 secs] [Times: user=12.90 sys=3.48, real=13.14 secs] Total time for which application threads were stopped: 13.1475330 seconds
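Entries like the one above can be picked out of a large GC log mechanically. The following is a minimal sketch, assuming the log uses the same text layout as the sample entry; the class and method names (`CmsLogScan`, `oldGenReclaimedKb`, and so on) are hypothetical and not part of any JDK tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: flag CMS concurrent mode failures in a GC log line and pull out key numbers. */
final class CmsLogScan {
    // Old generation occupancy before/after the fallback full GC, as printed in the sample entry.
    private static final Pattern FAILURE =
            Pattern.compile("\\(concurrent mode failure\\): (\\d+)K->(\\d+)K\\((\\d+)K\\)");
    // Safepoint summary, e.g. "application threads were stopped: 13.1475330 seconds".
    private static final Pattern STOPPED =
            Pattern.compile("application threads were stopped: (\\d+\\.\\d+) seconds");

    static boolean isConcurrentModeFailure(String line) {
        return FAILURE.matcher(line).find();
    }

    /** Old generation space (in KB) reclaimed by the full GC, or -1 if the line has no failure. */
    static long oldGenReclaimedKb(String line) {
        Matcher m = FAILURE.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) - Long.parseLong(m.group(2)) : -1L;
    }

    /** Total stopped time in seconds, or -1 if the line carries no such summary. */
    static double stoppedSeconds(String line) {
        Matcher m = STOPPED.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    public static void main(String[] args) {
        // Tail of the sample entry from this article.
        String line = "(concurrent mode failure): 1376854K->894596K(1507328K), 8.3823060 secs] "
                + "1430906K->894596K(1566336K), [CMS Perm : 84066K->84063K(131072K)], 13.1464100 secs] "
                + "Total time for which application threads were stopped: 13.1475330 seconds";
        System.out.println("failure=" + isConcurrentModeFailure(line)
                + " reclaimedKB=" + oldGenReclaimedKb(line)
                + " stoppedSecs=" + stoppedSeconds(line));
    }
}
```

The reclaimed amount and the stopped time are the two figures the diagnostic steps in this article ask for, so extracting them per entry makes patterns over time easier to see.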
Resolution
-
See jvmconfig, a Red Hat Access Labs app, for an interactive way to generate an optimized configuration for your environment.
-
Increase the heap size. This helps prevent premature promotion from the young to the old generation and also reduces fragmentation. The trade-off is that collection time grows with heap size.
-
To prevent premature promotion from the young to the old generation, test adding the following JVM option:
-XX:MaxTenuringThreshold=32 for Java 7 or earlier, or -XX:MaxTenuringThreshold=15 for Java 8, which has a different range.
Note: In some older versions of the Oracle JDK this option may need to come after the -XX:+UseConcMarkSweepGC option to get picked up; however, ordering does not matter with the latest Oracle JDK and OpenJDK.
-
To prevent premature promotion from the young generation, test decreasing -XX:SurvivorRatio (the size of the Eden space compared to one survivor space) from its default value of 1024 to somewhere in the range of 8 (for smaller young generations on the order of 10MB) to 32 (for larger young generations of 100MB or more). For example:
-XX:SurvivorRatio=32
-
To prevent premature promotion from the young to the old generation, test setting -XX:TargetSurvivorRatio (the desired percentage of survivor space occupied after a scavenge) to a higher value than the default 50% to allow better utilization of the survivor space. For example:
-XX:TargetSurvivorRatio=90
-
To prevent premature promotion from the young to the old generation, increase the size of the young generation. The young generation is typically 1/4 to 1/3 of the heap size. However, be careful not to make the young generation too large compared to the old generation, especially with JDK 1.5, which computes the promotion guarantee based on the young generation size, not on historical promotion statistics.
-
To account for sudden changes in application object allocation rates, test setting the -XX:CMSInitiatingOccupancyFraction option to a value lower than the 92% default and forcing the JVM to use only this fraction when deciding when to start CMS collections. This is the tenured generation occupancy threshold that triggers a concurrent collection. For example:
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=85
-
To account for large variances in application object allocation rates, test the -XX:CMSIncrementalSafetyFactor=NN (default 10) option to start the concurrent collection NN% sooner than the calculated time. For example:
-XX:CMSIncrementalSafetyFactor=20
-
To prevent fragmentation, increase the size of the heap, or consider using the G1 collector instead on Java 7. G1 is the successor to the CMS collector and is designed to address CMS's most notable issue, fragmentation.
-
It doesn't necessarily require dramatic changes to eliminate the concurrent mode failures. For example, it may only require increasing the new generation size and lowering survivor ratio (i.e. increasing the amount of new generation space dedicated to the survivor spaces).
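The effect of the survivor options above is easier to judge with the arithmetic written out. The sketch below assumes the usual layout of one Eden space plus two equally sized survivor spaces, with -XX:SurvivorRatio being the Eden-to-one-survivor ratio as described above. It ignores the alignment and rounding a real JVM applies, and the class and method names are hypothetical.

```java
/** Sketch: approximate Eden and survivor sizes implied by -XX:SurvivorRatio. */
final class YoungGenLayout {
    /** One survivor space, given young = eden + 2 * survivor and eden = ratio * survivor. */
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    /** Eden is whatever the two survivor spaces leave over. */
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    /** Survivor occupancy aimed for after a scavenge, per -XX:TargetSurvivorRatio (a percentage). */
    static long targetSurvivorBytes(long survivorBytes, int targetSurvivorRatioPercent) {
        return survivorBytes * targetSurvivorRatioPercent / 100;
    }

    public static void main(String[] args) {
        long young = 102L * 1024 * 1024; // assumed 102MB young generation, for illustration only
        for (int ratio : new int[] {1024, 32, 8}) {
            long survivor = survivorBytes(young, ratio);
            System.out.println("SurvivorRatio=" + ratio
                    + " eden=" + edenBytes(young, ratio) / 1024 + "KB"
                    + " survivor=" + survivor / 1024 + "KB"
                    + " target@90%=" + targetSurvivorBytes(survivor, 90) / 1024 + "KB");
        }
    }
}
```

With a ratio of 1024 each survivor space is tiny, so nearly every object that survives one young collection is promoted immediately; lowering the ratio gives objects room to age in the survivor spaces first.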
Root Cause
The JVM will initiate a full GC using the serial old collector in an attempt to free up space in each of the following cases:
- The mostly concurrent collection (initial mark --> concurrent marking --> remark --> concurrent sweeping) of the old and permanent generations did not finish before either the old generation or the permanent generation became full. The CMS collector measures the rate at which the old generation is filling and the amount of time between collections, and uses this historical data to calculate when to start the concurrent collection (plus some padding) so that it finishes just before the old or permanent generation becomes full.
- An allocation (e.g. promotion from the new generation) cannot be satisfied with the available free space in the old generation. The CMS collector is not a compacting collector. It discovers garbage and adds the memory to free lists of available space that it maintains based on popular object size. If many objects of varying sizes are allocated, the free lists will be split. This can lead to many free lists whose total size is large enough to handle an allocation but no one free list that is large enough.
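The fragmentation case can be illustrated with a toy model. This is a minimal sketch, assuming a single list of free chunks and a first-fit policy; it is not how HotSpot organizes its free lists (which are kept per object size), and the `FreeListModel` class is hypothetical. It only shows how total free space can exceed a request while no single chunk can satisfy it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a non-compacting allocator: free chunks are never merged or moved. */
final class FreeListModel {
    private final List<Integer> freeChunks = new ArrayList<>();

    FreeListModel(int... chunkSizes) {
        for (int size : chunkSizes) {
            freeChunks.add(size);
        }
    }

    int totalFree() {
        int total = 0;
        for (int chunk : freeChunks) {
            total += chunk;
        }
        return total;
    }

    /** First fit: carve the request out of the first chunk that is large enough. */
    boolean allocate(int size) {
        for (int i = 0; i < freeChunks.size(); i++) {
            int chunk = freeChunks.get(i);
            if (chunk >= size) {
                if (chunk == size) {
                    freeChunks.remove(i);          // chunk fully consumed
                } else {
                    freeChunks.set(i, chunk - size); // chunk split, remainder stays free
                }
                return true;
            }
        }
        return false; // no single chunk fits, regardless of totalFree()
    }

    public static void main(String[] args) {
        FreeListModel oldGen = new FreeListModel(64, 64, 64, 64);
        System.out.println("total free: " + oldGen.totalFree());
        System.out.println("allocate 100: " + oldGen.allocate(100));
        System.out.println("allocate 48: " + oldGen.allocate(48));
    }
}
```

In the real collector, a failed allocation of this kind is what forces the fallback to a compacting full GC.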
Possible causes:
-
There is a change in application behavior (e.g. a load increase) that causes the young promotion rate to exceed historical data.
-
The application has large variances in object allocation rates, causing large variances in young generation promotion rates, leading to the CMS collector not being able to accurately predict the time between collections.
-
There is premature promotion from the young to the old generation, causing the old generation to fill with short lived objects.
-
The combination of heap size and object graph and lifetimes causes fragmentation.
-
There appears to be an issue with Seam on EAP 5.0.0: the same Seam application that runs fine on JBoss AS 4.2.2 results in concurrent mode failures and very poor performance on EAP 5.0.0.
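The first two causes come down to a race between the concurrent cycle and the promotion rate. The sketch below is a simplified model of that race, not the actual HotSpot heuristic: it assumes a constant promotion rate, a fixed cycle duration, and a padding percentage, and all names and numbers are hypothetical.

```java
/** Simplified model of the race between a concurrent cycle and old generation fill rate. */
final class CmsStartModel {
    /** Seconds until the old generation fills at the given promotion rate. */
    static double secondsUntilFull(long freeBytes, double promotedBytesPerSec) {
        return freeBytes / promotedBytesPerSec;
    }

    /** Free space at which a cycle must start to finish in time at the historical rate, plus padding. */
    static long startThresholdBytes(double historicalBytesPerSec, double cycleSeconds, int paddingPercent) {
        return Math.round(historicalBytesPerSec * cycleSeconds * (1 + paddingPercent / 100.0));
    }

    /** True if a cycle started with that much free space loses the race at the actual rate. */
    static boolean wouldFail(long freeBytesAtStart, double actualBytesPerSec, double cycleSeconds) {
        return secondsUntilFull(freeBytesAtStart, actualBytesPerSec) < cycleSeconds;
    }

    public static void main(String[] args) {
        double mb = 1024 * 1024;
        // Assumed figures: 10MB/s historical promotion rate, 6 second cycle, 10% padding.
        long threshold = startThresholdBytes(10 * mb, 6.0, 10);
        System.out.println("start when free space drops to " + threshold + " bytes");
        System.out.println("fails at historical rate: " + wouldFail(threshold, 10 * mb, 6.0));
        System.out.println("fails if rate doubles:    " + wouldFail(threshold, 20 * mb, 6.0));
    }
}
```

In this model the padding absorbs small variations, but a load increase that doubles the promotion rate fills the old generation before the cycle completes, which corresponds to a concurrent mode failure.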
Diagnostic Steps
-
Obtain the GC logs, boot.log, and the current JVM settings.
-
Note the size of the new generation compared to the amount of free space in the old generation. If the old generation free space is much larger than the new generation size, the issue is likely fragmentation. The CMS collector is not a compacting collector. It maintains free lists of available space; it discovers garbage and adds the memory to free lists. If you allocate a lot of objects of varying sizes, it has to split the free lists. This can lead to many free lists whose total size is large enough to handle the young generation promotion, but no one free list that is large enough to accommodate the young generation promotion. The young generation promotion apparently does not need a single, contiguous block of old generation space but rather copies objects to the free lists.
-
Note the amount of old generation space reclaimed by the full GC. If a lot of space is reclaimed, the issue could be premature promotion from the young to the tenured generation.
-
Note the total amount of space reclaimed by the full GC. If not much space is reclaimed, the issue is likely an undersized heap or unintended object retention.
-
Note the pattern of when the concurrent mode failures happen. If they happen randomly, it could be an indication the application has random object allocation rates that inhibit the JVM from selecting good concurrent collection times. If the concurrent mode failures happen at a specific time and the JVM recovers, it could be an indication that the application behavior changed (e.g. due to load) at that time and the JVM needed time to adjust to the new allocation rate.
-
Check to see if the CMS collector is being run in incremental mode with the -XX:+CMSIncrementalMode JVM option. The -XX:CMSInitiatingOccupancyFraction=n option is ignored in combination with -XX:+CMSIncrementalMode.
-
Check if it is a Seam application running on EAP 5.0.0.
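The incremental mode check above can also be done from inside the application. The following is a minimal sketch using the standard `RuntimeMXBean.getInputArguments()` call, which reports the arguments of the JVM it runs in only; the `CmsFlagCheck` class and its method are hypothetical helpers, and the check is a plain string match on the flags named in this article.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Sketch: detect the flag combination in which the initiating occupancy fraction is ignored. */
final class CmsFlagCheck {
    /** True if incremental mode is on and an initiating occupancy fraction is also set. */
    static boolean occupancyFractionIgnored(List<String> jvmArgs) {
        boolean incremental = jvmArgs.contains("-XX:+CMSIncrementalMode");
        boolean fractionSet = jvmArgs.stream()
                .anyMatch(arg -> arg.startsWith("-XX:CMSInitiatingOccupancyFraction="));
        return incremental && fractionSet;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excludes arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        if (occupancyFractionIgnored(jvmArgs)) {
            System.out.println("CMSInitiatingOccupancyFraction is set but ignored: incremental mode is on.");
        }
    }
}
```

The same method can be fed the argument list copied from boot.log when the JVM in question is not the one running the check.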