Heap dump shows memory accumulation in one instance of org.jgroups.protocols.pbcast.NAKACK

Solution Unverified - Updated 6 Aug 2024

Environment

Red Hat JBoss Enterprise Application Platform (EAP) 5.1.x

Issue

We are analysing a heap dump and we have a problem:

One instance of "org.jgroups.protocols.pbcast.NAKACK" loaded by "org.jboss.classloader.spi.base.BaseClassLoader @ 0x2aaacfe84578" occupies 609,198,944 (38.19%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "<system class loader>".

Keywords
org.jboss.classloader.spi.base.BaseClassLoader @ 0x2aaacfe84578
org.jgroups.protocols.pbcast.NAKACK
java.util.concurrent.ConcurrentHashMap$Segment[]

Can you see where the problem is?

Heap dump shows org.jgroups.stack.NakReceiverWindow occupying a high percentage of the heap

Class Name                                                          | Shallow Heap | Retained Heap
---------------------------------------------------------------------------------------------------
org.jgroups.stack.NakReceiverWindow @ 0xccdc60b8                    |           88 |   465,927,320
|- listener org.jgroups.protocols.pbcast.NAKACK @ 0xccdc1c38        |          232 |   465,966,160
|  |- xmit_table java.util.concurrent.ConcurrentHashMap @ 0xccdc5d90|           48 |   465,932,648
---------------------------------------------------------------------------------------------------

Resolution

Capture the heap dump from the affected instance and attach it to a support case for further analysis.
If the file is too large to be attached to a case, see How do I upload large files

Root Cause

The NakReceiverWindow buffers all sent messages, and also received messages if discard_delivered_msgs=false, until all nodes in the cluster have seen them.
The highest message seen by every member is decided by consensus among all members, in the STABLE protocol.
If there is a large gap between highest_received/highest_delivered and highest_stability_seqno, NAKACK will be seen to occupy a significant chunk of the heap.

There are two reasons why this could happen:

A slow member delays agreement on delivered messages, so the number of messages stored in the NakReceiverWindow keeps growing. Note that if that member is killed, stability will soon purge the delivered messages.
STABLE is misconfigured, e.g. only desired_avg_gossip is defined but not max_bytes.
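
The effect of a slow member can be illustrated with a toy model. The sketch below is not JGroups code: the class StabilitySketch and its methods are hypothetical names, and it assumes, as a simplification of what STABLE does, that the stable seqno is the lowest "highest delivered" value reported by any member. Messages at or below that seqno are purged, everything above it is retained.

```java
import java.util.TreeMap;

// Toy model only: NOT the JGroups implementation. All names are hypothetical.
class StabilitySketch {
    // Stand-in for the retransmission table: seqno -> message payload.
    private final TreeMap<Long, byte[]> table = new TreeMap<>();

    void add(long seqno, byte[] payload) {
        table.put(seqno, payload);
    }

    // Assumption: a message is stable once every member has delivered it,
    // so the stable seqno is the minimum of the members' highest delivered seqnos.
    static long stableSeqno(long[] highestDeliveredPerMember) {
        long min = Long.MAX_VALUE;
        for (long seqno : highestDeliveredPerMember) {
            min = Math.min(min, seqno);
        }
        return min;
    }

    // Drop every message with a seqno up to and including the stable seqno.
    void purge(long stable) {
        table.headMap(stable, true).clear();
    }

    int size() {
        return table.size();
    }

    public static void main(String[] args) {
        StabilitySketch window = new StabilitySketch();
        for (long seqno = 1; seqno <= 1000; seqno++) {
            window.add(seqno, new byte[16]);
        }

        // Three members; the third one is slow and has only delivered up to 10.
        window.purge(stableSeqno(new long[] {1000, 1000, 10}));
        System.out.println("retained with slow member: " + window.size()); // 990

        // Once the slow member is gone, the remaining members agree on 1000.
        window.purge(stableSeqno(new long[] {1000, 1000}));
        System.out.println("retained after slow member left: " + window.size()); // 0
    }
}
```

In this model a single lagging member pins almost the whole table in memory, which matches the observation that stability purges the delivered messages soon after that member is removed.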

Diagnostic Steps

Get server.log from all cluster nodes for the time in question and analyse
- Check log for NAKACK "message not found in retransmission table" where messageID is too low
- If present, you may be hitting an outstanding JGroups issue: a node that has been kicked out of the cluster sometimes merges back in before it realizes it was kicked and rejoins, and the merge does not handle that case correctly. Provide the output of the command java -jar $JBOSS_HOME/server/$PROFILE/lib/jgroups.jar to confirm the version of JGroups currently in use.
Verify whether this is the result of long garbage collection (GC) pauses. See how to enable garbage collection logging if you do not have GC logs enabled already
- From the GC logs, if long GC pauses are present, establish whether it is a symptom or cause of the problem (NAKACK retention could be the cause of the GC pauses, which wouldn't occur if NAKACK wasn't holding the memory)
Using the Eclipse Memory Analyzer Tool (MAT), expand the heap and look for IP/highest_stability_seqno/highest_received/retransmitter in the org.jgroups.stack.NakReceiverWindow
- In the OQL pane, run the command SELECT * FROM org.jgroups.stack.NakReceiverWindow, sort the results by retained heap, and then inspect the attributes listed above
- Select the org.jgroups.stack.NakReceiverWindow object occupying a high percentage of the heap. The difference between highest_received and highest_delivered shows how many messages are yet to be delivered

In the example below, 61446 messages (highest_delivered minus highest_stability_seqno) remain in the table and could be removed when the next STABLE round runs

Type   |Name                   |Value
-------------------------------------------
long   |highest_stability_seqno|1207601
.. .. ..
long   |highest_received       |1269047
long   |highest_delivered      |1269047
-------------------------------------------
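
The arithmetic behind these figures can be reproduced with a few lines of Java. This is only a sketch: the class SeqnoBacklog and its method names are hypothetical helpers, and the three values are copied from the table above.

```java
// Hypothetical helper for reading the seqno fields of a NakReceiverWindow.
class SeqnoBacklog {
    // Messages received but not yet delivered to the application.
    static long undelivered(long highestReceived, long highestDelivered) {
        return highestReceived - highestDelivered;
    }

    // Delivered messages still held in the table until STABLE purges them.
    static long awaitingStability(long highestDelivered, long highestStabilitySeqno) {
        return highestDelivered - highestStabilitySeqno;
    }

    public static void main(String[] args) {
        // Values taken from the heap dump example.
        long highestStabilitySeqno = 1207601L;
        long highestReceived = 1269047L;
        long highestDelivered = 1269047L;

        System.out.println("undelivered: "
                + undelivered(highestReceived, highestDelivered)); // 0
        System.out.println("awaiting STABLE: "
                + awaitingStability(highestDelivered, highestStabilitySeqno)); // 61446
    }
}
```

Here nothing is waiting for delivery, so the whole backlog consists of already delivered messages that are waiting for stability.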

Check for network connectivity issues based on heap dump analysis

SBR

JBoss Clustering

Product(s)

Red Hat JBoss Enterprise Application Platform

Components

cluster
