EAP 6 EJB threads bottleneck in the classloader when another nodes joins the cluster

Solution Verified - Updated

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP) 6.x
  • Oracle JDK 1.7 , OpenJDK 1.7
  • JBoss OSGi class loader

Issue

Third JBoss node hangs when joining a cluster which is already handling requests.

Use Case

  • We run 2 Jboss nodes on different machine to create a cluster.
./standalone.sh -c standalone-full-ha.xml -b eth1 -bmanagement eth1 -u 230.0.0.9 -Djboss.node.name=node1 -Djgroups.bind.addr=eth1
./standalone.sh -c standalone-full-ha.xml -b eth2 -bmanagement eth2 -u 230.0.0.9 -Djboss.node.name=node2 -Djgroups.bind.addr=10.20.30.42
  • We are using two caches, "ejb" for the cluster communication and "testConfigRepl" for report configurations needed when handling every incoming call. The config cache is replicated and uses TCP as protocol.

  • The client has been connected to the nodes using Remote Clustered EJB

  • The client is sending request to the nodes with rate 800 requests per second. And both the servers are executing the requests successfully.

  • Let a third node join the cluster, with a similar command:
    ./standalone.sh -c standalone-full-ha.xml -b eth3 -bmanagement eth3 -u 230.0.0.9 -Djboss.node.name=node3 -Djgroups.bind.addr=eth3

Observations
After the new node has declared "JBoss EAP 6.1.0.GA started ..." and then "Starting JGroups channel" (for replicated cache), it manages to handle a couple of calls. Then it hangs.

  • A number threads have stacks similar to
"EJB default - 98" prio=10 tid=0x00007fab5c13c800 nid=0x3a18 in Object.wait() [0x00007fab43c48000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x000000077758da50> (a org.jboss.osgi.framework.internal.HostBundleClassLoader)
	at java.lang.Object.wait(Object.java:503)
	at org.jboss.modules.ConcurrentClassLoader.performLoadClassChecked(ConcurrentClassLoader.java:415)
  • A number in RUNNABLE state but waiting on condition ( they are calling different methods in the application so not all the same)
  • A number of threads blocked waiting for the classloader ( but there are not hanging at the same method invocation)
EJB default - 86"   prio=10 tid=0x00007fab681d6000 nid=0x3a08 waiting for monitor entry [0x00007fab44c5a000]
     java.lang.Thread.State: RUNNABLE
      at java.lang.Class.getDeclaredConstructors0(Native Method)
      at java.lang.Class.privateGetDeclaredConstructors(Class.java:2413)
      at java.lang.Class.getConstructor0(Class.java:2723)
      at java.lang.Class.newInstance0(Class.java:345)
      at java.lang.Class.newInstance(Class.java:327)
      at com.example....
  • ClassLoader thread stack shows the following in most of the thread dumps.
"ClassLoader Thread" daemon prio=10 tid=0x00007fab88208800 nid=0x3939 waiting on condition [0x00007fab4ef56000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
	at org.jboss.modules.ModuleClassLoader.doDefineOrLoadClass(ModuleClassLoader.java:344)
	at org.jboss.modules.ModuleClassLoader.defineClass(ModuleClassLoader.java:422)
	at org.jboss.modules.ModuleClassLoader.loadClassLocal(ModuleClassLoader.java:260)
	at org.jboss.modules.ModuleClassLoader$1.loadClassLocal(ModuleClassLoader.java:75)
	at org.jboss.modules.FilteredIterableLocalLoader.loadClassLocal(FilteredIterableLocalLoader.java:48)
	at org.jboss.modules.Module.loadModuleClass(Module.java:526)
	at org.jboss.modules.ModuleClassLoader.findClass(ModuleClassLoader.java:188)
	at org.jboss.modules.ConcurrentClassLoader.performLoadClassUnchecked(ConcurrentClassLoader.java:444)
	at org.jboss.modules.ConcurrentClassLoader.performLoadClassChecked(ConcurrentClassLoader.java:432)
	at org.jboss.modules.ConcurrentClassLoader.access$400(ConcurrentClassLoader.java:51)
	at org.jboss.modules.ConcurrentClassLoader$LoaderThread.run(ConcurrentClassLoader.java:603)
	- locked <0x00000007755e4fc0> (a org.jboss.osgi.framework.internal.HostBundleClassLoader)

Resolution

Add -Djboss.modules.safe-jdk=true to the jvm startup options

In a future release, Red Hat will look at going to "safe JDK" mode. This change has already been made Content from github.com is not included.upstream and is captured under Content from issues.jboss.org is not included.MODULES-181 and Content from issues.jboss.org is not included.MODULES-182

This will let JBoss Modules essentially completely disregard anyone who locks a class loader for any reason since it has no relevance to the class loading process at all on 7+.
Classes are loaded on request without locking the class loader, the CL thread is never created, and deadlocks are (hopefully!) impossible.

Root Cause

All the EJB threads are trying to load a class at the same time and are bottle necking on the classloader thread the load is delegated to. That's why the threads are parked under InterpreterRuntime::resolve_invoke - the interepreter calls that method when it tries e.g. to create an instance or call a method of a class and finds that the class has not already been loaded.

Background Information


JBoss Modules normally runs in one of four modes: "safe JDK" mode, JDK 7+ mode, "lockless" mode, or "fallback" mode.
  • safe JDK mode - Designed for JRockit. Basically it means "trust that the JDK will never have this problem".

  • lockless mode - Designed for JDK 6. Under most JDK 6 implementations, when a class is loaded, the class loader is locked for the duration of the class load operation. However, if you load a class from class loader A which needs a class from class loader B, and vice-versa (from another thread), you can deadlock as the locks are acquired in opposite order. Since modules form an arbitrary network of module dependencies, this deadlock can occur easily.
    On the JDKs that are known to have this problem (this includes OpenJDK and Oracle JDK 6), special code is inserted which detects and releases the lock in order to prevent this from occurring.
    Note IBM 1.6 JVM does not have this problem, thus this code is never triggered since the class loader is never locked*.

  • JDK7+ mode - Dependency on the "Content from docs.oracle.com is not included.parallel capable" extension that was added to class loading. In this mode, there is a lock taken, however there is a separate lock for every class, and they are taken in a predictable order, thus avoiding the problem entirely. Since the class loader is never locked* in this mode, the lockless/fallback code never comes into play.

  • "fallback mode - Used in JDK 6 which does suffer from the deadlock problem but with which we cannot safely use our lockless mode. In this mode, if a class is loaded while the class loader lock is held, we release the lock in the only way possible: we use the wait() method and treat the class loader as a condition. The class load operation is then delegated to a dedicated thread. We know it can never deadlock because only one thread can ever hold the lock. However, unfortunately there are still some scenarios which can cause this strategy to fail.

This brings us (*) to the assertions of "never locked" up above. If some other piece of code decides to, for its own reasons, lock a class loader, this can mess up "fallback" mode causing it to wait for the lock to be released - and that lock might not be released until a class is loaded by the class load thread, hence deadlock.

The reason this occurs is that in JDK 7, we turn off "lockless" mode, and "safe JDK" mode is off, which leaves the "fallback" mode code in place. However it is falsely triggered because the owner of the CL lock is not in fact someone loading classes at all. Now the CL thread starts loading classes and taking locks which are potentially already held by other threads, which may in turn try to load classes (and fail, because the CL thread is waiting for a lock).

Diagnostic Steps

Capture data for analysis


If the issue is easily reproducible, do the following
  • Install openjdk 7.
    • Add the debug channel
 rhn-channel -a -c rhel-x86_64-server-6-debuginfo
  • Install the debug package
    debuginfo-install java-1.7.0-openjdk
  • Capture thread dumps (spanning 5 mins)
  • Enable trace logs for jboss classloading - server.log -
    • add verbose:class to JAVA_OPTS to capture class loading/unloading
    • enable logging for the classloader by setting the following in the equivalent standalone*.xml
   <logger category="org.jboss.modules">
      <level name="TRACE"/>
   </logger>

Or CLI

/subsystem=logging/logger=org.jboss.modules:write-attribute(name=level,value="TRACE")
  • pstack output for the jvm process
  • source code for custom classes which feature in the threadumps
  • If running on RHEL dump the following in a script and upload the resulting outputs.tar.bz2
  # top -d 15 -n 8 -b > /tmp/top.out
  # vmstat 15 8 > /tmp/vm.out
  # iostat -tkx 15 8 > /tmp/io.out
  # ps aux > /tmp/ps.out
  # ps -eLf > /tmp/psh.out
  # sar -P ALL 15 8 > /tmp/sar.out
  # cat /proc/meminfo > /tmp/meminfo.out
  # tar -cjvf outputs.tar.bz2 /tmp/*.out

Analyse the data

  • Verify there no issues with disc I/O
  • Run the following against the thread dumps captured
 egrep --context=4 -n 'ClassLoader Thread|Full thread dump|ConcurrentClassLoader.performLoadClassChecked' /path/to/threadump.log

This collects, in order, the text around
_the start of each thread dump, the dump of the classloader thread, the dump of each stalled EJB thread

  • In every dump where the EJB threads are stalled under performLoadClassChecked, verify the ClassLoader thread is RUNNABLE and executing
    defineClass1. In the thread dumps where no EJB threads are stalled, the ClassLoader thread is WAITING.

  • Verify EJB threads do not appear to be making any progress in the thread dumps captured over a period of 5 mins

    • EJB the threads are blocked at different points in application code i.e. issue is not limited to a particular class or method invocation
    • EJB some threads are "waiting on condition" or "waiting for monitor entry" e.g.
"EJB default - 10" prio=10 tid=0x00007f61f82e1800 nid=0x7b1e waiting on condition [0x00007f6220b54000]      java.lang.Thread.State: RUNNABLE
   .. .. ..
"EJB default - 93" prio=10 tid=0x00007f61f8317800 nid=0x7b7d waiting for monitor entry [0x00007f61b28ee000]      java.lang.Thread.State: RUNNABLE
  • Mapping the thread stack to the pstack output shows threads are stuck in the InterpreterRuntime native layer
"EJB default - 93" prio=10 tid=0x00007f61f8317800 nid=0x7b7d waiting for monitor entry [0x00007f61b28ee000]      java.lang.Thread.State: RUNNABLE

Thread 9 (Thread 0x7f61b28f1700 (LWP 31613)):
#0  0x00007f624fb0f7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f624eb8f05e in os::PlatformEvent::park(long) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#2  0x00007f624eb7dd0a in ObjectMonitor::EnterI(Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00007f624eb7f981 in ObjectMonitor::enter(Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00007f624e96a0b0 in instanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00007f624e969f1f in instanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#6  0x00007f624e96e4bf in instanceKlass::initialize_impl(instanceKlassHandle, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#7  0x00007f624e96ed91 in instanceKlass::initialize(Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#8  0x00007f624e9a68cb in InterpreterRuntime::_new(JavaThread*, constantPoolOopDesc*, int) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#9  0x00007f624501d9fd in ?? ()
 .. .. 
 .. ..
"EJB default - 10" prio=10 tid=0x00007f61f82e1800 nid=0x7b1e waiting on condition [0x00007f6220b54000]      java.lang.Thread.State: RUNNABLE
.. ..
Thread 92 (Thread 0x7f6220b58700 (LWP 31518)):
#0  0x00007f624fb0f43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f624eb89c3b in os::PlatformEvent::park() () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#2  0x00007f624eb5a933 in Monitor::IWait(Thread*, long) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00007f624eb5afa6 in Monitor::wait(bool, long, bool) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00007f624ec6e568 in SystemDictionary::double_lock_wait(Handle, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00007f624ec73a16 in SystemDictionary::resolve_instance_class_or_null(Symbol*, Handle, Handle, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#6  0x00007f624ec74f73 in SystemDictionary::resolve_or_fail(Symbol*, Handle, Handle, bool, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#7  0x00007f624e86288e in constantPoolOopDesc::klass_at_impl(constantPoolHandle, int, Thread*) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
#8  0x00007f624e9a6706 in InterpreterRuntime::_new(JavaThread*, constantPoolOopDesc*, int) () from /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre/lib/amd64/server/libjvm.so
Components

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.