An OOME/hung JVM causes the domain controller to become non-responsive in EAP 6

Solution Verified - Updated

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP)
    • 6.x
      • Domain Mode

Issue

  • When using domain mode, we have found that a hung JVM can cause the domain controller/host controller to become non-responsive.
  • The domain controller's CLI was locked (along with the host's CLI) while a JVM was hung. Once the offending JVM is killed, the domain/host controller recovers and functions as normal.
  • Is there any way to sync this host to the domain controller through CLI commands?

Resolution

  • Migrate to EAP 6.4.13
  • In EAP 6 versions prior to 6.4.13, to clear up the hang of the CLI operations, you will need to issue a kill -9 against the hung JVM's PID as well as the JVM's host controller. Once that is done, then you can restart the host controller and it will reconnect to the master.
  • Starting in EAP 6.3.0, there are a couple of better options to try before the kill -9:
    • EAP 6.3.0 supports a timeout for potentially hung management operations, preventing long or potentially permanent domain wait states. This timeout can be set through the jboss.as.management.blocking.timeout system property (default is 300 seconds, i.e. 5 minutes).
    • In certain circumstances a management operation can be canceled. First, issue a CLI command to list any long-running operations that could be hung:
[Standalone] : 
    /core-service=management/service=management-operations:find-non-progressing-operation

[Domain]:
    /host=[Host name]/core-service=management/service=management-operations:find-non-progressing-operation

Then issue the command to cancel the operation:

[Standalone] : 
    /core-service=management/service=management-operations:cancel-non-progressing-operation

[Domain]:
    /host=[Host name]/core-service=management/service=management-operations:cancel-non-progressing-operation

The find operation does not need to be issued before the cancel, but it can be helpful to understand the status of the long-running operation before cancelling it.
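
The find and cancel operations above can also be issued non-interactively. The following is a minimal sketch, not a supported tool: the function names and the TARGET_HOST and CONTROLLER variables are hypothetical, and it assumes JBOSS_HOME is set and the native management port is 9999 as in the examples in this article.

```shell
#!/bin/sh
# Sketch: build and optionally run the find/cancel operations from this article.
# TARGET_HOST and CONTROLLER are hypothetical variables introduced for illustration.

# Print the management-operations resource address.
# $1 = host name in the domain; an empty value yields the standalone address.
mgmt_ops_address() {
    if [ -n "$1" ]; then
        printf '/host=%s/core-service=management/service=management-operations' "$1"
    else
        printf '/core-service=management/service=management-operations'
    fi
}

# Print the full CLI operation. $1 = host name, $2 = operation name.
mgmt_op() {
    printf '%s:%s' "$(mgmt_ops_address "$1")" "$2"
}

# Run one operation through jboss-cli.sh in non-interactive mode.
run_mgmt_op() {
    "$JBOSS_HOME/bin/jboss-cli.sh" --connect \
        --controller="${CONTROLLER:-localhost:9999}" \
        --command="$(mgmt_op "$1" "$2")"
}

# Contact a controller only when TARGET_HOST is set explicitly.
if [ -n "${TARGET_HOST:-}" ]; then
    run_mgmt_op "$TARGET_HOST" find-non-progressing-operation
    # After reviewing the output, cancel the operation if appropriate:
    # run_mgmt_op "$TARGET_HOST" cancel-non-progressing-operation
fi
```

For example, running the script with TARGET_HOST=slave would issue the [Domain] form of the find operation against the host named slave. The cancel call is left commented out so that the long-running operation can be reviewed first.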

Root Cause

Once a managed node is in a hung state, issuing an undeploy command causes the CLI to hang while it waits for the application to undeploy. After the undeploy operation has been initiated, a connection can be made successfully, but issuing reload --host=slave on the slave DC just hangs. When the undeploy command is killed, the other commands that were issued return without doing any work and without returning a status. If a connection to the slave controller is made and the reload command is issued, it hangs as well.

The server log on the slave DC will look similar to the following. At this point, however, the master DC does not offer the slave host as an option for the reload --host= command.

Slave console:

[Host Controller] 17:35:06,633 WARN  [org.jboss.as.host.controller] (Remoting "slave:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
[Host Controller] 17:35:06,893 INFO  [org.jboss.as.host.controller] (domain-connection-threads - 9) JBAS010916: Reconnected to master

Master console:

[Host Controller] 13:35:07,269 ERROR [org.jboss.as.domain.deployment] (management-handler-thread - 7) JBAS010809: ConcurrentUpdateTask caught InterruptedException waiting for task org.jboss.as.domain.controller.plan.ConcurrentServerGroupUpdateTask@7f4f4c63; returning
[Host Controller] 13:35:07,307 WARN  [org.jboss.as.domain] (Remoting "master:MANAGEMENT" task-2) JBAS010929: Connection to remote host "slave" closed unexpectedly
[Host Controller] 13:35:07,307 INFO  [org.jboss.as.domain] (Remoting "master:MANAGEMENT" task-2) JBAS010925: Unregistered remote slave host "slave"
[Host Controller] 13:35:07,308 WARN  [org.jboss.as.host.controller] (management-handler-thread - 7) JBAS010802: Interrupted awaiting final response from server server-one on host master

Slave CLI command:

[domain@localhost:9999 /] reload --host=slave
JBAS014883: No resource definition is registered for address [("host" => "slave")]

Even when the console is terminated for the domain controller, exceptions are thrown and it still cannot stop server-one (the hung JVM):

[Host Controller] 13:50:45,203 WARN  [org.jboss.as.controller] (Host Controller Service Threads - 34) JBAS014618: Graceful shutdown of the handler used for native management requests did not complete within [15000] ms but shutdown of the underlying communication channel is proceeding
[Host Controller] 13:50:45,211 INFO  [org.jboss.as.host.controller] (Host Controller Service Threads - 20) JBAS010923: Stopping server server-two
[Host Controller] 13:50:45,212 INFO  [org.jboss.as.host.controller] (Host Controller Service Threads - 20) JBAS010923: Stopping server server-one
13:50:45,214 INFO  [org.jboss.as.process.Server:server-one.status] (ProcessController-threads - 7) JBAS012018: Stopping process 'Server:server-one'

The hung JVM must be killed via kill -9 against the server-one PID to stop it. The slave host controller also needs to be killed and restarted manually. Once the hung server process and its host controller have been killed manually, the host controller can be restarted and will rejoin the domain as if nothing was wrong.
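
The kill step can be sketched as below. This assumes a Linux host with pgrep available, and it assumes the managed server's JVM command line carries a -D[Server:<name>] marker, which is how EAP 6 domain-mode server processes typically appear in ps output; verify the match with ps before killing anything. The function names and the HUNG_SERVER variable are hypothetical.

```shell
#!/bin/sh
# Sketch: locate and force-kill a hung managed server JVM by its server name.
# Assumption: the JVM command line contains "-D[Server:<name>]".

# Print a regular expression matching the server marker on the command line.
server_pattern() {
    printf '\\[Server:%s\\]' "$1"
}

# Look up the PID(s) for the named server and send SIGKILL.
kill_hung_server() {
    pids=$(pgrep -f "$(server_pattern "$1")") || {
        echo "No process found for server $1" >&2
        return 1
    }
    echo "Killing server $1 (PID(s): $pids)"
    # Intentionally unquoted so that several PIDs are passed as separate arguments.
    kill -9 $pids
}

# Act only when HUNG_SERVER is set explicitly, e.g. HUNG_SERVER=server-one.
if [ -n "${HUNG_SERVER:-}" ]; then
    kill_hung_server "$HUNG_SERVER"
fi
```

The host controller on the affected host still has to be killed and restarted separately, as described above.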

Diagnostic Steps

To sync the host controller (HC) back with the domain controller (DC), try reloading the HC via the CLI.

If the DC is still running, connect to the slave host controller via the CLI and reload it explicitly, as in the steps below:

  1. Connect to the slave host controller.
$JBOSS_HOME/bin/jboss-cli.sh -c --controller=<slave-host>:9999
  2. Issue an explicit reload from the slave HC.
[domain@<slave-host>:9999 /] reload --host=<slave-name> --restart-servers=false
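
The two steps above can be combined into a single non-interactive call. This is a minimal sketch under the same assumptions as the steps (JBOSS_HOME set, native management port 9999); the function names and the SLAVE_ADDR and SLAVE_NAME variables are hypothetical.

```shell
#!/bin/sh
# Sketch: connect to the slave host controller and reload it in one call.
# SLAVE_ADDR (network address) and SLAVE_NAME (host name in the domain model)
# are hypothetical variables introduced for illustration.

# Print the reload command for the given host name.
reload_command() {
    printf 'reload --host=%s --restart-servers=false' "$1"
}

# $1 = slave host controller address, $2 = slave host name in the domain.
reload_slave() {
    "$JBOSS_HOME/bin/jboss-cli.sh" -c \
        --controller="$1:9999" \
        --command="$(reload_command "$2")"
}

# Contact the slave only when both variables are set explicitly.
if [ -n "${SLAVE_ADDR:-}" ] && [ -n "${SLAVE_NAME:-}" ]; then
    reload_slave "$SLAVE_ADDR" "$SLAVE_NAME"
fi
```

If the reload itself hangs, as described under Root Cause, the manual kill and restart remains the fallback.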

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.