GMS flush by coordinator failed


Environment

  • Red Hat JBoss Enterprise Application Platform (EAP)
    • 5

Issue

  • One of my JBoss nodes is unable to re-join a cluster.

      WARN  [GMS] join(<local_ip_address>:7600) sent to <ip_address>:7600 timed out (after 3000 ms), retrying
    
  • The following message occurs in the log of the current master node:

      GMS flush by coordinator at <ip_address>:7600 failed
    

    How can I tell which node in the cluster is blocking and preventing other nodes from joining?

  • We have 4 nodes in our cluster. One JVM failed to start with the error "could not flush the cluster for state retrieval":

      ERROR [org.jboss.kernel.plugins.dependency.AbstractKernelController] (main) Error installing to Start: name=HAPartition state=Create
      java.lang.IllegalStateException: Node <ip_address>:7600 could not flush the cluster for state retrieval
          at org.jgroups.JChannel.getState(JChannel.java:1106)
          at org.jgroups.JChannel.getState(JChannel.java:1031)
    

Resolution

If the problem appears to be isolated to the JBM-CTRL channel, see the article "JBoss Messaging clustering problem after node leaves the cluster".

Determine which node failed to respond to the coordinator's FLUSH request. That node is preventing other nodes from joining the cluster. In EAP 5.2 or later, the unresponsive node is logged with the error. For earlier versions, see the Diagnostic Steps section.

Once the problem node or nodes are located, diagnose why they are not responding:

  • Excessive garbage collection: Restart the problem node. Other nodes should then be able to join.
  • Multicast packet loss: Ask the network engineers to check whether multicast traffic is being dropped on the network. If multicast is not permitted, consider switching to TCP for cluster communications.
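
To check whether garbage collection is a plausible cause on the problem node, its GC log can be scanned for pauses long enough to miss a flush response. The following is a minimal Python sketch, not part of the product. It assumes the node was started with plain -verbose:gc logging in the classic HotSpot format (for example "[Full GC 1048576K->1040000K(1048576K), 12.3456789 secs]"). The function name long_gc_pauses is hypothetical, and the default threshold of 10 seconds is an assumption taken from the 10000 msec flush timeout shown in the sample DEBUG log in this article. Adjust both to match your configuration.

```python
import re

# Assumption: classic HotSpot -verbose:gc lines such as
#   [GC 524288K->131072K(1048576K), 0.0456789 secs]
#   [Full GC 1048576K->1040000K(1048576K), 12.3456789 secs]
# Other GC log formats (for example -XX:+PrintGCDetails) need a different pattern.
_PAUSE_RE = re.compile(r"\[(Full GC|GC)\b.*?(\d+\.\d+) secs\]")


def long_gc_pauses(gc_log_lines, threshold_secs=10.0):
    """Return (collection_type, pause_seconds) for each pause >= threshold_secs.

    threshold_secs defaults to 10.0, an assumed value matching a
    10000 msec flush timeout.
    """
    pauses = []
    for line in gc_log_lines:
        match = _PAUSE_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group(2))
        if seconds >= threshold_secs:
            pauses.append((match.group(1), seconds))
    return pauses


if __name__ == "__main__":
    sample = [
        "[GC 524288K->131072K(1048576K), 0.0456789 secs]",
        "[Full GC 1048576K->1040000K(1048576K), 12.3456789 secs]",
    ]
    for kind, seconds in long_gc_pauses(sample):
        print(kind, seconds)
```

Pauses at or above the threshold do not prove that garbage collection caused the failure, but they are a reasonable starting point before involving the network team.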

Root Cause

When certain events occur (e.g., state transfers and view changes), a flush request is sent by the cluster coordinator. Flushing forces cluster members to send any pending messages so that a state transfer or view change can take place. With respect to this error, one or more members of the cluster failed to respond to a flush request message within the timeout.
This could be attributed to excessive garbage collection or multicast packet loss on the unresponsive member, as described in the Resolution section.

Diagnostic Steps

To determine which node(s) in the cluster are not responding to the FLUSH request in JBoss EAP 5.0.x or 5.1.x:

  • Determine the node requesting the flush. This will be the node logging one of:

      GMS flush by coordinator at <ip_address>:7600 failed
    

    or

      java.lang.IllegalStateException: Node <ip_address>:7600 could not flush the cluster for state retrieval
    
  • Set the DEBUG level category for FLUSH in conf/jboss-log4j.xml on the node requesting the flush:

      <category name="org.jgroups.protocols.pbcast.FLUSH">
          <priority value="DEBUG" />
      </category>
    
  • Let the node run with debug logging for 1 or 2 minutes, until the error repeats.

  • Search the log for the string 'timed out waiting for flush responses':

      DEBUG [org.jgroups.protocols.pbcast.FLUSH] (ViewHandler,DefaultPartition-JMS-CTRL,<ip_address_0>:7600) At <ip_address_0>:7600 timed out waiting for flush responses after 10000 msec. Rejecting flush to participants [<ip_address_0>:7600, <ip_address_1>:7600, <ip_address_2>:7600, <ip_address_3>:7600, <ip_address_4>:7600, <ip_address_5>:7600]
    
  • Search backwards through the log for the previous FLUSH message. For example:

      DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-15,<ip_address_0>:7600) At <ip_address_0>:7600 FLUSH_COMPLETED from <ip_address_2>:7600,completed false,flushMembers [ <ip_address_0>:7600,  <ip_address_1>:7600, <ip_address_2>:7600, <ip_address_3>:7600, <ip_address_4>:7600, <ip_address_5>:7600],flushCompleted [<ip_address_0>:7600, <ip_address_1>:7600, <ip_address_3>:7600, <ip_address_4>:7600, <ip_address_5>:7600]
    
  • The FLUSH message contains two parts, the flushMembers and the flushCompleted:

      flushMembers [ <ip_address_0>:7600,  <ip_address_1>:7600, <ip_address_2>:7600, <ip_address_3>:7600, <ip_address_4>:7600, <ip_address_5>:7600]
      flushCompleted [<ip_address_0>:7600, <ip_address_1>:7600, <ip_address_3>:7600, <ip_address_4>:7600, <ip_address_5>:7600]
    
  • Determine which IP addresses listed in flushMembers are not in the flushCompleted list.
    In this example, <ip_address_2> did not respond to the FLUSH and is blocking other members from joining.
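
The comparison in the last steps can be automated when the member lists are long. The following is a minimal Python sketch, not a supported tool. It assumes the DEBUG lines have the same layout as the samples above: a FLUSH_COMPLETED line containing both a flushMembers [...] list and a flushCompleted [...] list, followed later by a line containing 'timed out waiting for flush responses'. The function names are hypothetical.

```python
import re

TIMEOUT_MARKER = "timed out waiting for flush responses"

# Captures the bracketed address list that follows each list name.
_LIST_RE = re.compile(r"(flushMembers|flushCompleted)\s*\[([^\]]*)\]")


def parse_flush_lists(line):
    """Return (flush_members, flush_completed) as lists of address strings."""
    found = {"flushMembers": [], "flushCompleted": []}
    for name, body in _LIST_RE.findall(line):
        found[name] = [addr.strip() for addr in body.split(",") if addr.strip()]
    return found["flushMembers"], found["flushCompleted"]


def missing_responders(line):
    """Return members listed in flushMembers but absent from flushCompleted."""
    members, completed = parse_flush_lists(line)
    done = set(completed)
    return [addr for addr in members if addr not in done]


def find_unresponsive_members(log_lines):
    """Inspect the last FLUSH line seen before the first timeout message.

    Returns a list of addresses, or None if no timeout message is found
    or no FLUSH line with both lists precedes it.
    """
    last_flush_line = None
    for line in log_lines:
        if TIMEOUT_MARKER in line:
            if last_flush_line is None:
                return None
            return missing_responders(last_flush_line)
        if "flushMembers" in line and "flushCompleted" in line:
            last_flush_line = line
    return None


if __name__ == "__main__":
    import sys

    # Usage: python find_flush_blockers.py server.log
    with open(sys.argv[1]) as log_file:
        print(find_unresponsive_members(log_file))
```

A result of None means the log did not contain the expected pair of messages, for example because DEBUG logging for FLUSH was not enabled before the error repeated.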

