JGroups locked for ~15 minutes during high load and network glitch
Environment
- Red Hat JBoss Enterprise Application Platform (EAP)
  - 7
- Red Hat Data Grid (RHDG)
  - 7
Issue
- During a network connection failure, JGroups failure detection (FD) is blocked and cluster communication is badly affected
- If the network connection is lost, the cluster 'hangs' for about 15 minutes until the instance is expelled, although the JGroups settings suggest the instance should be expelled after about 40 seconds
- If a node is blocked or disconnected during high traffic and should be expelled from the cluster, the expulsion takes longer than expected and the cluster does not work properly in the meantime
Resolution
Because the instance is already marked as dead, JGroups drops all messages to and from this node, and closing the socket clears the buffers as well, so there is no need to synchronize when closing the connection.
The connection can therefore be closed immediately.
This issue has been fixed in JGroups version 4.0.20, which is scheduled for EAP 7.2.4 and RHDG 7.3.2.
Root Cause
JGroups tries to close the connection if no heartbeat response or data is received from an instance. However, the output and input streams are buffered, and closing them might trigger a flush. Because JGroups uses the same lock for the close, .close() blocks if the TCP buffer is full. Which kind of load triggers the problem, and how long the processes stay blocked, therefore depends on the system's network settings.
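The blocking described above can be reproduced in isolation. The following is a minimal sketch, not JGroups source code: the class `StalledConnectionSketch`, its methods, and the `StalledStream` helper are hypothetical names introduced for illustration, and the sketch assumes that the lock taken by the close is the one sender threads also use. `StalledStream` stands in for a socket whose TCP send buffer is full. Closing the `BufferedOutputStream` under the lock flushes the pending bytes and stalls, while closing the raw stream directly (the approach described in the Resolution) returns at once and lets the stalled thread release the lock.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a peer connection; this is not JGroups source code.
class StalledConnectionSketch {

    // Simulates a socket stream whose TCP send buffer is full:
    // every write blocks until the stream is closed from another thread, then fails.
    static final class StalledStream extends OutputStream {
        private final CountDownLatch closed = new CountDownLatch(1);

        @Override public void write(int b) throws IOException { blockUntilClosed(); }

        @Override public void write(byte[] b, int off, int len) throws IOException { blockUntilClosed(); }

        @Override public void close() { closed.countDown(); }

        private void blockUntilClosed() throws IOException {
            try {
                closed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            throw new IOException("stream closed");
        }
    }

    // Assumption: one lock guards both sending and closing.
    private final ReentrantLock sendLock = new ReentrantLock();
    private final StalledStream raw = new StalledStream();
    private final OutputStream out = new BufferedOutputStream(raw, 1024);

    // Sender path: a small payload only lands in the buffer, nothing reaches 'raw' yet.
    void send(byte[] data) throws IOException {
        sendLock.lock();
        try { out.write(data); } finally { sendLock.unlock(); }
    }

    // Problematic pattern: close under the send lock; the close flushes the buffer and stalls.
    void closeWithLock() {
        sendLock.lock();
        try {
            out.close();
        } catch (IOException expectedOnceRawIsClosed) {
            // The pending bytes are discarded, which is acceptable for a dead peer.
        } finally {
            sendLock.unlock();
        }
    }

    // Pattern described in the Resolution: close the raw stream without lock or flush.
    void closeImmediately() { raw.close(); }

    // True if the send lock can be taken within the timeout, i.e. senders are not stuck.
    boolean lockAvailable(long millis) throws InterruptedException {
        if (!sendLock.tryLock(millis, TimeUnit.MILLISECONDS)) return false;
        sendLock.unlock();
        return true;
    }

    // Returns { lock was blocked during the locked close, lock was free after the immediate close }.
    static boolean[] demo() throws Exception {
        StalledConnectionSketch conn = new StalledConnectionSketch();
        conn.send(new byte[16]);                              // leave pending bytes in the buffer
        Thread closer = new Thread(conn::closeWithLock, "closer");
        closer.setDaemon(true);
        closer.start();
        while (!conn.sendLock.isLocked()) Thread.sleep(10);   // wait until the closer holds the lock
        boolean blocked = !conn.lockAvailable(200);
        conn.closeImmediately();
        closer.join(2000);
        boolean recovered = !closer.isAlive() && conn.lockAvailable(200);
        return new boolean[] { blocked, recovered };
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = demo();
        System.out.println("send lock blocked during locked close: " + result[0]);
        System.out.println("send lock free after immediate close: " + result[1]);
    }
}
```

In this model the stall lasts until another thread closes the raw stream. On a real system the duration would depend on TCP buffer sizes and retransmission timeouts, which is consistent with the dependence on network settings noted above.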
The issue is tracked by the following Jira issues:
JGroups JGRP-2350
EAP JBEAP-17161
JDG JDG-2922