What is the recommended approach for doing rolling upgrades in a JBoss cluster

Solution Verified - Updated

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP)
    • 8.x
    • 7.x
    • 6.x
    • 5.x
    • 4.x
  • mod_cluster
  • mod_jk
  • Apache

Issue

  • I have an environment with multiple nodes in a cluster, and I need to update some configuration files on each cluster node. What is the recommended approach to avoid downtime?

  • What is the recommended approach to upgrading a JBoss cluster? We have a singleton (which is NOT idempotent), so I can only assume that all nodes of the cluster must be fully stopped at the same time. Is there a best practice for these issues?

  • Some say we shouldn't need to do a full cluster restart at all when the various .xml files and Java packages are changed. Is this OK? Are we at risk of any issues if we do this?

  • I need to deploy a new version of an application to all nodes in the cluster and the changes are incompatible with the old ones.

  • Can we set a flag on JBoss AS that tells the web server (load balancer) its state (ready or not ready), so that the web server will only connect to nodes that are ready and running?

  • How can we do rolling deployments without downtime in EAP 6?

Resolution

Incompatible Upgrades:

The aim is to incrementally phase in your new application while ensuring the old application can complete current requests and service new requests.

Using mod_jk

  1. Disable half the workers listed in workers.properties; let's call them Batch-A. Go to the http://host/jkstatus page for mod_jk and set each Batch-A worker to disabled. A disabled worker continues to service its current sessions until they expire, but will not accept new requests. Make sure there is no failover to these nodes while this is in progress. You can check the active session count through a node's jmx-console: log in, follow the host=localhost,path=/applicationContext,type=Manager link for your application, and check the session statistics reported there.
  2. After the sessions have timed out, do the upgrade and restart the nodes while isolating them from the old cluster, as described in Isolating JBoss clusters running on the same network. This will create two clusters on the network.
  3. Enable the workers from (1) above.
  4. Disable the remaining workers; let's call them Batch-B. Again, these workers will continue to service current sessions until the sessions expire, but will not accept new requests. Make sure there is no failover to this set of nodes while this is in progress.
  5. After the sessions on the workers from (4) have timed out, do the upgrade specifying the same startup options as in (2) above; all nodes will now end up in the same cluster.
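As a sketch of step (1), the Batch-A workers can be disabled either interactively on the jkstatus page or directly in workers.properties via the `activation` member directive (available in mod_jk 1.2.19 and later). The worker names node1 through node4 and the balancer name loadbalancer are placeholders for illustration:

```
# workers.properties -- hypothetical four-node balancer
worker.list=loadbalancer,jkstatus

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2,node3,node4
worker.loadbalancer.sticky_session=true

# Batch-A: D = disabled; existing sticky sessions are still routed here,
# but no new sessions are created on these workers
worker.node1.activation=D
worker.node2.activation=D

worker.jkstatus.type=status
```

Changes made through the jkstatus page take effect at runtime but are not persisted, so mirroring them in workers.properties keeps the state across an httpd restart.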

Using mod_cluster

  1. Disable the application contexts for half the nodes on the mod_cluster-manager page; let's call them Batch-A. This allows current sessions to keep making requests, but no new requests will be routed to these nodes. You can check the active session count through a node's jmx-console: log in, follow the host=localhost,path=/applicationContext,type=Manager link for your application, and check the session statistics reported there.
  2. After the sessions have timed out, do the upgrade on these nodes and restart them. Start the upgraded nodes in a separate cluster, as described in Isolating JBoss clusters running on the same network. This will create two clusters on the network.
  3. The upgraded nodes should be enabled again automatically following the restart.
  4. Disable the application contexts for the remaining nodes on the mod_cluster-manager page; let's call them Batch-B. Again, these nodes will continue to service current sessions until the sessions expire, but will not accept new requests. Make sure there is no failover to this set of nodes while this is in progress.
  5. After the sessions on the nodes from (4) have timed out, do the upgrade specifying the same startup options as in (2) above; all nodes will now end up in the same cluster.
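Where clicking through the mod_cluster-manager page is impractical, the same drain can be scripted per node through the management CLI. The virtual host and context names below are illustrative, and EAP 6-style syntax is assumed:

```
# Run via bin/jboss-cli.sh --connect on the node being drained.

# Stop advertising the context to the balancer; existing sessions
# continue to be served until they expire.
/subsystem=modcluster:disable-context(virtualhost=default-host, context=/applicationContext)

# After the upgrade and restart the context is re-enabled automatically,
# but it can also be enabled explicitly:
/subsystem=modcluster:enable-context(virtualhost=default-host, context=/applicationContext)
```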

Separating the clusters in step (2) prevents cross talk between clusters while the updates are in progress.
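One way to realize the separation in step (2) is to start the upgraded nodes with a different multicast address (and, on EAP 5, a different partition name) so their group communication cannot reach the old cluster. The addresses and partition name below are placeholders:

```
# EAP 6/7 standalone HA node joining only the NEW cluster
bin/standalone.sh -c standalone-ha.xml -u 230.0.20.5

# EAP 5: separate partition name and multicast address
bin/run.sh -c all -g NewPartition -u 230.0.20.5
```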

When you need more than 50% of production up


If you cannot run production at 50% capacity during an off-peak time, you should probably consider increasing your production capacity. However, there is a way to do a rolling restart with fewer nodes down at once. The steps are similar to the above:

  1. Disable a single node, upgrade it, and bring it up in a new cluster.
  2. Configure the load balancer to send a fraction 1/(n-1) of new session traffic to the new cluster's nodes (so with four nodes, 1/3 of the traffic), and disable a second node in the old cluster. Configure failover so that sessions from one node can only fail over to other nodes in the same cluster.
  3. Repeat (2) for each remaining node, increasing the fraction directed to the new cluster each time. After you have upgraded m of n nodes, the fraction directed at the new cluster should be m/(n-1).
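The traffic split described above can be sanity-checked with a short calculation (this helper is purely illustrative, not part of any JBoss tooling):

```python
from fractions import Fraction

def new_cluster_share(m: int, n: int) -> Fraction:
    """Fraction of NEW session traffic the load balancer should send to
    the new cluster after m of n total nodes have been upgraded.
    While the rollout is in progress one old node is always disabled,
    so only n-1 nodes are accepting new sessions."""
    if n < 2 or not 1 <= m <= n - 1:
        raise ValueError("need n >= 2 and 1 <= m <= n-1")
    return Fraction(m, n - 1)

# With four nodes: the new cluster takes 1/3, then 2/3, then all new sessions.
shares = [new_cluster_share(m, 4) for m in (1, 2, 3)]
print(shares)
```

Note that when m reaches n-1 the share is already 1: the last old node is disabled and drained, so all new sessions go to the upgraded cluster.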

When doing this, the periods when single-node clusters exist (while upgrading the second and second-to-last nodes) are very sensitive to failures: no other node holds a replicated copy of the lone node's sessions, so a failure will result in those sessions being lost.

Limited time windows


If you have a limited time window to perform the upgrade, you may not be able to wait until all existing sessions on a node have finished. To work around this, configure how the load balancer performs failover: sessions for the node being drained must only fail over to members of the same cluster. Failover to the opposite cluster will cause session loss, since replication does not occur between clusters.
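With mod_jk, one way to keep failover inside each cluster is the worker `domain` directive: on failure, mod_jk only fails sessions over to workers sharing the failed worker's domain. The worker and domain names below are illustrative:

```
# workers.properties fragment -- restrict failover to the same cluster
worker.node1.domain=old-cluster
worker.node2.domain=old-cluster
worker.node3.domain=new-cluster
worker.node4.domain=new-cluster
```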

Compatible Upgrades:

Restart a few nodes at a time until you have rolled out all the changes to the entire cluster.

Note:

  • Incompatible upgrades: the old and new versions cannot coexist in the same cluster/environment because of data corruption, lost session data, or end-user experience issues.
  • Compatible upgrades: the changes are such that it is not a problem to have the old and new versions deployed side by side; there are no issues with data corruption, persistent data, etc.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.