Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster
Introduction
With one of the primary responsibilities of a High Availability or Resilient Storage cluster being to provide continuous service for applications or resources, it is especially important that updates be applied in a systematic and consistent fashion to avoid any potential disruption to the availability of those critical services. This document aims to outline Red Hat's recommended practices for applying updates to the cluster software itself and to the software comprising the base RHEL operating system, libraries, and utilities.
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8 or 9 with the High Availability or Resilient Storage Add On
- One or more pieces of software installed on cluster nodes or remote nodes that must be updated
Contents
Updating Software Packages in a RHEL High Availability and/or Resilient Storage Cluster
Important Notes
- WARNING: It is critical when performing software-update procedures for RHEL High Availability and Resilient Storage clusters to ensure that any node that will undergo updates is not an active member of the cluster before those updates are initiated. Swapping out the software that the cluster stack relies on while it is in use can lead to various problems and unexpected behaviors, including but not limited to issues that can cause complete outages of the cluster and services it is managing.
- Red Hat does not support in-place upgrades or rolling-upgrades of cluster nodes, remote nodes, and bundle container images from one major release of RHEL to another except for the limited exceptions noted below. For example, there is no supported method for updating some nodes in a cluster from RHEL 6 to RHEL 7, introducing them into the cluster with existing RHEL 6 nodes to take over resources from them, and then updating the remaining RHEL 6 nodes. Upgrades in major releases of RHEL must be done by migrating services from a running cluster on the old release to another cluster running the new release.
- Upgrade of systems using the High Availability Add-On from RHEL 6 to RHEL 7 is unsupported.
- Upgrade of systems using the High Availability Add-On from RHEL 7 to RHEL 8 is unsupported.
- In-place upgrade of systems using the High Availability Add-On is generally unsupported. The only supported rolling upgrades of cluster nodes, remote nodes, and bundle container images are listed below.
- RHEL 8.8+ to RHEL 9.2+. For more information on limitations and the procedure, see: Procedure to upgrade a RHEL 8 High Availability cluster to RHEL 9. Note that it is not supported to have Resilient Storage packages installed on cluster nodes while performing the procedure.
- RHEL 9+ to RHEL 10+. For more information on limitations and the procedure, see: Procedure to upgrade a RHEL 9 High Availability cluster to RHEL 10. Note that it is not supported to have Resilient Storage packages installed on cluster nodes while performing the procedure.
- Red Hat does not support rolling upgrades of shared storage that is exported with samba+ctdb: Does ctdb shared storage support rolling upgrades?
- While in the process of performing an update, do not make any changes to your cluster configuration. For example, do not add or remove resources or constraints.
- Although it is not required, when upgrading a Pacemaker cluster it is good practice to upgrade all cluster nodes before upgrading any Pacemaker Remote nodes or podman (Docker) containers used in bundles.
- Red Hat supports applying kernel live patches on member nodes of a RHEL High Availability or Resilient Storage cluster. Please see the Recommended Practices for using Kernel Live Patching in RHEL High Availability or Resilient Storage Clusters article for recommended practices and supported RHEL and kernel versions.
- The Red Hat Enterprise Linux (RHEL) Resilient Storage Add-On is no longer supported starting with RHEL 10 and any subsequent releases. The RHEL Resilient Storage Add-On will continue to be supported on earlier versions of RHEL (7, 8, and 9) throughout their respective maintenance support lifecycles.
Please feel free to contact Red Hat Global Support Services for assistance in planning an update, upgrade, or migration of any kind. Proper planning and risk mitigation is key to a successful update or migration, and Red Hat's experts can assist in ensuring the process goes as smoothly as possible.
General Overview of Update Procedures
Updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, can be done in one of two general ways:
- Rolling Updates: The basic idea is to take a fully formed and active cluster, remove one node from service by stopping its relevant services and daemons, update its software, then integrate it back into the cluster before repeating the procedure on another node. This allows the cluster to continue providing service and managing resources while each node is updated, and allows the updated node(s) to provide service while bringing the remaining node(s) up to the same software level. The node undergoing an update at each stage should not be a member of the cluster while the update is ongoing.
- Entire Cluster Update: When a cluster is able to undergo a complete outage, it can simplify update procedures greatly. Such situations allow for stopping the entire cluster, applying updates to all nodes simultaneously (or one after another, if preferred), and then starting the cluster back up together. One of the primary benefits of such a procedure is that there is no time when nodes should be running separate versions of the software, thereby eliminating any risk of incompatibilities or unexpected behavior due to such mismatches. This option also eliminates any complexity that might exist with repeatedly moving resources around in the cluster to accommodate each node stopping and then rejoining.
Risks and Considerations
- When performing a Rolling Update, the presence of different versions of the High Availability and Resilient Storage software within the same cluster introduces a risk of unexpected behavior. While Red Hat does seek to eliminate any known incompatibilities between different releases within the same major release of RHEL, it performs only limited testing of different versions of the software operating simultaneously. It is always possible that some previously unforeseen incompatibility between versions could cause unexpected behavior, so the only way to completely eliminate this risk is to use the Entire Cluster Update method.
- New software versions always come with the potential for unexpected behavior, changes in functionality that may require advance preparation, or in rare cases, bugs that could impact the operation of the product. Red Hat strongly recommends having a test, development, or staging cluster configured identically to any production clusters, and rolling out any updates to such a cluster first for thorough testing prior to the roll-out in production.
- Performing a Rolling Update necessarily means reducing the overall capacity and redundancy within the cluster. The size of the cluster dictates whether the absence of a single node poses a significant risk, with larger clusters obviously being able to absorb more node failures before reaching the critical limit, and with smaller clusters being less capable or not capable at all of withstanding the failure of another node while one is missing. It is important that the potential for failure of additional nodes during the update procedure be considered and accounted for. If at all possible, taking a complete outage and updating the cluster entirely may be the preferred option so as to not leave the cluster operating in a state where additional failures could lead to an unexpected outage.
- Updates to the pacemaker package sometimes bring a change in Pacemaker's crm_feature_set. This can introduce a risk when performing rolling updates, as the cluster requires all nodes to run the same crm_feature_set; if it changes, all nodes must be updated before the updated ones can rejoin. We recommend checking whether your update will bring a change in crm_feature_set, and testing the update procedure in a test or staging environment before updating production.
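As a rough illustration, a dotted-version comparison such as the following can flag a crm_feature_set change before a rolling update begins. The helper names and the decision logic are our own sketch, not part of any Red Hat tooling; it relies on GNU sort's `-V` (version sort):

```shell
#!/bin/sh
# Hypothetical helpers: compare two crm_feature_set strings (dotted numeric
# versions such as "3.16.2") to spot a feature-set change before starting a
# rolling update. Uses GNU sort's -V option for version ordering.

feature_set_differs() {
    # True (exit 0) if the two feature-set strings are not identical.
    [ "$1" != "$2" ]
}

feature_set_older() {
    # True (exit 0) if $1 sorts strictly before $2 as a version string.
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Example: if the current nodes run feature set 3.15.0 and the updated
# packages bring 3.16.2, the cluster will have mixed feature sets until
# every node has been updated.
if feature_set_older "3.15.0" "3.16.2"; then
    echo "feature set changes: plan to update all nodes"
fi
```

The same comparison can be applied to any pair of feature-set strings taken from the release notes of the old and new pacemaker packages.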
Procedure for Rolling Updates
The specific steps to follow differ depending on the RHEL release and style of cluster in use.
- Only run pcs commands on updated nodes whenever possible. The newer version should always be backward-compatible, but the older version may not be forward-compatible.
- Once the last active older cluster node has been taken out of the cluster, no older cluster node will be able to rejoin without being updated first.
RHEL 6, 7, 8 and 9 Clusters using pacemaker
Perform the following steps to update the base RHEL packages, High Availability Add-On packages, and/or Resilient Storage Add-On packages on each node in a rolling-fashion:
- Choose a single node where the software will be updated. To help reduce downtime, we recommend first updating nodes running the smallest number of services, or a cluster node that is the passive node for promotable Pacemaker-managed resources. If any preparations need to be made before stopping or moving the resources or software running on that node, carry out those steps now.
- Before starting the procedure, it is recommended that the pacemaker systemd service be disabled from starting at boot. Disabling pacemaker from starting at boot will prevent pacemaker from starting until it has been verified that all components managed by the cluster still work as expected.
The cluster stack can be disabled from starting on boot on this chosen node with:
# Syntax:
# pcs cluster disable [<node>]

# Example:
# pcs cluster disable node1.example.com
Enabling pacemaker to start at boot should only be done after it has been verified that the cluster is still able to manage all the cluster-managed resources without issues.
- If the cluster is composed of 3 or more cluster nodes then move any managed resources off of this node as needed. If there are specific requirements or preferences for where the resources should be relocated to, then consider creating new location constraints to place the resources on the correct node. The location of resources can be strategically chosen to result in the least number of moves throughout the Rolling Update procedure, rather than moving resources in preparation for every single node update.
Otherwise, if allowing the cluster to manage placement of resources on its own is acceptable, the next step will automatically take care of this.
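For example, a temporary location constraint can pin a resource to a specific node while its current host is updated. The resource and node names below are placeholders, and the `run` wrapper only echoes each command so this reads as a dry run; on a real cluster you would execute the pcs commands directly:

```shell
#!/bin/sh
# Sketch: temporarily pin a hypothetical resource "webserver" to node2 while
# node1 is updated, then drop the constraint afterwards. The run wrapper
# echoes instead of executing, so this is a dry run only.
run() { echo "+ $*"; }

# Prefer node2 while node1 is out of service (score defaults to INFINITY).
run pcs constraint location webserver prefers node2.example.com

# ... update node1 here ...

# Remove the temporary constraint so normal placement applies again.
# Look up the actual constraint id with `pcs constraint --full`; the id
# below follows the usual auto-generated naming and is illustrative.
run pcs constraint remove location-webserver-node2.example.com-INFINITY
```

Placing constraints this way, and removing them at the end of the whole procedure, keeps the number of resource moves to a minimum across the rolling update.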
- Place the chosen node in standby mode to ensure it is not considered in service, and to cause any remaining resources to be relocated elsewhere or stopped. Before proceeding to step 5, monitor pcs status to make sure all resources have been moved off the node being updated.
# Syntax:
# pcs node standby [<node>]

# Example:
# pcs node standby node1.example.com
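Before stopping the cluster software, the standby node should hold no resources. A small parser over `pcs status resources`-style output can be used in a wait loop; the helper name and the sample output format below are our own illustration, as the exact output varies by pcs version:

```shell
#!/bin/sh
# Hypothetical check: count lines of `pcs status resources`-style output
# that show a resource Started on the given node. The output format varies
# by pcs version, so treat the pattern as illustrative only.
resources_on_node() {
    # $1 = status text, $2 = node name; prints the number of matching lines.
    printf '%s\n' "$1" | grep -c "Started[[:space:]]*$2" || true
}

# Sample status text after node1 was placed in standby: everything has
# already moved to node2.
status='  * webserver   (ocf:heartbeat:apache):  Started node2.example.com
  * virtual-ip  (ocf:heartbeat:IPaddr2): Started node2.example.com'

# On a live cluster one might loop until the count reaches zero, e.g.:
#   while [ "$(pcs status resources | grep -c "Started.*$NODE")" -gt 0 ]; do sleep 5; done
echo "resources still on node1: $(resources_on_node "$status" node1.example.com)"
```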
- Stop the cluster software on the chosen node using pcs:
# Syntax:
# pcs cluster stop [<node>]

# Example:
# pcs cluster stop node1.example.com
- Perform any necessary software updates on the chosen node. There are various methods for doing so that are outside the scope of this article. Consult the general instructions for installing High Availability and Resilient Storage software, Knowledge Content in the Customer Portal, and/or the Product Documentation. After a cluster node has been updated, manually verify that all components on the node that are managed by the cluster are still working as expected; this avoids triggering unnecessary failures when the cluster stack is started again and the node resumes managing cluster resources.
- If any software was updated that necessitates a reboot, prepare to perform that reboot. It is recommended that cluster software be disabled from starting on boot so that the host can be checked to ensure it is fully functional on its new software versions before bringing it into the cluster, as noted in step 2.
Perform the reboot when ready, and when complete, ensure the host seems to be fully functional and is using the correct software in any relevant areas (such as having booted into the latest kernel). If anything does not seem correct, then do not proceed until the situation is resolved. Contact Red Hat Global Support Services for assistance if needed.
- Rejoin the updated node into the cluster.
# Syntax:
# pcs cluster start [<node>]

# Example:
# pcs cluster start node1.example.com
Check pcs status output to determine if everything appears as it should. Once the node seems to be functioning properly, reactivate it for service by taking it out of standby mode:
# Syntax:
# pcs node unstandby [<node>]

# Example:
# pcs node unstandby node1.example.com
- If any temporary location constraints were created earlier to control the placement of resources, then adjust or remove them to allow resources to go back to their normally preferred locations.
- After verifying that all pacemaker-managed cluster resources are able to run on the cluster node, enable pacemaker to start at boot.
# Syntax:
# pcs cluster enable [<node>]

# Example:
# pcs cluster enable node1.example.com
- Repeat steps 1-8 for each remaining node.
- After all the cluster nodes or remote nodes have been upgraded, run the following:

# pcs cluster cib-upgrade
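The per-node sequence above can be summarized as a short script. This is a sketch only: the `run` wrapper echoes each command rather than executing it, `yum update -y` is just a stand-in for whatever update method you use, and the manual verification between steps is deliberately omitted:

```shell
#!/bin/sh
# Dry-run sketch of the rolling-update sequence for one node. On a real
# cluster each step needs verification before the next, and the commands
# would be executed rather than echoed.
NODE=${1:-node1.example.com}
run() { echo "+ $*"; }

run pcs cluster disable "$NODE"   # keep the stack from starting at boot
run pcs node standby "$NODE"      # drain resources off the node
run pcs cluster stop "$NODE"      # stop the cluster software
run yum update -y                 # apply the updates (stand-in command)
run pcs cluster start "$NODE"     # rejoin the cluster
run pcs node unstandby "$NODE"    # put the node back in service
run pcs cluster enable "$NODE"    # re-enable start at boot once verified
```

Running the sketch prints the seven commands in order, which can serve as a checklist when working through the procedure by hand.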
RHEL 5 and 6 with cman (no pacemaker)
Perform the following steps to update the base RHEL packages, High Availability Add-On packages, and/or Resilient Storage Add-On packages on each node in a rolling-fashion:
1.) Choose a single node where the software will be updated. If any preparations need to be made before stopping or moving the resources or software running on that node, carry out those steps now.
2.) Move any managed resources off of this node as needed. If there are specific requirements or preferences for where resources should be relocated to, then consider moving them with clusvcadm, the Conga web administration interface, or ccs (RHEL 6 only) to place the resources on the correct node. Otherwise if allowing the cluster to manage placement of resources on its own is acceptable, then the next step will automatically take care of this.
3.) Stop all running cluster daemons on this chosen node. In RHEL 6 this can be done easily using ccs:
# Syntax:
# ccs -h <hostname> --stop

# Example:
# ccs -h node1.example.com --stop
NOTE: This also disables the cluster daemons from starting on boot.
4.) Perform any necessary software updates. There are various methods for doing so that are outside the scope of this article. Consult the general instructions for installing High Availability and Resilient Storage software, Knowledge Content in the Customer Portal, and/or the Product Documentation.
5.) If any software was updated that necessitates a reboot, prepare to perform that reboot. It is recommended that cluster software be disabled from starting on boot so that the host can be checked to ensure it is fully functional on its new software versions before bringing it into the cluster. The cluster daemons can be disabled from starting on boot on this chosen node with chkconfig <service> off, or using ccs as seen in step 3 above.
Perform the reboot when ready, and when complete, ensure the host seems to be fully functional and is using the correct software in any relevant areas (such as having booted into the latest kernel). If anything does not seem correct, then do not proceed until the situation is resolved. Contact Red Hat Global Support Services for assistance if needed.
Once everything appears to be set up correctly, re-enable the cluster daemons on this node using chkconfig, or with ccs as described in step 6 below.
6.) Rejoin the updated node into the cluster by starting the cluster daemons, or by using ccs in RHEL 6:
# Syntax:
# ccs -h <hostname> --start

# Example:
# ccs -h node1.example.com --start
NOTE: This enables cluster daemons to start automatically on boot.
Once all daemons are started, the node should be fully functional and managing resources within the cluster.
7.) Repeat steps 1-6 for each remaining node.
Procedure for Entire Cluster Update
The process for updating an entire cluster at once is nearly identical to the Rolling Update procedure above, with the single difference being that each step should be performed on all nodes before moving on to the next step. So, for example, stop the cluster daemons on each node before moving on to updating the software, and reboot each node before moving on to re-enabling the cluster software, etc. In the end, the goal is to stop the cluster software on all nodes, update those nodes, then start the cluster software again. The above steps can be used as a guide, and may even be simplified to skip some of the preparation steps if they are not required.
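Under the same dry-run convention used above (echoed commands, placeholder node names, `yum update -y` as a stand-in update method), the entire-cluster variant collapses to stopping everything, updating each node, and starting everything back up:

```shell
#!/bin/sh
# Dry-run sketch of an entire-cluster update: stop the cluster on all
# nodes, update each node, then start the cluster on all nodes together.
# The node list is a placeholder and `run` echoes instead of executing.
NODES="node1.example.com node2.example.com node3.example.com"
run() { echo "+ $*"; }

run pcs cluster stop --all          # stop the cluster software everywhere
for node in $NODES; do
    run ssh "$node" yum update -y   # update each node (stand-in command)
done
run pcs cluster start --all         # start the whole cluster back up
```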