Design Guidance for RHEL High Availability Clusters - sbd Considerations
Contents
- Overview
- Deciding whether to use sbd for fencing
- Designing a cluster that will use sbd fencing
- Designing sbd fencing
- Deployment and Administration Guidance
Overview
Applicable Environments
- Red Hat Enterprise Linux (RHEL) 6, 7, 8, or 9 with the High Availability Add-On
- All cluster nodes running RHEL 6 Update 8 or later, RHEL 7 Update 1 or later, RHEL 8, or RHEL 9
- pacemaker is or will be deployed on the cluster
Recommended Prior Reading
Useful References and Guides
Introduction
This guide provides Red Hat's recommendations, considerations, and essential references and knowledge for deploying sbd fencing in a RHEL High Availability cluster. The covered topics can be useful when you are considering whether sbd is the right STONITH method for your cluster, or when you are trying to decide on the configuration that achieves the requirements and goals of your cluster.
Deciding whether to use sbd for fencing
Is the cluster compatible with sbd: The following basic requirements should be considered first when deciding whether to use sbd:
For details on supported conditions, see: Support policies - sbd and fence_sbd
- All releases:
  - A suitable watchdog timer device must be available for the nodes that will utilize sbd self-fencing.
- RHEL 7 to 9:
  - RHEL 7 Update 4 or later is required for sbd poison-pill fencing via block-device.
  - RHEL 7 Update 1 or later is required for other basic levels of sbd fencing.
  - Clusters with an even number of members must be able to use either qdevice or auto_tie_breaker for quorum arbitration, or be switched to an odd-sized cluster with a member in a neutral location to serve as a cluster tie-breaker.
  - With a 2-node cluster (using the corosync 2-node feature - 2 cluster nodes with qdevice do not constitute a 2-node cluster), watchdog fencing does not make sense, but poison-pill fencing can still be used. This is one of the most common use cases for poison-pill fencing.
- RHEL 6:
  - RHEL 6 Update 8 or later is required.
  - The cluster must be running pacemaker. sbd is not compatible with cman-only clusters.
  - Even-sized clusters are not compatible with sbd. The cluster must consist of an odd number of members.
  - sbd is not compatible with qdisk.
Benefits of sbd: These factors may weigh in favor of using sbd in a cluster:
- sbd is not network-dependent, and so can provide a fence method for environments or situations where the network link between members may be severed.
- sbd offers a reliable fence method for multiple-site clusters that may otherwise present challenges in finding a suitable STONITH/fencing design.
- sbd does not require storing security credentials on nodes for power or storage management devices.
- Self-fencing through sbd does not depend on the health or responsiveness of an external device.
- With a watchdog timer device typically being integrated into the processor, motherboard, or virtualization platform of the cluster node, a failure of the watchdog component usually means the node itself has failed. So, even the failure scenario for this device type can lead to successful self-fencing.
Potential negative considerations against sbd: These factors may weigh against using sbd in a cluster:
- Red Hat recommends using sbd with a hardware watchdog timer device (when available to the cluster nodes). If a hardware watchdog is not available for the cluster nodes to use, then a software watchdog can be used. If a software watchdog is used, be aware of the limitations: Software-Emulated Watchdog Known Limitations. This is a general sbd recommendation and applies to poison-pill fencing as well as to watchdog fencing.
- Some virtualization environments, such as VMware platforms or RHEV, may not offer a compatible watchdog timer device. For more information, see: Support Policies for RHEL High Availability Clusters - sbd and fence_sbd.
- sbd is not available or supported on older releases.
- Some organizations may prefer to never have nodes automatically rebooted - such as when they want to preserve the state of a system that underwent a problem so it can be investigated.
Designing a cluster that will use sbd fencing
Quorum considerations: sbd carries out some of its fencing tasks by monitoring the quorum state of a cluster and taking action against any node that is found in an unhealthy state.
- See also: Exploring components: sbd and fence_sbd
If all nodes were to lose quorum in some failure-scenario, that could result in all nodes self-fencing - something most organizations wouldn't want to happen. If the goal of the cluster is maximum uptime, then it is especially important that quorum policies for the cluster are designed to always allow one partition of nodes to remain functional through all feasible failure scenarios.
Quorum designs to consider for the general classes of cluster architecture:
- Single-site clusters
  - Use an odd number of nodes, with sufficient redundancy in the network to prevent full cluster disruptions.
  - RHEL 7, 8, or 9 - Use corosync-qdevice with an lms algorithm to allow less than a majority to serve the cluster's functions in widespread failures.
    - Note: When using qdevice with fence_sbd, the environment must maintain a specific timeout configuration. For information on setting timeout values when your system is configured with a quorum device, see the section "Consider SBD_WATCHDOG_TIMEOUT for sbd health and quorum monitoring" in "Designing sbd fencing" below.
- Multi-site clusters
  - RHEL 7, 8, or 9, two sites - Use corosync-qdevice with the ffs algorithm
  - RHEL 7 or 8, more than two sites - Use corosync-qdevice with the lms algorithm
  - Two sites - Use a "tie-breaker" node in a neutral third site
When auto_tie_breaker is used in clusters with an even number of members, the failure of the partition containing the auto_tie_breaker_node (by default the node with the lowest ID) will cause the other partition to become inquorate, and it will self-fence. In 2-node clusters with auto_tie_breaker, this means that failure of the node favored by auto_tie_breaker_node (typically nodeid 1) will result in a reboot of the other node (typically nodeid 2) when it detects the inquorate state. If this is undesirable, corosync-qdevice can be used instead of auto_tie_breaker to provide an additional quorum vote, making the behavior closer to that of a cluster with an odd number of members.
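The corosync-qdevice designs above can be sketched with pcs. A configuration fragment, not something to run verbatim: the hostname is hypothetical, the commands require a live cluster and a reachable qnetd host, and exact syntax can vary between pcs releases.

```
# On the quorum-device host:
# pcs qdevice setup model net --enable --start

# On one cluster node - add the quorum device, choosing the algorithm
# (lms or ffs) appropriate to the cluster design above:
# pcs quorum device add model net host=qnetd-host.example.com algorithm=lms
```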
Additional STONITH configuration outside sbd: sbd does not have to be the only component in a cluster's STONITH topology. Consider the following when looking at the complete STONITH layout for a cluster:
- Traditional STONITH methods (power or storage) can still be useful either as redundancy to sbd, or as a potentially quicker option than sbd that would fall back to sbd if it fails or is too slow.
- Diagnostic methods like fence_kdump can be useful to help capture data in the event of an unexpected kernel panic or other problematic situation that needs investigation.
- RHEL 7 or 8: sbd poison-pill fencing via block-device can be ordered against other devices within STONITH levels just as other device types.
- RHEL 6: With no option for sbd poison-pill fencing via block-device, there is no other fallback or possible short-cut method behind sbd stonith-watchdog-timeout fencing. An additional device can offer another option for complete coverage and possibly quicker recovery from membership events.
- sbd stonith-watchdog-timeout fencing will execute in parallel with any other devices configured. If no STONITH device has completed the action successfully by the time the stonith-watchdog-timeout has passed, watchdog-timeout fencing succeeds.
In summary: If another STONITH method is available - such as a power-controller for which a fence-agent exists - it can be beneficial to use it in parallel with sbd.
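For example, a fencing topology that tries a power fence first and falls back to sbd poison-pill fencing might look like the following sketch. The node name, device paths, and fence_ipmilan parameters are hypothetical; this is a configuration fragment that only makes sense on a configured cluster.

```
# Level 1: try the IPMI power fence first
# pcs stonith create ipmi-node1 fence_ipmilan ip=192.0.2.10 username=admin password=secret pcmk_host_list=node1
# pcs stonith level add 1 node1 ipmi-node1

# Level 2: fall back to sbd poison-pill fencing if level 1 fails
# pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/sbd-lun
# pcs stonith level add 2 node1 fence-sbd
```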
Designing sbd fencing
Decide upon and verify suitable watchdog device: Consult the following for more detail on suitable watchdog devices:
- Support Policies: sbd and fence_sbd
- Diagnostic Procedure: Validating a Watchdog Timer Device (WDT) for Usage with sbd
A watchdog timer device is required on each node that will be managed by sbd. This watchdog method provides the "fencing" action when a cluster problem arises.
RHEL 7, 8 or 9: Decide whether sbd poison-pill fencing via block-device will be enabled: RHEL 7 Update 4 or later clusters can take advantage of an optional extra mode of fencing offered by sbd wherein fencing instructions can be communicated over shared-storage devices.
- See also: Exploring components
Red Hat recommends using this method unless some factor weighs heavily against it - such as a lack of shared storage.
The use of shared storage does not remove the need for a watchdog device. sbd poison-pill fencing via block-device requires a watchdog timer device and shared storage to send the poison-pill messages over.
If not using sbd poison-pill fencing via block-device, it is recommended to configure an additional STONITH method - such as a power-source with a fence-agent available to control it. sbd is more reliable with poison-pill fencing via block-device, but usage of sbd without it is still supported. Adding a fallback STONITH method beyond sbd when poison-pill is not in use can improve reliability of the cluster and may improve failure-recovery times.
Storage considerations with sbd poison-pill fencing via block-device: See Red Hat's support policies for sbd for requirements and conditions around the usage of shared storage for sbd poison-pill fencing via block-device:
How to configure sbd block storage across multiple storage arrays with sbd poison-pill fencing via block-device: Clusters with members spread across multiple distinct sites typically have their storage devices arranged across multiple storage arrays or targets, or there may be multiple redundant targets deployed for a single site needing additional redundancy. These setups typically come with some form of block-level replication provided by the storage target, keeping the data either synchronized or backed up between the targets for redundancy and availability across sites.
Keep in mind that replication of sbd's block devices does not necessarily mean that sbd poison-pill fencing via block-device will be able to successfully fence nodes across sites if there is a network split. Storage replication solutions often have some mechanism for detecting network splits between sites and choosing one side to continue having access to the storage. If that happens, one side may be cut off from writing, unable either to send fence commands or to reply to fence requests sent to it.
So, whether replication is available or not, it is recommended to have a fall-back method behind sbd poison-pill fencing via block-device. This fall-back method needs to be in addition to sbd fencing, such as an external fence mechanism (an ipmi device, fabric fencing, etc.). sbd stonith-watchdog-timeout fencing can NOT be utilized in this scenario, as combining the shared storage and the watchdog timeout can allow a false-positive survival of a node that should have rebooted itself, but didn't because it was still checking in with the poison-pill device.
How many block devices to use with sbd poison-pill fencing via block-device?: 1-3 devices are allowed. Sending/receiving messages must be able to succeed over a majority of devices - with 3 total, at least 2 must succeed; with 2, 2 must succeed; with 1, 1 must succeed.
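The majority rule above can be sketched as a small shell function. This is illustrative only - sbd makes this determination internally; the function name is hypothetical.

```shell
# Poison-pill message delivery needs a strict majority of the
# configured sbd devices: 1 of 1, 2 of 2, or 2 of 3.
majority_ok() {
  total=$1  # number of configured sbd devices
  ok=$2     # number of devices the message reached
  # strict majority: ok > total/2
  [ $((ok * 2)) -gt "$total" ]
}

majority_ok 3 2 && echo "3 devices, 2 reachable: fencing message succeeds"
majority_ok 2 1 || echo "2 devices, 1 reachable: fencing message fails"
```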
Simplified guidelines:
- If using a single storage target/array, just use a single block device from it. Multiple devices from one array do not substantially improve redundancy.
- If using two storage targets, use a device from each one on all nodes. Ideally then present a device from a third neutral site as a tie-breaker - perhaps an iSCSI device. If no third-site device is available, then just the two devices can be used.
- If using three or more storage targets, use a device from three of them on all nodes.
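A sketch of initializing sbd over three devices - one per array plus an iSCSI tie-breaker. The device paths are hypothetical, and the exact device option syntax can vary between pcs versions (see pcs(8)); treat this as a configuration fragment.

```
# pcs stonith sbd device setup device=/dev/disk/by-id/array1-lun \
#     device=/dev/disk/by-id/array2-lun \
#     device=/dev/disk/by-id/iscsi-tiebreaker
```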
To explain more completely, from the sbd(8) manpage:
One device
In its most simple implementation, you use one device only. This is appropriate for clusters where all
your data is on the same shared storage (with internal redundancy) anyway. Here it is obvious that the
SBD device does not introduce an additional single point of failure then.
With Pacemaker integration active (default) even a cluster with resources that can operate
independently from a shared storage doesn’t suffer from addition of a single point of failure
as a node won’t self-fence as long as it sees a quorate number of nodes (the other node in case
of a 2-node cluster).
If the SBD device is not accessible, the daemon will fail to start and inhibit
resource manager (Pacemaker) startup.
Two devices
This configuration is a trade-off, primarily aimed at environments where host-based mirroring is used,
but no third storage device is available.
SBD will not commit suicide if it loses access to one mirror leg; this allows the cluster to continue
to function even in the face of one outage.
However, SBD will not fence the other side while only one mirror leg is available, since it does not
have enough knowledge to detect an asymmetric split of the storage. So it will not be able to
automatically tolerate a second failure while one of the storage arrays is down. (Though you can use
the appropriate [pcs] command to acknowledge the fence manually.) With pacemaker integration active
(default) the benefit of this configuration might not be worth the additional effort.
It will not start unless both devices are accessible on boot.
Three devices
In this most reliable and recommended configuration, SBD will only self-fence if more than one device
is lost; hence, this configuration is resilient against temporary single device outages (be it due to
failures or maintenance). Fencing messages can still be successfully relayed if at least two devices
remain accessible.
This configuration is appropriate for more complex scenarios where storage is not confined to a single
array. For example, host-based mirroring solutions could have one SBD per mirror leg (not mirrored
itself), and an additional tie-breaker on iSCSI.
It will only start if at least two devices are accessible on boot.
Consider values for base sbd settings: All sbd deployments are affected by a few basic settings:
For details on available settings, see also: Exploring components: sbd and fence_sbd
The following settings can be tweaked at the time of enabling sbd through pcs stonith sbd enable - any settings not mentioned are not recommended for tweaking in typical setups:
- SBD_DELAY_START: Can be useful to set a delay on startup if the cluster needs extra time to stabilize and have storage devices become available. Clusters that need nodes to come into service quickly should leave this off or low, whereas clusters that are usually carefully started under administrator control can have more of a delay to prevent early hiccups.
- SBD_STARTMODE: A value of "always" (the default) allows a recently-fenced node to start sbd back up right away without administrator intervention - making it the ideal setting for most clusters. If an organization wishes to have more opportunity to inspect and investigate a node after fencing before it rejoins the cluster, a value of "clean" can allow for that.
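For example, enabling sbd with a start delay and the stricter start mode discussed above might look like the following. The values are illustrative, and the command must be run before the cluster is started; a configuration fragment.

```
# pcs stonith sbd enable SBD_DELAY_START=yes SBD_STARTMODE=clean
```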
Consider SBD_WATCHDOG_TIMEOUT for sbd health and quorum monitoring
Explanation: This setting controls the countdown timer that sbd sets in the watchdog timer device, and thus controls the length of time it will take a node to self-fence if it detects a health or quorum problem.
Relevant to: All sbd deployments
Default: 5s
Red Hat recommendation: Keep it at the default
How to configure: At sbd-setup time with
# pcs stonith sbd enable [...] SBD_WATCHDOG_TIMEOUT=<value in seconds> [...]
Configuration guidance: This is the primary factor defining how long it should take a node to be fenced via sbd methods, so consider what the cluster's recovery SLA or expectation is, and set this value within that time. There are trade-offs between high values and low values.
- High value
  - Benefits: Problems and failures have more time to work themselves out before the cluster reacts harshly by resetting nodes; there is less chance of unnecessary outages during temporary blips in connectivity or responsiveness.
  - Downsides: Legitimate failures that can't work themselves out quickly take longer to recover from; even failures that may correct themselves still result in inactivity by the cluster for longer than may be acceptable.
- Low value
  - Benefits: The cluster reacts more aggressively to failures, ensuring speedy recovery as soon as there is a problem.
  - Downsides: The cluster may react too quickly, taking harsh action when the issue might have sorted itself out with a bit more waiting.
  - Consideration: Red Hat cautions against setting this below the default 5 seconds. Doing so may make the cluster overly aggressive toward short blips and could result in instability during periods of high activity or brief unresponsiveness.
Note: When setting the value of SBD_WATCHDOG_TIMEOUT, you should take the following into consideration:
- If you have configured your system to run a quorum device and the value of SBD_WATCHDOG_TIMEOUT is less than the value of qdevice-sync_timeout, a quorum state update could be delayed for so long that it would result in a split-brain situation. The default value of qdevice-sync_timeout is 30s. Red Hat suggests a difference of 3-5 seconds between SBD_WATCHDOG_TIMEOUT and qdevice-sync_timeout, and that stonith-watchdog-timeout should be at least double the value of SBD_WATCHDOG_TIMEOUT (or zero, to indicate that watchdog fencing should not be used). As of RHEL 8.3, Pacemaker will not start up if the SBD_WATCHDOG_TIMEOUT value and the qdevice-sync_timeout value do not match, and if these mismatched parameters are configured on a running system, SBD will issue a reboot.
- With the following errata, validation is performed when setting these values. Previously, it was possible to set the stonith-watchdog-timeout property to a value that is incompatible with the SBD configuration. This could result in a fence loop, or could cause the cluster to consider a fencing action successful even though the action was not finished. With this fix, pcs validates the value of stonith-watchdog-timeout when you set it, to prevent incorrect configuration.
  - RHEL 8: The issue (Bugzilla bug 1954099) has been resolved with the errata RHSA-2022:7447 and the following package(s): pcs-0.10.14-5.el8, pcs-snmp-0.10.14-5.el8, or later.
  - RHEL 9: The issue (Bugzilla bug 2058246) has been resolved with the errata RHSA-2022:7935 and the following package(s): pcs-0.11.3-4.el9, pcs-snmp-0.11.3-4.el9, or later.
Consider msgwait timeout for sbd poison-pill fencing via block-device
Explanation: This setting controls how long fence_sbd's fence actions (by way of sbd message) will wait before considering the message it sent as being successful. In other words: if the node is still alive in some capacity, after this amount of time it will have had to see this message and fence itself, or will have had to fail reading from the device and fence itself.
Relevant to: Deployments with sbd poison-pill fencing via block-device
Default: 10s
Red Hat recommendation: 2x SBD_WATCHDOG_TIMEOUT - a recommendation of msgwait=10s if using the Red Hat-recommended SBD_WATCHDOG_TIMEOUT=5.
How to configure: At block-device-initialization time with
# pcs stonith sbd device setup msgwait-timeout=<value>
Configuration guidance: Setting to 2x the SBD_WATCHDOG_TIMEOUT ensures that a node should self-fence either because it detected its state in the cluster is problematic or because it received the message - offering an assurance of avoiding split-brain. If set lower than SBD_WATCHDOG_TIMEOUT - fencing could be declared successful prematurely if storage becomes unresponsive during a STONITH action, allowing a split-brain scenario.
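The relationship can be sanity-checked with a trivial helper. This is illustrative only - neither sbd nor pcs ships such a check, and the function name is hypothetical.

```shell
# msgwait should be at least 2x SBD_WATCHDOG_TIMEOUT, so that a node has
# either self-fenced (on quorum/health loss) or consumed the poison pill
# before the fence action is declared successful.
msgwait_ok() {
  watchdog_timeout=$1  # SBD_WATCHDOG_TIMEOUT, seconds
  msgwait=$2           # proposed msgwait-timeout, seconds
  [ "$msgwait" -ge $((watchdog_timeout * 2)) ]
}

msgwait_ok 5 10 && echo "SBD_WATCHDOG_TIMEOUT=5 msgwait=10: OK"
msgwait_ok 5 8  || echo "SBD_WATCHDOG_TIMEOUT=5 msgwait=8: too low, split-brain risk"
```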
Consider stonith-watchdog-timeout value for sbd stonith-watchdog-timeout fencing
Explanation: If set > 0, this setting enables sbd stonith-watchdog-timeout fencing. Enabling this measure instructs the cluster to rely on a "hidden" STONITH device whose purpose is to report successful fencing of a node after this length of time has passed since the node went missing. This works in conjunction with sbd health and quorum monitoring which should ensure that a missing node will have fenced itself within SBD_WATCHDOG_TIMEOUT seconds.
Since Pacemaker 2.0.0, setting this value to -1 means that stonith-watchdog-timeout is derived from SBD_WATCHDOG_TIMEOUT automatically. Red Hat does not recommend using this feature, as there is a danger of split-brain when SBD_WATCHDOG_TIMEOUT is set inconsistently across nodes.
Relevant to: Deployments with sbd stonith-watchdog-timeout fencing. Do not use with poison-pill fencing.
Default: 0 (disabled)
Red Hat recommendation: 2x SBD_WATCHDOG_TIMEOUT - recommendation of stonith-watchdog-timeout=10s if using Red Hat-recommended SBD_WATCHDOG_TIMEOUT=5.
How to configure: After sbd has already been enabled in the cluster with pcs stonith sbd enable:
# pcs property set stonith-watchdog-timeout=<value>
Configuration guidance: The stonith-watchdog-timeout cluster property should be higher than SBD_WATCHDOG_TIMEOUT, to ensure the cluster always waits long enough for a node to self-fence before the cluster resumes activity. A good rule of thumb is to set this property to twice the value of SBD_WATCHDOG_TIMEOUT - although if using a very high timeout, then doubling it may not be necessary.
NOTE: This is the amount of time that the cluster will be delayed in carrying out cluster-based activity when a node stops responding or leaves the membership. Make sure this is not set higher than any SLAs or expected response times of services in this cluster.
Consider fence_sbd pcmk_reboot_action timeout for sbd poison-pill fencing via block-device
Explanation: This is the STONITH-level timeout controlling how long a "reboot" action against this device will be awaited (reboot is the default action unless the cluster property stonith-action is set to "off").
Relevant to: Deployments with sbd poison-pill fencing via block-device - any time the msgwait attribute has been raised above its default.
Default: Pulls from the cluster-property default-action-timeout, which defaults to 20s.
Red Hat recommendation: Keep it above msgwait by 5 seconds
How to configure: In the fence_sbd STONITH device - Example:
# pcs stonith create sbd fence_sbd devices=<devices> pcmk_reboot_timeout=<value>
Configuration guidance: If msgwait is raised, it is likely out of a desire to have the cluster wait at least that long before giving up. So, this setting should be raised accordingly - above msgwait by 5 or more seconds - so that the cluster doesn't give up before msgwait has been exhausted.
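For example, with msgwait raised to 20s, the STONITH device's reboot timeout could be set to 25s to stay 5 seconds above it. The device path is hypothetical; a configuration fragment.

```
# pcs stonith sbd device setup device=/dev/disk/by-id/sbd-lun msgwait-timeout=20
# pcs stonith create sbd fence_sbd devices=/dev/disk/by-id/sbd-lun pcmk_reboot_timeout=25
```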
Consider fence_sbd power_wait timeout for sbd poison-pill fencing via block-device
Explanation: This setting controls how long fence_sbd will wait for a response from its sbd message instruction before giving up with a failure.
Relevant to: Deployment with sbd poison-pill fencing via block-device
Default: Matches sbd msgwait parameter (read directly from block device)
Red Hat recommendation: Keep it at the default
How to configure: In the fence_sbd STONITH device - Example:
# pcs stonith create sbd fence_sbd devices=<devices> power_wait=<value in seconds>
Configuration guidance: This typically does not need to be changed. The already-configured block device's msgwait setting will dictate how long the message reply will be awaited, and changing this power_wait value does not change that. If a value is set for power_wait, it only makes sense to have it higher than msgwait.
Consider making the implicit, hidden fencing device (created by setting cluster property stonith-watchdog-timeout > 0) a visible fencing device
Explanation: This implicit, hidden fencing device would fence all nodes in the cluster.
Relevant to: Clusters where not all nodes are fit for watchdog fencing (for example, there is no usable hardware watchdog) or where you do not want watchdog fencing to target all nodes. Clusters where watchdog fencing should be combined with other fencing methods.
Default: implicit, hidden and fencing all nodes
Red Hat recommendation: Keep the default unless one of the reasons above applies.
Configuration guidance: Set up a fencing device based on fence_watchdog and configure as any other fencing device. Consider setting the instance attribute pcmk_host_list to limit watchdog fencing to a certain list of nodes. Use in a fencing topology, for example to configure as fallback for other fencing devices that have failed.
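A sketch of making the watchdog device explicit and limiting it to specific nodes. The resource and node names are hypothetical; a configuration fragment.

```
# pcs stonith create watchdog-fence fence_watchdog pcmk_host_list="node1 node2"
# Optionally place it in a fencing topology as a fallback level:
# pcs stonith level add 2 node1 watchdog-fence
```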
Deployment and Administration Guidance
Deployment examples for enabling different levels of sbd fencing:
- Deployment example: Enabling sbd fencing in RHEL 7
- Deployment example: Enabling sbd fencing in RHEL 6