Administrative Procedures for RHEL High Availability Clusters - Acknowledging Manual Fencing in a RHEL 6 cman Cluster



Overview

Applicable Environments

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
  • pacemaker is not in use

Situations Where This Procedure May Be Useful

  • A node has experienced a problem, the cluster is attempting to fence it, but those fence attempts are failing for some reason. Manual intervention is needed to restore the cluster to operational status.
  • A node's hardware or power supply has failed entirely, making its fence device inaccessible as well. Other nodes need to continue providing service.
  • A cluster without shared block storage is configured with fence_kdump or another "alternative" agent, a node has stopped responding, and that agent has failed to carry out fencing. The remaining nodes in the cluster need to resume activity.

What This Procedure Accomplishes

This procedure guides an administrator through completely powering off a problematic node - a necessary step if fencing cannot be completed automatically - and confirming to the cluster that the node has been manually fenced. This acknowledgement overrides any pending fence action for that node, allowing normal activity to resume on the remaining nodes of the cluster.

Procedure: Acknowledging Manual Fencing

Consideration: Risk of Data Corruption

WARNING: Acknowledging that a manual fence has taken place when that node is still powered on and operational can lead to data corruption. This procedure must always include powering off a node completely before acknowledging manual fencing.

Under normal circumstances, with all nodes of a cluster active and healthy, those nodes coordinate amongst themselves to decide where to place resources and when to carry out actions. Safeguards are built into the cluster software and resource agents at multiple levels to ensure that resources are only activated on the allowed number of nodes at a time, and many configurations include protections against storage devices being activated by multiple nodes at once - the lvm resource agent in particular.

When nodes lose contact with each other, they can no longer coordinate in this manner. The nodes that should continue operating need some method to ensure any node that is not still a member cannot continue to use shared resources while they proceed to use them. The fence device is intended to be that mechanism, and typical configurations achieve this through powering off a node or cutting off its access to storage. If such a method has failed, is not available, or was not configured to begin with, then manual fencing is a remaining option.

If an administrator acknowledges manual fencing to the cluster without the node actually having been powered off, then that "fenced" node may still be using resources such as shared filesystems or data stores. The act of acknowledging fencing allows other cluster members to begin accessing those same filesystems or data stores, potentially corrupting them. As such, acknowledgement should only be performed after confirming that the node in question has been fully powered off.

Consideration: Risk of Resource Conflicts or Collisions

Even without shared storage, risks still exist if manual fencing is acknowledged without the node actually being powered off. IP addresses managed by the cluster may be activated on both the "fenced" node and another node, preventing clients or other applications from reaching services hosted by the remaining members of the cluster. Data replication between cluster nodes may continue and incur conflicts or corruption as those nodes operate without coordination. Applications may introduce conflicts of their own by being active in two places at the same time.

So again, manual fence acknowledgement should only ever be performed after the node in question has been fully powered off and confirmed to be powered off.

Task: POWER OFF Problematic Node

Access the power source for the node in question and disable it.

NOTE: Do not rely on "soft" reboots: a node that has fallen out of the cluster may fail to stop using resources or may not be able to shut down properly, leaving the possibility that it is still using shared resources. A hard power-off, performed from outside the host operating system itself, is best.
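As an illustration, the hard power-off can be performed through a baseboard management controller. This is only a sketch, assuming ipmitool is installed and the node has a reachable BMC; the hostname and user below are hypothetical, and DRY_RUN=1 (the default here) prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: hard power-off over IPMI. BMC_HOST and BMC_USER are
# hypothetical placeholders; adjust for the actual environment.
BMC_HOST="${BMC_HOST:-bmc-node1.example.com}"   # hypothetical BMC hostname
BMC_USER="${BMC_USER:-admin}"                   # hypothetical BMC user
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print, do not execute

hard_power_off() {
    # "chassis power off" is an immediate hard off, not a graceful OS shutdown
    cmd="ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER chassis power off"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
}

hard_power_off
```

Running as-is only prints the command; set DRY_RUN=0 (and supply credentials) to actually cut power.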

Task: Confirm Powered-Off Status of Problematic Node

If possible, use access to the server's console or power source to confirm that the server really is completely powered off. If it has an integrated management interface such as an iLO, DRAC, RSA, IMM, UCS, or similar, use it to check the power state and console. If a physical monitor is available, check it to confirm the system is off. If the server is fed by an external power controller or switched PDU, log into its interface and verify that the status shows the outlet as off. If it is a virtual server, use the VM viewer to confirm the power status and console output.
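For the virtual-server case, a minimal sketch of checking power state from the hypervisor, assuming a libvirt-managed guest; the domain name passed in is hypothetical:

```shell
#!/bin/sh
# Sketch: confirm a libvirt guest is fully off before acknowledging fencing.
is_domain_off() {
    # virsh domstate prints "shut off" for a powered-off guest; any other
    # state - or a missing domain or virsh binary - is treated as
    # "not confirmed off", which errs on the safe side
    state=$(virsh domstate "$1" 2>/dev/null)
    [ "$state" = "shut off" ]
}

# usage, on the hypervisor:
#   is_domain_off node1 && echo "confirmed off"
```

The check deliberately fails closed: anything other than a positive "shut off" answer means the node is not confirmed off and fencing must not be acknowledged.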

Task: Acknowledge Manual Fencing to Remaining Cluster Nodes

On one of the remaining nodes in the cluster, acknowledge manual fencing after the server is confirmed as powered off.

# Syntax:
# fence_ack_manual -n <nodename>

# Example:
# fence_ack_manual -n node1.example.com
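Because the acknowledgement is irreversible and dangerous if the node is still up, some administrators wrap it in a confirmation step. The following is a hypothetical guard wrapper, not a Red Hat tool: it refuses to run unless the node name is typed a second time, and with DRY_RUN=1 (the default here) it prints the command rather than executing it.

```shell
#!/bin/sh
# Hypothetical guard around fence_ack_manual: require the node name to be
# re-typed as confirmation that its power-off has been verified.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command instead of running it

ack_fence() {
    node="$1"; confirm="$2"
    if [ -z "$node" ] || [ "$confirm" != "$node" ]; then
        echo "refusing: re-type the node name to confirm it is powered off" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: fence_ack_manual -n $node"
    else
        fence_ack_manual -n "$node"
    fi
}

# usage: ack_fence node1.example.com node1.example.com
```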

Task: Resume Operation on Remaining Cluster Nodes

After fencing has been confirmed complete for all lost members that could not be fenced automatically, cluster operation should resume automatically. Confirm that resources are in the intended state and take any desired follow-up action.
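The state check is typically done with the standard cman and rgmanager tools. Since these need a live cluster, the sketch below only prints the commands an administrator would run on a surviving node, with a note on what each reports:

```shell
#!/bin/sh
# Sketch: verification commands to run on a surviving node after the
# acknowledgement. Printed rather than executed, as they require a
# running cman cluster.
post_fence_checks() {
    echo "cman_tool nodes    # membership and state of each node"
    echo "cman_tool status   # quorum and overall cluster summary"
    echo "clustat            # rgmanager service and member states"
}

post_fence_checks
```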

Task: Restore Fenced Node to Service

Once the cluster is operational again, investigation into the state of the fenced node can proceed, and/or efforts to bring it back online can move forward.
