What is the proper way to simulate a network failure on a RHEL Cluster?
Environment
- Red Hat Enterprise Linux Server 5, 6, 7, 8, or 9 (with the High Availability Add-On)
Issue
- What is the proper way to simulate a network failure on a RHEL Cluster?
- Is doing an `ifdown` and `ifup` a proper way to simulate that the network is down on the heartbeat network for a RHEL cluster node?
- Why didn't my service group fail over when I performed an `ifdown` on the heartbeat interface?
Resolution
There are a few ways to simulate a network failure. For all of the testing procedures below (or any others not documented here), there is one requirement for the procedure to be a valid method of testing a network interface failure: the network interface must always keep the UP and LOWER_UP flags set.
For example, here is output showing the different flags for a network device:
# ip link show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:3a:ce:bd brd ff:ff:ff:ff:ff:ff
The LOWER_UP flag refers to the physical layer link (also known as layer 1): if LOWER_UP is set, the Ethernet cable is plugged in and connected to a network (another connected device can be seen). The UP flag refers to the data link layer (also known as layer 2), which means the interface can send and receive Ethernet packets.
On virtual machines (KVM), LOWER_UP is always enabled, so it is an unreliable way to determine the actual state of the physical link. The only reliable testing methods are the ones outlined below for virtual and physical machines, which leave the network interfaces in a valid state for testing network faults.
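As a quick pre-check before running any fault-injection procedure, the flag list printed by `ip link show` can be parsed to confirm both flags are present. Below is a minimal sketch, not a supported tool: the `valid_for_test` helper name is illustrative, and `eth1` is an example interface.

```shell
# valid_for_test -- confirm an interface's flag list contains both UP and
# LOWER_UP, i.e. the interface is in a valid state for network fault testing.
# Takes the flag list from `ip link show`, e.g. "BROADCAST,MULTICAST,UP,LOWER_UP".
valid_for_test() {
  case ",$1," in
    *,UP,*) : ;;                  # data link layer is up
    *) return 1 ;;
  esac
  case ",$1," in
    *,LOWER_UP,*) return 0 ;;     # physical link is up
    *) return 1 ;;
  esac
}

# Example (eth1 is illustrative):
#   FLAGS=$(ip link show eth1 | sed -n 's/.*<\(.*\)>.*/\1/p')
#   valid_for_test "$FLAGS" && echo "eth1 is in a valid state for fault testing"
```

If either flag is missing, the interface is not in a valid state and any subsequent "failure" observed is not a meaningful test result.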
List of supported testing procedures to simulate a faulty network
In all methods to simulate a faulty network, if the cluster is using a [redundant heartbeat network or RRP](/solutions/61832) or [network bonding](/articles/3068841), then all network paths will need to be made faulty in order for the cluster node to lose communication with the other nodes.
All commands listed below must be run as the root user.
Physical machines only
- Pull the network cable that is used for cluster communication on one of the cluster nodes.
- Bring down the corresponding port on the switch.
- Use a firewall to block the ports used by `corosync`.
Virtual machines only
- On the virtual machine's physical host, disconnect the network interface of the virtual machine (that is a member of the cluster) from the bridge, so traffic on that interface goes nowhere. For example, run the following command with the correct arguments on the physical host:

# brctl delif <brname> <ifname>

- Use a firewall to block the ports used by `corosync`.
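The bridge detach/re-attach steps can be wrapped in small helper functions so the interface can be restored after the test. This is a hypothetical sketch, not part of the product: the `break_net`/`fix_net` names, the `DRYRUN` switch, and the `br0`/`vnet0` device names are all illustrative. It assumes the legacy `brctl` tool is installed on the physical host and that the commands are run there as root:

```shell
# bridge_fault.sh -- helpers to simulate a VM network failure by detaching
# the VM's interface from its bridge on the physical host (hypothetical names).
# Set DRYRUN=1 to print the brctl commands instead of executing them.
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

break_net() { run brctl delif "$1" "$2"; }   # traffic on the interface now goes nowhere
fix_net()   { run brctl addif "$1" "$2"; }   # re-attach the interface to restore traffic

# Example (as root on the VM's physical host):
#   . ./bridge_fault.sh
#   break_net br0 vnet0    # simulate the network failure
#   fix_net br0 vnet0      # restore connectivity
```

Because the guest's interface stays UP and LOWER_UP throughout, this method satisfies the flag requirement described above.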
Simulate a network failure with a firewall
Physical and Virtual machines on RHEL 5 and RHEL 6
Start `iptables` and use `iptables` to drop both incoming and outgoing traffic for the IP associated with heartbeat traffic for the particular cluster node. Below is an example script called `net_breaker.sh` that drops that traffic. *If you are using a [redundant heartbeat network or RRP](/solutions/61832), then this script will need to be run twice because there are two heartbeats.*
The script assumes that iptables is being used as the firewall solution and is running.
#!/bin/sh
# Usage:
#   ~/bin/net_breaker BreakCommCmd 192.168.1.101
#   ~/bin/net_breaker FixCommCmd 192.168.1.101
set -e
if [ "$1" = "BreakCommCmd" ]; then
    # Drop inbound traffic coming from the heartbeat IP.
    iptables -A INPUT -s "$2" -j DROP
    # Drop outbound traffic destined to the heartbeat IP (note -d, not -s).
    iptables -A OUTPUT -d "$2" -j DROP
    # Also drop inbound multicast, which corosync may use for heartbeat traffic.
    iptables -A INPUT -m pkttype --pkt-type multicast -j DROP
fi
if [ "$1" = "FixCommCmd" ]; then
    # Flush all rules, removing the DROP rules added above.
    iptables -F
fi
exit 0
Here is an example of how to run the script, where one of the rings used for corosync is 192.168.1.101.
# chmod 700 ~/bin/net_breaker
# ~/bin/net_breaker BreakCommCmd 192.168.1.101
Then to remove the rule run the following:
# ~/bin/net_breaker FixCommCmd 192.168.1.101
Physical and Virtual machines on RHEL 7 or later
For RHEL 7 or later use `firewalld` to simulate a network failure.
In order to effectively simulate a network block (and thereby a network failure), both of the following must be done (merely removing the high-availability service alone would produce sporadic results):
- Remove the `high-availability` service in `firewalld`.
- Add a drop rule for outbound and inbound traffic for any UDP streams where the cluster uses the same local port as remote port (`corosync` port: 5405).
Start `firewalld` if the service is not already running.
# systemctl start firewalld
# systemctl status firewalld
Note that the firewall-cmd commands below do not save the changes to a file; they modify only the running (in-memory) instance of firewalld. When the host reboots (or firewalld is restarted), firewalld will load the rules and services that were saved before the firewall-cmd commands below were run.
Cluster member nodes
Verify that all the cluster member nodes have joined the cluster.
# corosync-quorumtool
Get the active zone that is currently in use and verify that the interface used for corosync is included in that zone's list of interfaces.
# firewall-cmd --get-active-zones
Run the following command to drop the ports used for cluster communication:
# firewall-cmd --zone=<zone> --remove-service=high-availability
Add the rules to drop outbound and inbound traffic on port 5405 for corosync. The cluster node will lose the token and be fenced once firewalld starts dropping packets on port 5405:
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
# firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
To add the high-availability service back and verify it was added, run the following commands:
# firewall-cmd --zone=<zone> --add-service=high-availability
# firewall-cmd --zone=<zone> --list-all
To remove the rules above and verify they were removed, run the following commands:
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
# firewall-cmd --direct --get-all-rules
# firewall-cmd --remove-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
# firewall-cmd --list-rich-rules
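For convenience, the firewall-cmd steps above can be collected into a RHEL 7+ analogue of the `net_breaker.sh` script. This is a sketch under assumptions: the `break_comm`/`fix_comm` function names and the `DRYRUN` switch are illustrative, and the zone defaults to `public`, which must be replaced with your active zone. Run as root on the cluster node under test:

```shell
# fw_breaker.sh -- RHEL 7+ analogue of net_breaker.sh using firewall-cmd.
# ZONE must match the active zone on the node; set DRYRUN=1 to print the
# commands instead of executing them (useful for reviewing before a test).
ZONE=${ZONE:-public}
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

break_comm() {
  # Remove the high-availability service, then drop corosync's UDP port 5405
  # in both directions (direct rule for outbound, rich rule for inbound).
  run firewall-cmd --zone="$ZONE" --remove-service=high-availability
  run firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
  run firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
}

fix_comm() {
  # Reverse the three changes made by break_comm.
  run firewall-cmd --zone="$ZONE" --add-service=high-availability
  run firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
  run firewall-cmd --remove-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
}
```

As with the individual commands, none of these changes are permanent; a reboot or firewalld restart restores the saved configuration.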
Cluster Remote Nodes
Run the following command to drop the ports used for cluster communication:
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p tcp --sport=3121 -j DROP
To remove the rule above and verify it was removed, run the following commands:
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p tcp --sport=3121 -j DROP
# firewall-cmd --direct --get-all-rules
Example for Cluster Members
In this example the zone is called `public`. We do not make the rules permanent (save them), so the rules added below will not be present after the host reboots.
# firewall-cmd --get-active-zones
public
interfaces: ens6 ens3
# firewall-cmd --zone=public --remove-service=high-availability
success
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
success
# firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
success
Related Articles
- Support Policies for RHEL High Availability clusters - Transport Protocols
- Support Policies for RHEL High Availability Clusters - Cluster Interconnect Network Interfaces
- Support Policies for RHEL Resilient Storage - Cluster Network Interconnect Interfaces
- RHEL 7
- RHEL 8
- RHEL 9
Root Cause
The corosync service on Red Hat Enterprise Linux (RHEL) 6 is designed to run in an environment where interfaces can be taken out of service while messaging continues. When an interface state changes (up/down), virtual synchrony guarantees are occasionally violated for other nodes in the cluster, which could result in a non-operable cluster. More specifically, corosync rebinds to the localhost network so it can continue transmitting messages; it ignores the fact that the interface went down and continues to filter messages up and down the stack, while totem periodically attempts to rebind to the interface.
corosync behaves in this manner so that virtual synchrony is maintained (all cluster nodes that were connected to the ifdowned node would see that node as having disappeared), and the cluster node isolated with ifdown would behave as though the network were still operational on its local NIC, while being unable to see the other cluster nodes.
Using ifdown or ifup on a network interface that is used for cluster communication is not a valid way of testing how the cluster behaves in the event of a network failure (and unexpected results will occur using this method).
Openais in RHEL 5 Clustering behaves in a similar fashion.
For more information, see the article Corosync and ifdown on active network interface.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.