Is there a way to limit multipath failover times in order to avoid Oracle RAC cluster evictions?
Environment
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 8
- Oracle RAC
- Fibre Channel SAN storage
Issue
- Multipath takes too long to react during SAN failures, exceeding Oracle RAC cluster timeouts and triggering evictions.
- The voting disk timeout (disktimeout) is 200s.
- The network heartbeat timeout (css_misscount) is 30s.
- The SDTO (short disk timeout) is 27s.
- The SDTO is not publicly documented.
| Network Ping | Disk Ping | Reboot |
|---|---|---|
| Completes within misscount seconds | Completes within misscount seconds | N |
| Completes within misscount seconds | Takes more than misscount seconds but less than disktimeout seconds | N |
| Completes within misscount seconds | Takes more than disktimeout seconds | Y |
| Takes more than misscount seconds | Completes within misscount seconds | Y |
These messages show the SDTO:

```
[ CSSD][xxxxxxx]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
```

This indicates that more than the disk timeout of 27000 ms has passed since the last NHB (network heartbeat).
Resolution
Note: The following recommendations are provided as an initial configuration for tuning Red Hat Enterprise Linux components in combination with Oracle RAC clusters. The combination of settings documented below has been found to work in a number of cases; however, we advise testing these configurations, as fine-tuning may be required to match specific environments.
- The following settings in `/etc/multipath.conf` can limit the time required to detect and handle several types of path failures (mainly the ones detected by the multipath path checkers and the ones detected at the FC layer):

  ```
  defaults {
      user_friendly_names yes
      find_multipaths yes
      max_fds 8192
      checker_timeout 5
      polling_interval 5     # FN.1
      no_path_retry 2
      dev_loss_tmo 10        # FN.2 >= polling_interval * no_path_retry
      fast_io_fail_tmo 5
  }
  ```

  Footnotes:
  - FN.1: In environments with a large number (e.g. thousands) of LUNs and paths, `polling_interval` needs to be increased (e.g. to 10). When `polling_interval` is increased, the minimum value of `dev_loss_tmo` also changes. More details can be found in the next note.
  - FN.2: In recent versions of device-mapper-multipath, the minimum possible value of `dev_loss_tmo` is `no_path_retry * polling_interval`. This is described in more detail in the article: Unable to set custom 'dev_loss_tmo' value in RHEL7. The value of `dev_loss_tmo` is a trade-off between reducing the overhead of continuously monitoring failed paths and the load of re-detecting deleted paths once they re-appear. There are situations (usually when there is a very large number of paths and re-detecting deleted paths can cause a non-negligible load spike) in which it is better to increase `dev_loss_tmo` in order to avoid deleting and re-detecting paths. The `dev_loss_tmo` value should never be set equal to or greater than the device's `io_timeout` value. Testing will be required to determine the most suitable value.
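As a quick sanity check of the relationship described in FN.2, the minimum acceptable `dev_loss_tmo` for the values used in the example above can be computed directly. This is an illustrative sketch; the shell variables below are stand-ins for the `multipath.conf` options, not values read from the system:

```shell
# Illustrative sanity check: with the example multipath.conf values,
# dev_loss_tmo must be at least no_path_retry * polling_interval.
polling_interval=5
no_path_retry=2
min_dev_loss_tmo=$(( no_path_retry * polling_interval ))
echo "dev_loss_tmo must be >= ${min_dev_loss_tmo} seconds"
```

Note that with `polling_interval` raised to 10 (as suggested in FN.1 for large environments), the minimum becomes 20, so the example value of `dev_loss_tmo 10` would no longer be accepted.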
- After making changes in `multipath.conf`, reload the `multipathd` service to make the changes effective:

  ```
  # service multipathd reload
  ```
- If multipath is included in the initramfs, rebuilding the initramfs and rebooting with the new initramfs is also needed. The reboot is required to make sure that the changes are applied at boot time; for the currently running system, reloading the `multipathd` service is enough.

  ```
  # cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.$(date +%m-%d-%H%M%S).bak    <--- first create a backup of the existing initramfs
  # dracut -f -v                                                                                    <--- rebuild the initramfs
  ```
- The parameters can be verified using:

  - `fast_io_fail_tmo`:

    ```
    # for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
    ```

  - `dev_loss_tmo`:

    ```
    # for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
    ```
- More details on these parameters can be found in the multipath.conf(5) man page:

  ```
  # man multipath.conf
  ```
- The following settings limit the time required for the SCSI layer to time out and complete error handling:
  - SCSI command timeout: This can be changed using a udev rule. It is not recommended to set this lower than 3 seconds under any circumstances (5 for iSCSI), otherwise error handling can be triggered even when it is not necessary. The default value is 30 in recent Red Hat Enterprise Linux releases; a value of 20 can be a starting point. It is good to confirm with the storage vendor that the storage array can handle values lower than the default. This can be set by creating the file `/etc/udev/rules.d/90-scsitimeout.rules` (or modifying it if it already exists) with the following content:

    ```
    ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0|7|14", \
        RUN+="/bin/sh -c 'echo 20 > /sys$$DEVPATH/timeout'"
    ```
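After the rule is in place (and the devices have been re-added, or `udevadm trigger` has been run), the applied command timeout can be read back from sysfs. The loop below is a sketch; `SYSFS_ROOT` is a hypothetical variable used only so the logic can be exercised against a test tree — on a live system it is simply `/sys`:

```shell
# Print the current SCSI command timeout for every sd* block device.
# SYSFS_ROOT defaults to /sys on a live system; it is parameterized
# here purely for illustration and testing.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}
for t in "$SYSFS_ROOT"/block/sd*/device/timeout; do
    [ -e "$t" ] || continue
    echo "$t: $(cat "$t")"
done
```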
  - Error handling timeouts: These are controlled by two parameters, `eh_timeout` and `eh_deadline`. It is recommended to leave `eh_timeout` at the default value of 10. `eh_deadline` is not set by default; a starting value can be 15. It can be set using the following loop:

    ```
    # for i in $(ls /sys/class/fc_host/); do echo 15 > /sys/class/scsi_host/$i/eh_deadline; done
    ```

    Alternatively, a custom udev rule can be used (in order to have the setting applied during boot), as described in the article: How to set eh_deadline and eh_timeout persistently, using a udev rule
  - The change can be confirmed with the following loop (the number of hosts will vary in your environment):

    ```
    # for i in $(ls /sys/class/fc_host/); do echo $i; cat /sys/class/scsi_host/$i/eh_deadline; done
    host0
    15
    host1
    15
    ```
- More details on the timeouts mentioned in this section can be found in the article: Limiting path failover time for SCSI devices
- Tuning at the application layer:

  Note: The following settings are related to a 3rd party product (Oracle RAC). You will need to reach out to the 3rd party vendor (Oracle) for more details.
  - `css misscount` needs to be increased (e.g. to 90s). Oracle RAC defaults to a 30s misscount (network heartbeat timeout), which is too tight compared to the timeouts discussed in the multipath and SCSI sections above. When a server is in recovery and the Oracle binaries are on the SAN, the time multipath spends blocked in recovery can exceed the Oracle misscount. The network heartbeat daemons will not run while the filesystem is blocked, and this can cause evictions. Setting the Oracle misscount to 90s to align with the timeouts in the lower layers is a common setting.
  - `_asm_hbeatiowait` needs to be increased (e.g. to 120s). The default value of 15 can cause an ASM heartbeat timeout on an ASM disk during path failover or the SCSI error handling process, which would evict the cluster node before multipath is able to react.
  - Ensure that all cluster nodes run the same package versions.
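For reference, the misscount change discussed above is typically made with Oracle's own `crsctl` utility. This is an illustrative third-party command transcript only, under the assumption of a standard Grid Infrastructure deployment; verify the exact procedure for your Oracle version with Oracle before changing these values (`_asm_hbeatiowait` is a hidden ASM parameter that is set from within the ASM instance, not via `crsctl`):

```shell
# Illustrative only: query and raise the CSS misscount with Oracle's
# crsctl utility (run as a Grid Infrastructure administrator).
crsctl get css misscount
crsctl set css misscount 90
```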
- Also see the following for additional details on shortening failover time to surviving paths in a Fibre Channel environment:
  - How to set dev_loss_tmo and fast_io_fail_tmo persistently, using a udev rule
  - Multipath is not detecting path failures fast enough which results in application failure and system reboots
To lengthen failover timeouts in order to help prevent filesystems from entering read-only mode:
Root Cause
Default timeouts configured at the application layer are too low compared with the very relaxed default timeouts at the lower layers of the operating system. As a result, the application (Oracle RAC) evicts nodes before the OS completes its error handling procedures (sometimes even before error handling starts), while the OS is still waiting for its timeouts to expire before taking action.
Diagnostic Steps
- Such events frequently leave few traces in the logs of the evicted node, because the eviction happens too soon, before the OS detects the problem.
- If the underlying problem that caused the eviction also affected nodes that were not evicted, then logs on the nodes that remained online can provide additional details.
- In some cases the event is also detected by the evicted node, but the corresponding messages are still in memory at the moment of the eviction and are lost in the reboot. In such cases a serial console can be of help. If serial console logging needs to be enabled, it is recommended to set the serial console log level to 4; this avoids a known issue where excessive logging leads to a system hang. This can be done with:

  ```
  # dmesg -n 4
  ```

  Or by setting it persistently in grub.conf using the loglevel directive.
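A sketch of making the log level persistent, assuming a RHEL 7/8 layout (on RHEL 6, the `loglevel=4` parameter is instead appended to the `kernel` line in `/boot/grub/grub.conf`):

```shell
# /etc/default/grub (RHEL 7/8): add loglevel=4 to the kernel command line.
# The "..." stands for whatever parameters are already present.
GRUB_CMDLINE_LINUX="... loglevel=4"

# Then regenerate the grub configuration:
# grub2-mkconfig -o /boot/grub2/grub.cfg
```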
- Cluster logs can reveal why the cluster decided to evict the node.
- Reviewing the timeouts at the OS level as described in the "Resolution" section and comparing them to the timeouts in the application can reveal the mismatch.
- If none of the steps above provides any valuable information, it is possible to trigger a crash instead of power-cycling the node during a cluster eviction, as described in: How to capture kernel crash dump (vmcore) upon Oracle RAC 11gR2 or Oracle RAC 12 node eviction? and How to capture crash dump (vmcore) upon Oracle RAC 11g R1 (or earlier) node eviction?. The aim is to generate a vmcore in order to understand the system state at the moment of the eviction.
- Additional (performance related) tuning that can affect the functionality of the cluster and the database can be found in: Tuning Red Hat Enterprise Linux for Oracle and Oracle RAC performance.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.