Token is lost and cman killed on multiple cluster nodes when a time jump occurs in RHEL 5
Environment
- Red Hat Enterprise Linux Server 5 (with the High Availability Add-On)
- openais prior to 0.80.6-28.el5_6.4 or prior to 0.80.6-30.el5_7.4
Issue
- The token was lost in the cluster, both nodes in the two-node cluster attempted to rejoin, and cman on both nodes was killed without either node being fenced.
- System messages show a token loss immediately after ntpd does a time reset.
- Why did the ntpd daemon cause a shutdown on a cluster built on Red Hat Enterprise Linux 5.6?
- Why does my node reboot when ntp causes a large time jump on the system?
Resolution
Update openais to one of the following package versions (or a later build for the same minor release):
- openais-0.80.6-28.el5_6.4 (Red Hat Enterprise Linux 5.6)
- openais-0.80.6-30.el5_7.4 (Red Hat Enterprise Linux 5.7)
To work around the issue instead, configure the systems so that the time source is stable and cannot jump by large amounts in either direction. If the issue is caused by the hardware clock being out of sync with the ntp time source, the following steps may provide some relief:
1. Synchronize the system clock to the hardware clock now:
# hwclock --systohc
2. Add the -x option for ntpd in /etc/sysconfig/ntpd. This causes ntpd to slew time differences smaller than 10 minutes rather than stepping the time. If the time difference is greater than 10 minutes, ntpd will still step the time.
Change this: OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid"
To this: OPTIONS="-u ntp:ntp -x -p /var/run/ntpd.pid"
3. Enable syncing of the hardware clock when ntpd starts. In /etc/sysconfig/ntpd:
Change this: SYNC_HWCLOCK=no
To this: SYNC_HWCLOCK=yes
4. For long-running systems, set up a cron job to sync the hardware clock to the ntp servers once a day. This is relevant because SYNC_HWCLOCK only syncs the hardware clock once, when ntpd starts. Implementing the following cron job will prevent subsequent hardware clock drift from becoming an issue.
Open the crontab editor for the root user:
# crontab -u root -e
Add the following entry, which will sync the hardware clock with the system clock at 00:01 every day.
01 0 * * * /sbin/hwclock --systohc
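The configuration edits in steps 2 and 3 can be sketched as a small script. This is a hypothetical example only: it operates on a temporary copy of the file (seeded with the stock option lines shown above) rather than on /etc/sysconfig/ntpd itself, so nothing on the system is changed.

```shell
# Hypothetical sketch of the /etc/sysconfig/ntpd edits from steps 2 and 3,
# applied to a temporary copy so the real system file is left untouched.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid"
SYNC_HWCLOCK=no
EOF

# Step 2: add -x so ntpd slews (rather than steps) offsets under 10 minutes
sed -i 's|^OPTIONS="-u ntp:ntp |OPTIONS="-u ntp:ntp -x |' "$cfg"
# Step 3: sync the hardware clock to the system clock when ntpd starts
sed -i 's|^SYNC_HWCLOCK=no$|SYNC_HWCLOCK=yes|' "$cfg"

cat "$cfg"
rm -f "$cfg"
```

To apply the edits for real, point the sed commands at /etc/sysconfig/ntpd and restart ntpd afterwards.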
Root Cause
OpenAIS in RHEL 5 uses timing routines that are based on the system time. Sudden, unexpected time jumps (of even only a couple of seconds) can throw off OpenAIS's time-dependent routines. Possible sources of such jumps include incorrect NTP configurations or time drift on virtualized systems. Because much of the cluster infrastructure is quite time-sensitive, these timing inconsistencies can result in nodes leaving the cluster and failing to fence. RHEL 6 introduced monotonic timers that do not suffer this instability; the code fix for this issue back-ports the RHEL 6 monotonic timers to OpenAIS in RHEL 5.
A time jump occurring on one of the two nodes can cause a token loss in which the configuration changes but no fencing action is carried out. When the nodes enter the GATHER state (sending their join messages out to the rest of the cluster and forming consensus about the membership), they both see each other and attempt to rejoin the cluster. However, each node recognizes that the other has been removed from the cluster and still has its dirty flag set, and therefore cannot be allowed back in. As a result, the two nodes simultaneously kill each other. With no nodes left in the cluster, no fencing can take place, and both nodes remain in this "dead" state until an administrator intervenes and restarts them by rebooting.
It is difficult to define an exact minimum jump size that will trigger this behavior, but in general, any jump near the totem token timeout should be cause for concern. Any time jump that makes the last received token appear to have occurred more than the token timeout (in milliseconds) ago can lead the cluster to conclude that a membership change has occurred. Because the last token may have been received part-way through the timeout interval, even jumps smaller than the token timeout can trigger the issue; if the system is vulnerable to such jumps, the above-mentioned corrective measures should be taken.
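For reference, the token timeout discussed above is the token value (in milliseconds) of the totem element in /etc/cluster/cluster.conf. The fragment below is illustrative only; the value shown is an example, not a recommendation for any particular cluster:

```xml
<!-- Illustrative fragment only: a 10-second totem token timeout -->
<totem token="10000"/>
```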
Diagnostic Steps
The logs on both nodes will show several key events:
- Token is lost
- Membership changes to have the other node in the "left" list
- Membership changes to have the other node in the "joined" list
- Declaration that the other node has rejoined with a restart, and must be killed (may not be seen on both nodes)
- Declaration that the other node has killed us for joining without a restart (may not be seen on both nodes)
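The key events above can be located with a single search of the system log. The sketch below runs against an inline sample (taken from the messages shown later in this article) rather than a real log; on an actual system, point the grep at /var/log/messages (the path may differ depending on syslog configuration):

```shell
# Hypothetical sketch: search a sample system log for the key diagnostic
# events described above. Replace the here-document with /var/log/messages
# on a real system.
log=$(mktemp)
cat > "$log" <<'EOF'
Aug  9 07:50:29 node1 ntpd[11594]: time reset +23.999497 s
Aug  9 07:50:29 node1 openais[27597]: [totemsrp.c:1441] The token was lost in the OPERATIONAL state.
Aug  9 07:50:32 node1 openais[27597]: [logging.c:0042] cman killed by node 1 because we rejoined the cluster without a full restart
EOF

grep -E 'time reset|token was lost|Members (Left|Joined)|rejoined the cluster' "$log"
rm -f "$log"
```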
One known trigger for the issue looks like this (ntpd "time reset"):
Aug 9 07:50:29 node1 ntpd[11594]: time reset +23.999497 s
The full logs from the issue may look similar to the following:
Aug 9 07:50:29 node1 ntpd[11594]: time reset +23.999497 s
Aug 9 07:50:29 node1 openais[27597]: [totemsrp.c:1441] The token was lost in the OPERATIONAL state.
Aug 9 07:50:29 node1 openais[27597]: [totemnet.c:1005] Receive multicast socket recv buffer size (320000 bytes).
Aug 9 07:50:29 node1 openais[27597]: [totemnet.c:1011] Transmit multicast socket send buffer size (320000 bytes).
Aug 9 07:50:29 node1 openais[27597]: [totemsrp.c:1765] entering GATHER state from 2.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1765] entering GATHER state from 0.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:2813] Creating commit token because I am the rep.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:2979] Storing new sequence id for ring 37c
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1804] entering COMMIT state.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1833] entering RECOVERY state.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1864] position [0] member 10.0.0.52:
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1868] previous ring seq 888 rep 10.0.0.51
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1874] aru 255 high delivered 255 received flag 1
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1973] Did not need to originate any messages in recovery.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:4128] Sending initial ORF token
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0513] CLM CONFIGURATION CHANGE
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0514] New Configuration:
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0516] r(0) ip(10.0.0.52)
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0518] Members Left:
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0520] r(0) ip(10.0.0.51)
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0523] Members Joined:
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0513] CLM CONFIGURATION CHANGE
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0514] New Configuration:
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0516] r(0) ip(10.0.0.52)
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0518] Members Left:
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0523] Members Joined:
Aug 9 07:50:31 node1 openais[27597]: [sync.c:0321] This node is within the primary component and will provide service.
Aug 9 07:50:31 node1 openais[27597]: [totemsrp.c:1710] entering OPERATIONAL state.
Aug 9 07:50:31 node1 openais[27597]: [clm.c:0595] got nodejoin message 10.0.0.52
Aug 9 07:50:31 node1 kernel: dlm: closing connection to node 1
Aug 9 07:50:31 node1 openais[27597]: [cpg.c:0822] got joinlist message from node 2
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1765] entering GATHER state from 11.
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:2979] Storing new sequence id for ring 380
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1804] entering COMMIT state.
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1833] entering RECOVERY state.
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1864] position [0] member 10.0.0.51:
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1868] previous ring seq 892 rep 10.0.0.51
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1874] aru 13 high delivered 13 received flag 1
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1864] position [1] member 10.0.0.52:
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1868] previous ring seq 892 rep 10.0.0.52
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1874] aru 13 high delivered 13 received flag 1
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1973] Did not need to originate any messages in recovery.
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0513] CLM CONFIGURATION CHANGE
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0514] New Configuration:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0516] r(0) ip(10.0.0.52)
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0518] Members Left:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0523] Members Joined:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0513] CLM CONFIGURATION CHANGE
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0514] New Configuration:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0516] r(0) ip(10.0.0.51)
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0516] r(0) ip(10.0.0.52)
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0518] Members Left:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0523] Members Joined:
Aug 9 07:50:32 node1 openais[27597]: [clm.c:0525] r(0) ip(10.0.0.51)
Aug 9 07:50:32 node1 openais[27597]: [sync.c:0321] This node is within the primary component and will provide service.
Aug 9 07:50:32 node1 openais[27597]: [totemsrp.c:1710] entering OPERATIONAL state.
Aug 9 07:50:32 node1 openais[27597]: [commands.c:1769] Killing node internal_node1 because it has rejoined the cluster with existing state
Aug 9 07:50:32 node1 openais[27597]: [logging.c:0042] cman killed by node 1 because we rejoined the cluster without a full restart
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.