rgmanager blocks or is unable to manage services when using Redundant Ring Protocol (RRP) in a RHEL 6 Update 3 or earlier High Availability cluster
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add On
- cman, clusterlib releases starting with 3.0.12.1-23.el6
- rgmanager releases prior to 3.0.12.1-17.el6
  - See this solution for a similar issue on later releases of rgmanager
- Cluster configured to use Redundant Ring Protocol (RRP)
  - <altname/> for each node in /etc/cluster/cluster.conf
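To confirm that a cluster matches this environment, one option is to check which nodes in cluster.conf carry an altname element. The following is a minimal Python sketch, assuming the usual layout in which <altname> is a direct child of each <clusternode>. The function name nodes_missing_altname and the sample node names are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; on a real node the file is /etc/cluster/cluster.conf.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <altname name="node1-alt"/>
    </clusternode>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>
"""


def nodes_missing_altname(conf_text):
    """Return the names of clusternode elements that have no altname child."""
    root = ET.fromstring(conf_text)
    return [
        node.get("name")
        for node in root.iter("clusternode")
        if node.find("altname") is None
    ]


if __name__ == "__main__":
    # An empty list means every node defines a second ring address.
    print(nodes_missing_altname(SAMPLE_CONF))
```

To run this against a live node, read the contents of /etc/cluster/cluster.conf and pass them to the function in place of the sample string.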
Issue
- When I start rgmanager it just hangs:
May 15 03:54:51 node1 kernel: INFO: task rgmanager:2420 blocked for more than 120 seconds.
May 15 03:54:51 node1 kernel: Not tainted 2.6.32-431.3.1.el6.x86_64 #1
May 15 03:54:51 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 15 03:54:51 node1 kernel: rgmanager D 0000000000000002 0 2420 2418 0x00000000
May 15 03:54:51 node1 kernel: ffff880336a93c98 0000000000000082 0000000000000000 ffff880336a93c5c
May 15 03:54:51 node1 kernel: ffff880300000000 ffff88033fc23480 ffff880028296840 0000000000000200
May 15 03:54:51 node1 kernel: ffff8803356425f8 ffff880336a93fd8 000000000000fbc8 ffff8803356425f8
May 15 03:54:51 node1 kernel: Call Trace:
May 15 03:54:51 node1 kernel: [<ffffffff815287c5>] schedule_timeout+0x215/0x2e0
May 15 03:54:51 node1 kernel: [<ffffffff81527920>] ? thread_return+0x4e/0x76e
May 15 03:54:51 node1 kernel: [<ffffffff81285392>] ? kobject_uevent_env+0x202/0x620
May 15 03:54:51 node1 kernel: [<ffffffff81528443>] wait_for_common+0x123/0x180
May 15 03:54:51 node1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
May 15 03:54:51 node1 kernel: [<ffffffff8152855d>] wait_for_completion+0x1d/0x20
May 15 03:54:51 node1 kernel: [<ffffffffa022ef79>] dlm_new_lockspace+0x999/0xa30 [dlm]
May 15 03:54:51 node1 kernel: [<ffffffffa0236ff1>] device_write+0x311/0x720 [dlm]
May 15 03:54:51 node1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
May 15 03:54:51 node1 kernel: [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
May 15 03:54:51 node1 kernel: [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
May 15 03:54:51 node1 kernel: [<ffffffff81189881>] sys_write+0x51/0x90
May 15 03:54:51 node1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
- I can't manage any services with clusvcadm or Conga
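The hang shown in the log above has a recognizable signature: a hung-task warning for rgmanager followed by a call trace that passes through dlm_new_lockspace. As a rough aid for scanning /var/log/messages, the sketch below looks for that pair. It is a simplified heuristic, not a diagnostic tool; the function name shows_dlm_lockspace_hang is hypothetical, and the sample lines are abbreviated from the log in this article.

```python
import re

# Matches the kernel hung-task warning for an rgmanager process.
HUNG_TASK = re.compile(
    r"INFO: task rgmanager:\d+ blocked for more than \d+ seconds"
)

# Abbreviated from the log excerpt in this article.
SAMPLE_LOG = [
    "May 15 03:54:51 node1 kernel: INFO: task rgmanager:2420 blocked for more than 120 seconds.",
    "May 15 03:54:51 node1 kernel: Call Trace:",
    "May 15 03:54:51 node1 kernel: [<ffffffff8152855d>] wait_for_completion+0x1d/0x20",
    "May 15 03:54:51 node1 kernel: [<ffffffffa022ef79>] dlm_new_lockspace+0x999/0xa30 [dlm]",
]


def shows_dlm_lockspace_hang(log_lines):
    """Return True if an rgmanager hung-task warning is later followed by
    a line mentioning dlm_new_lockspace (simplified: it does not check
    that both lines belong to the same call trace)."""
    seen_hung_task = False
    for line in log_lines:
        if HUNG_TASK.search(line):
            seen_hung_task = True
        elif seen_hung_task and "dlm_new_lockspace" in line:
            return True
    return False


if __name__ == "__main__":
    print(shows_dlm_lockspace_hang(SAMPLE_LOG))
```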
Resolution
Update to cman and clusterlib release 3.0.12.1-49.el6 or later, and to rgmanager release 3.0.12.1-17.el6 or later.
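To check installed packages against these minimum releases, the version-release strings (for example, from rpm -q --qf '%{VERSION}-%{RELEASE}\n' cman) can be compared numerically. The sketch below is a simplification under stated assumptions: it compares only the dotted version fields and the leading number of the release, and it does not reimplement RPM's full version comparison. The function names and the installed versions in the usage example are hypothetical.

```python
import re

# Minimum releases named in the Resolution section.
FIXED = {
    "cman": "3.0.12.1-49.el6",
    "clusterlib": "3.0.12.1-49.el6",
    "rgmanager": "3.0.12.1-17.el6",
}


def parse_version_release(version_release):
    """Turn 'V-R' into a tuple of integers: the dotted version fields
    followed by the leading number of the release (0 if there is none)."""
    version, _, release = version_release.partition("-")
    numbers = [int(part) for part in version.split(".")]
    match = re.match(r"\d+", release)
    numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def is_at_least(installed, required):
    """Simplified comparison; not RPM's full algorithm."""
    return parse_version_release(installed) >= parse_version_release(required)


def packages_needing_update(installed):
    """Return the sorted names of installed packages older than FIXED."""
    return sorted(
        name
        for name, required in FIXED.items()
        if name in installed and not is_at_least(installed[name], required)
    )


if __name__ == "__main__":
    # Hypothetical installed versions for illustration only.
    example = {
        "cman": "3.0.12.1-23.el6",
        "clusterlib": "3.0.12.1-49.el6",
        "rgmanager": "3.0.12.1-12.el6",
    }
    print(packages_needing_update(example))
```

Packages absent from the input mapping are skipped rather than reported, since this sketch cannot tell whether they are expected to be installed.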
Root Cause
cman/clusterlib added support for Redundant Ring Protocol (RRP) in RHEL 6 Update 2 as a Technology Preview. However, it was known to have several issues and was not considered complete and supported until RHEL 6 Update 4. Similarly, rgmanager did not add full support for the RRP use case until RHEL 6 Update 4. To support RRP, rgmanager needed an alternative to DLM, which cpglockd provides. Because cpglockd was not shipped until the RHEL 6 Update 4 release of rgmanager, earlier releases of rgmanager still attempt to create a DLM lockspace on startup, which can hang with RRP.
Updating to RHEL 6 Update 4 provides cpglockd, which rgmanager starts automatically in RRP configurations and uses for locking instead of DLM.