Regaining a Quorum in a Galera Cluster in OpenStack on Red Hat Enterprise Linux OpenStack Platform 5 to 9
Galera maintains its own quorum separate from pacemaker. There may be instances where you lose the Galera quorum, such as when rebooting your entire cluster. You have to manually restart services to regain the quorum. This is also called bootstrapping Galera.
Note: in RHEL OSP 6 (and later) the galera resource agent automatically detects loss of quorum, and recovers by applying the bootstrapping procedure below. However, this may fail if it cannot get information from all the galera nodes. In that case the cluster can still be recovered manually if the most recent node can be determined.
Overview
Here is a high-level outline of the steps to re-establish the Galera quorum.
- Determine the loss of quorum.
- Determine systems with last activity.
- Temporarily stop pacemaker control of the database.
- Start the database on the first node.
- Start the database on remaining nodes.
- Re-enable pacemaker control of the database.
- Switch to the pacemaker-controlled database on each node.
Prerequisite
- pacemaker-1.1.13 or later is required to run the crm_resource --force-promote -r galera -V command used in this article. With pacemaker-1.1.12 the command fails with: crm_resource: unrecognized option '--force-promote'
- If you have pacemaker-1.1.12, follow the article "How to recover pacemaker controlled galera-master manually without systemd/pacemaker when galera cluster was in read only mode" instead.
Detailed Procedures
- Determine loss of quorum.

  - A good first sign is that pacemaker does not have the mysqld-clone service running on a majority of nodes. Invoke the following on any node.

        [root@server3 ~]# pcs status | grep -A 2 mysql
         Clone Set: mysqld-clone [mysqld]
             Started: [ server1.example.com ]
             Stopped: [ server2.example.com server3.example.com ]
        ...
        mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
        mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms
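If you want to script this check rather than eyeball the output, the "Stopped:" line can be parsed. A minimal sketch, using a captured excerpt of the pcs status output above as sample input (server names are examples):

```shell
# Sketch: count the nodes on which pacemaker reports mysqld stopped,
# using a captured "pcs status" excerpt as sample input.
status='Clone Set: mysqld-clone [mysqld]
     Started: [ server1.example.com ]
     Stopped: [ server2.example.com server3.example.com ]'

stopped=$(printf '%s\n' "$status" |
    awk -F'[][]' '/Stopped:/ {print $2}' |   # keep the text between [ ]
    wc -w | tr -d ' ')
echo "mysqld stopped on $stopped node(s)"
```

With the sample above this reports mysqld stopped on 2 nodes, which with a three-node cluster means quorum is lost.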
  - Check the /var/log/mariadb/mariadb.log on each system for errors. (Note: if under control of the Pacemaker Cluster Resource Manager, use the /var/log/mysqld.log file instead.)

        140929 11:25:40 [ERROR] WSREP: Local state seqno (1399488) is greater than group seqno (10068): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
        140929 11:25:40 [ERROR] Aborting
  - The clustercheck command reports whether a node is in sync.

        [root@server2 ~]# clustercheck
        HTTP/1.1 503 Service Unavailable
        Content-Type: text/plain
        Connection: close
        Content-Length: 36

        Galera cluster node is not synced.
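Because clustercheck answers with an HTTP status line, scripts can key off the status code instead of the message text. A minimal sketch, using the 503 response above as sample data:

```shell
# Sketch: derive sync state from a clustercheck-style HTTP status line;
# the 503 response here is sample data copied from the output above.
response='HTTP/1.1 503 Service Unavailable'

code=$(printf '%s\n' "$response" | awk 'NR == 1 {print $2}')
if [ "$code" = "200" ]; then
    echo 'node is synced'
else
    echo "node is not synced (HTTP $code)"
fi
```

In practice you would pipe the live clustercheck output into the same parsing, e.g. from a load balancer health check.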
Note: From OSP 6 onwards, when pacemaker's resource agent notices that a galera node has lost quorum, it automatically stops it. Depending on how the resource is configured [1], you may need to clean up the failure for pacemaker to restart the galera cluster and regain quorum automatically.
- Determine systems with last activity.

  - Determine which system or systems have the highest valid sequence number for the latest UUID.
  - Orderly shutdown. If the cluster shut down correctly, /var/lib/mysql/grastate.dat will have a positive number for the seqno. Note which system or systems have the greatest seqno. However, if any system has a -1 value, that indicates the shutdown was not clean and another method is needed to determine the seqno.

        [root@server2 ~]# cat /var/lib/mysql/grastate.dat
        # GALERA saved state
        version: 2.1
        uuid:    b048715d-4369-11e4-b7ef-af1999a6c989
        seqno:   -1
        cert_index:
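Reading grastate.dat can itself be scripted. A sketch that extracts the seqno and flags the unclean case, using the sample file contents shown above as input:

```shell
# Sketch: read the seqno from grastate.dat-style content; -1 marks an
# unclean shutdown, so the seqno must then come from the logs instead.
state='# GALERA saved state
version: 2.1
uuid:    b048715d-4369-11e4-b7ef-af1999a6c989
seqno:   -1
cert_index:'

seqno=$(printf '%s\n' "$state" | awk '$1 == "seqno:" {print $2}')
if [ "$seqno" = "-1" ]; then
    echo 'unclean shutdown: determine the seqno from the logs'
else
    echo "clean shutdown, seqno $seqno"
fi
```

On a real node you would read /var/lib/mysql/grastate.dat instead of the embedded sample text.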
  - Disorderly shutdown. The seqno is in /var/log/mariadb/mariadb.log on RHOSP 5, or /var/log/mysqld.log on RHOSP 6 and later. Search for lines with "Found saved state", ignoring any -1 values. The last value on each line is in the form UUID:seqno.

        [root@server1 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state" | grep -v ":-1"
        140923 17:49:19 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2229
        140924 15:37:13 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2248
        140929 11:24:26 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:10060

        [root@server2 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state" | grep -v ":-1"
        140926 14:58:16 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:171535
        140929 11:24:28 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:1399488

        [root@server3 ~]# tail -n 2000 /var/log/mariadb/mariadb.log | grep "Found saved state" | grep -v ":-1"
        140923 17:36:57 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:36
        140923 17:43:18 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:785
  - Notice all servers have the same UUID (b048715d-4369-11e4-b7ef-af1999a6c989), but server2 has the largest seqno (1399488).
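Selecting the highest seqno from such log lines can be automated. A sketch, using the server2 excerpts above as sample data (on a real node you would feed it the tail of the log file):

```shell
# Sketch: select the highest seqno from WSREP "Found saved state" lines,
# skipping -1 entries; the sample lines are server2's excerpts from above.
log='140926 14:58:16 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:171535
140929 11:24:28 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:1399488'

best=$(printf '%s\n' "$log" |
    grep 'Found saved state' | grep -v ':-1$' |
    awk -F: '{print $NF}' |   # the seqno follows the last colon
    sort -n | tail -n 1)
echo "highest seqno: $best"
```

Running this per node and comparing the results identifies the bootstrap candidate; here it prints 1399488.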
  - One can also determine this seqno from pacemaker. If pacemaker previously tried to restart the cluster, it can be retrieved from the CIB, e.g. for node1:

        [root@node1 ~]# crm_attribute -N node1 -l reboot --name galera-last-committed -Q
  - If the last seqno is not present in the CIB, it can be retrieved with MariaDB:

        [root@node1 ~]# mysqld_safe --wsrep-recover
        151002 13:59:35 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
        151002 13:59:35 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
        151002 13:59:35 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.2FkYLQ' --pid-file='/var/lib/mysql/db1-recover.pid'
        151002 13:59:50 mysqld_safe WSREP: Recovered position 4c7ba2a8-566a-11e5-8250-1e939ac17c77:9
        151002 13:59:52 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended

  From OSP 6 onwards, pacemaker's resource agent automates these steps so that it can bootstrap the cluster. However, if it cannot retrieve the seqno from all the nodes, it will not restart the cluster, as it cannot determine the bootstrap node with certainty. If needed, you can override the resource agent's decision by forcing the bootstrap node manually.

  If you need to force the bootstrap, make sure to select the node with the latest activity, otherwise you will lose data. In the example, since server2 had the highest seqno, you can bootstrap the cluster with the following steps:
  - Temporarily take galera out of pacemaker's control.

        [root@server ~]# pcs resource unmanage galera
  - Connect to server2 and run these commands locally to force the node to bootstrap the cluster.

        [root@server2 ~]# crm_attribute -N server2 -l reboot --name galera-bootstrap -v true
        [root@server2 ~]# crm_resource --force-promote -r galera -V
  - Then, instruct pacemaker to re-detect the current state of the galera resource. This will clean up the failcount and purge knowledge of past failures.

        [root@server2 ~]# pcs resource cleanup galera
  - At this point galera is up and pacemaker knows that it is up. Give control of galera back to pacemaker and monitor the status as the remaining resources are restarted.

        [root@server2 ~]# pcs resource manage galera
        [root@server2 ~]# pcs status
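For reference, the force-bootstrap sequence above can be gathered into a small helper. This is a sketch only: server2 is the example node and must be replaced with whichever node has the highest seqno, and it is wrapped in a function so nothing runs until you call it:

```shell
# Sketch of the force-bootstrap sequence above, wrapped in a function;
# run it locally on the node with the latest activity (highest seqno).
force_bootstrap_galera() {
    node=$1    # e.g. server2 in this article's example

    pcs resource unmanage galera
    crm_attribute -N "$node" -l reboot --name galera-bootstrap -v true
    crm_resource --force-promote -r galera -V
    pcs resource cleanup galera
    pcs resource manage galera
    pcs status
}
# usage, on the chosen node: force_bootstrap_galera server2
```

Keeping the steps in one place reduces the chance of running them on the wrong node, which is the main data-loss risk here.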
For restarting the cluster on OSP 5, continue by applying the following steps:

- Temporarily stop pacemaker control of the database.

        [root@server3 ~]# pcs resource disable mysqld-clone
- Verify no mysqld is running on any cluster member.

        [root@server3 ~]# ps -ef | grep mysqld | grep -v grep
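Checking every member by hand can be scripted. A sketch, assuming passwordless ssh between the controllers; the node names passed in are placeholders:

```shell
# Sketch: confirm mysqld is not running on each named node; assumes
# passwordless ssh between the controllers. Node names are placeholders.
assert_mysqld_stopped() {
    for node in "$@"; do
        if ssh "$node" pgrep -x mysqld >/dev/null 2>&1; then
            echo "mysqld still running on $node"
            return 1
        fi
    done
    echo 'mysqld stopped on all checked nodes'
}
# usage: assert_mysqld_stopped server1 server2 server3
```

pgrep -x matches the exact process name, so a stray grep or editor with "mysqld" in its arguments will not cause a false positive.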
- Start the database on the first node.

  - This initiates the Galera cluster. Since server2 had the highest seqno, that is the node to start first.

        [root@server2 ~]# /usr/libexec/mysqld --wsrep-cluster-address='gcomm://' &
        [1] 1910
        [root@server2 ~]# 140929 16:31:00 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
        140929 16:31:00 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
        /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later
  - Verify that this brought the node into sync.

        [root@server2 ~]# clustercheck
        HTTP/1.1 200 OK
        Content-Type: text/plain
        Connection: close
        Content-Length: 32

        Galera cluster node is synced.
- Start the database on the remaining nodes.

  - On another cluster member, start the database, but use the address of the node that was started first, then verify that this node reports synced.

        [root@server1 ~]# /usr/libexec/mysqld --wsrep-cluster-address='gcomm://10.19.139.32' &
        [1] 10603
        [root@server1 ~]# 140929 16:34:17 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
        140929 16:34:17 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
        /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later

        [root@server1 ~]# clustercheck
        HTTP/1.1 200 OK
        Content-Type: text/plain
        Connection: close
        Content-Length: 32

        Galera cluster node is synced.
  - It may take time for the node to sync, so re-issue the clustercheck command if it reports the node is not in sync at first.

  - Repeat for all remaining cluster nodes.
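Instead of re-running clustercheck by hand, a small polling helper can wait for sync. A sketch; the 30-attempt ceiling is arbitrary, and the command to poll (e.g. clustercheck) is passed as an argument:

```shell
# Sketch: poll a clustercheck-style command until its HTTP status line
# reports 200, giving up after 30 attempts; the ceiling is arbitrary.
wait_for_sync() {
    check=$1    # command to run, e.g. clustercheck
    tries=0
    while [ "$tries" -lt 30 ]; do
        if "$check" | grep -q ' 200 '; then
            echo synced
            return 0
        fi
        tries=$((tries + 1))
        sleep 1
    done
    echo 'timed out waiting for sync'
    return 1
}
# usage: wait_for_sync clustercheck
```

The non-zero return on timeout lets the helper be chained with && when bringing nodes up one at a time.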
- Re-enable pacemaker control of the database. This command can be run on any node, and pacemaker will attempt to start mysqld on each node.

        [root@server1 ~]# pcs resource enable mysqld-clone
- Switch to the pacemaker-controlled database on each node.

  - One node at a time, perform all of the following, so that the node has mysqld running under pacemaker before moving to the next node:
    - Stop the mysqld service that was started by hand, and confirm it is stopped.

          [root@server1 ~]# kill %%
          [root@server1 ~]#
          [1]+  Done    /usr/libexec/mysqld --wsrep-cluster-address='gcomm://1.2.3.4'
    - Pacemaker should attempt to start mysqld. It may fail initially, so wait and repeat clean-ups until it is started on the current node. Once pacemaker has it started, check to confirm it is in sync.

          [root@server1 ~]# pcs status | grep -A 3 mysqld
           Clone Set: mysqld-clone [mysqld]
               Stopped: [ server1.example.com server2.example.com server3.example.com ]
           ip-1.2.3.4 (ocf::heartbeat:IPaddr2): Started server2.example.com
           Clone Set: openstack-keystone-clone [openstack-keystone]
          ...
          mysqld_start_0 on server1.example.com 'not running' (7): call=199, status=complete, last-rc-change='Mon Sep 29 16:59:37 2014', queued=10ms, exec=2001ms
          rabbitmq-server_start_0 on server2.example.com 'OCF_PENDING' (196): call=190, status=complete, last-rc-change='Mon Sep 29 11:25:48 2014', queued=4ms, exec=2001ms
          mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
          mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms

          [root@server1 ~]# pcs resource cleanup mysqld
          Resource: mysqld successfully cleaned up

          [root@server1 ~]# pcs status | grep -A 3 mysqld
           Clone Set: mysqld-clone [mysqld]
               Started: [ server1.example.com ]
               Stopped: [ server2.example.com server3.example.com ]
           ip-1.2.3.4 (ocf::heartbeat:IPaddr2): Started server2.example.com
          ...
          mysqld_start_0 on server2.example.com 'not running' (7): call=206, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=11ms, exec=2001ms
          mysqld_start_0 on server3.example.com 'not running' (7): call=187, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=10ms, exec=2002ms

          [root@server1 ~]# clustercheck
          HTTP/1.1 200 OK
          Content-Type: text/plain
          Connection: close
          Content-Length: 32

          Galera cluster node is synced.
  - Repeat for all remaining cluster nodes, one at a time.
[1] See the on-fail property of resource operations: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_resource_operations.html