Regaining a Quorum in a Galera Cluster in OpenStack on Red Hat Enterprise Linux OpenStack Platform 5 to 9


Galera maintains its own quorum, separate from pacemaker. There are situations where the Galera quorum can be lost, such as when the entire cluster is rebooted. In that case you have to manually restart services to regain quorum; this procedure is also called bootstrapping Galera.
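
For reference, a Galera node reports whether it currently belongs to the primary (quorate) component through the wsrep_cluster_status status variable. A minimal check, assuming the local mysqld is still responding, looks like this:

    [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"

A value of Primary means the node is part of the quorate component; non-Primary indicates that quorum has been lost.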

Note: In RHEL OSP 6 and later, the galera resource agent automatically detects loss of quorum and recovers by applying the bootstrapping procedure below. However, this can fail if the agent cannot get information from all the galera nodes. In that case the cluster can still be recovered manually, provided the node with the most recent data can be determined.

Overview

Here is a high-level outline of the steps to re-establish the Galera quorum.

  1. Determine the loss of quorum.
  2. Determine systems with last activity.
  3. Temporarily stop pacemaker control of the database.
  4. Start the database on the first node.
  5. Start the database on remaining nodes.
  6. Re-enable pacemaker control of the database.
  7. Switch to the pacemaker-controlled database on each node.

Prerequisite

Detailed Procedures

  1. Determine loss of quorum.

    1. A good first sign is that pacemaker does not have the mysqld-clone service running on a majority of nodes. Invoke the following on any node.

       [root@server3 ~]# pcs status | grep -A 2 mysql
        Clone Set: mysqld-clone [mysqld]
            Started: [ server1.example.com ]
            Stopped: [ server2.example.com server3.example.com ]
       ...  
           mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
           mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms
      
    2. Check the /var/log/mariadb/mariadb.log file on each system for errors. (Note: if the database is under the control of the Pacemaker Cluster Resource Manager, use the /var/log/mysqld.log file instead.)

       140929 11:25:40 [ERROR] WSREP: Local state seqno (1399488) is greater than group seqno (10068): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
       140929 11:25:40 [ERROR] Aborting
      
    3. The clustercheck command reports whether a system is in sync with the cluster.

       [root@server2 ~]# clustercheck
       HTTP/1.1 503 Service Unavailable
       Content-Type: text/plain
       Connection: close
       Content-Length: 36
      
       Galera cluster node is not synced.
      

    Note: From OSP 6 onwards, when pacemaker's resource agent notices that a galera node has lost quorum, it automatically stops that node. Depending on how the resource is configured, you may need to clean up the failure so that pacemaker can restart the galera cluster and regain quorum automatically.
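
    The cleanup referred to here is the same command used later in this article; for example, from any controller node (assuming the resource is named galera, as in OSP 6 and later deployments):

       [root@server1 ~]# pcs resource cleanup galera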

  2. Determine systems with last activity.

    3. Determine which system or systems have the highest valid sequence number (seqno) for the latest UUID.

    1. Orderly shutdown. If the cluster shut down cleanly, the /var/lib/mysql/grastate.dat file will contain a positive number for the seqno. Note which system or systems have the greatest seqno. However, if any system has a -1 value, the shutdown was not clean and another method of determining the seqno is needed.

       [root@server2 ~]# cat /var/lib/mysql/grastate.dat
       # GALERA saved state
       version: 2.1
       uuid:    b048715d-4369-11e4-b7ef-af1999a6c989
       seqno:   -1
       cert_index:
      
    2. Disorderly shutdown. The seqno is recorded in /var/log/mariadb/mariadb.log on RHOSP 5 or /var/log/mysqld.log on RHOSP 6 and later. Search for lines containing Found saved state, ignoring any -1 values. The last value on each line is in the form UUID:seqno.

       [root@server1 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140923 17:49:19 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2229
       140924 15:37:13 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2248
       140929 11:24:26 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:10060
       [root@server1 ~]#
      
       [root@server2 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140926 14:58:16 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:171535
       140929 11:24:28 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:1399488
       [root@server2 ~]#
      
       [root@server3 ~]# tail -n 2000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140923 17:36:57 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:36
       140923 17:43:18 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:785
       [root@server3 ~]#
      
    3. Notice all servers have the same UUID (b048715d-4369-11e4-b7ef-af1999a6c989), but server2 has the largest seqno (1399488).
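
    If you have several controllers, a small helper loop can collect the last recorded state from each node. This is only a sketch: the node names, root ssh access, and the RHOSP 5 log path are assumptions to adjust for your environment.

       # Print the last saved state (UUID:seqno) recorded on each node, skipping -1 entries.
       for node in server1.example.com server2.example.com server3.example.com; do
           echo -n "$node: "
           ssh root@"$node" "grep 'Found saved state' /var/log/mariadb/mariadb.log | grep -v ':-1' | tail -n 1"
       done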

    4. The seqno can also be determined from pacemaker. If pacemaker previously tried to restart the cluster, the value can be retrieved from the CIB, e.g. for node1:

       [root@node1 ~]# crm_attribute -N node1 -l reboot --name galera-last-committed -Q
      
    5. If the last seqno is not present in the CIB, it can be retrieved with MariaDB:

       [root@node1 ~]# mysqld_safe --wsrep-recover
       151002 13:59:35 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
       151002 13:59:35 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
       151002 13:59:35 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.2FkYLQ' --pid-file='/var/lib/mysql/db1-recover.pid'
       151002 13:59:50 mysqld_safe WSREP: Recovered position 4c7ba2a8-566a-11e5-8250-1e939ac17c77:9
       151002 13:59:52 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
      

      From OSP 6 onwards, pacemaker's resource agent automates these steps so that it can bootstrap the cluster. However, if it cannot retrieve the seqno from all the nodes, it will not restart the cluster, because it cannot determine the bootstrap node with certainty. If needed, you can override the resource agent's decision by forcing the bootstrap node manually.

      If you need to force the bootstrap, make sure to select the node with the latest activity, otherwise you will lose data. In this example, since server2 had the highest seqno, you can bootstrap the cluster with the following steps:

      1. Temporarily take the galera resource out of pacemaker's control.

        [root@server ~]# pcs resource unmanage galera
        
      2. Connect to server2 and run the following commands locally to force the node to bootstrap the cluster.

        [root@server2 ~]# crm_attribute -N server2 -l reboot --name galera-bootstrap -v true
        [root@server2 ~]# crm_resource --force-promote -r galera -V
        
      3. Then, instruct pacemaker to re-detect the current state of the galera resource. This cleans up the failcount and purges knowledge of past failures.

        [root@server2 ~]# pcs resource cleanup galera
        
      4. At this point galera is up and pacemaker knows that it is up. Give control of galera back to pacemaker and monitor the status as the remaining resources are restarted.

        [root@server2 ~]# pcs resource manage galera
        [root@server2 ~]# pcs status
        

      To restart the cluster on OSP 5, continue with the following steps:

    6. Temporarily stop pacemaker control of the database.

      [root@server3 ~]# pcs resource disable mysqld-clone
      
    7. Verify no mysqld is running on any cluster member.

      [root@server3 ~]# ps -ef | grep mysqld | grep -v grep
      
    8. Start the database on the first node.

    9. This initiates the Galera cluster. Since server2 had the highest seqno, that is the node to start first.

      [root@server2 ~]# /usr/libexec/mysqld --wsrep-cluster-address='gcomm://' &
      [1] 1910
      [root@server2 ~]# 140929 16:31:00 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
      140929 16:31:00 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
      /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later
      
    10. Verify that this brought this node into sync.

              [root@server2 ~]# clustercheck
              HTTP/1.1 200 OK
              Content-Type: text/plain
              Connection: close
              Content-Length: 32
      
              Galera cluster node is synced.
      
    11. Start the database on the remaining nodes.

    12. On another cluster member, start the database, but use the address of the node that was started first, and then verify that this node reports synced.

              [root@server1 ~]#  /usr/libexec/mysqld --wsrep-cluster-address='gcomm://10.19.139.32' &
              [1] 10603
              [root@server1 ~]# 140929 16:34:17 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
              140929 16:34:17 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
              /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later
      
              [root@server1 ~]# clustercheck
              HTTP/1.1 200 OK
              Content-Type: text/plain
              Connection: close
              Content-Length: 32
      
              Galera cluster node is synced.
              [root@server1 ~]#
      
    13. It may take time for the node to sync, so re-issue the clustercheck command if it reports the node is not in sync at first.
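
    Sync progress can also be watched from MariaDB itself. The following is a minimal check of two standard Galera status variables, run locally on the joining node (assuming you can connect to the local mysqld):

       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"

    wsrep_local_state_comment reports Synced once the node has caught up, and wsrep_cluster_size shows how many nodes are currently part of the cluster.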

    14. Repeat for all remaining cluster nodes.

    15. Re-enable pacemaker control of the database. This command can be run on any node and pacemaker will attempt to start mysqld on each node.

      [root@server1 ~]# pcs resource enable mysqld-clone
      [root@server1 ~]#
      
    16. Switch to the pacemaker-controlled database on each node.

    17. One node at a time, perform all of the following, so that the node has mysqld running under pacemaker before moving to the next node:

      1. Stop the mysqld service that was started by hand, and confirm it is stopped.

        [root@server1 ~]# kill %%
        [root@server1 ~]#
        [1]+  Done                    /usr/libexec/mysqld --wsrep-cluster-address='gcomm://1.2.3.4'
        [root@server1 ~]#
        
      2. Pacemaker should attempt to start mysqld. It may fail initially, so wait and repeat clean-ups until it is started on the current node. Once pacemaker has it started, check to confirm it is in sync.

                    [root@server1 ~]# pcs status | grep -A 3 mysqld
                     Clone Set: mysqld-clone [mysqld]
                         Stopped: [ server1.example.com server2.example.com server3.example.com ]
                     ip-1.2.3.4	(ocf::heartbeat:IPaddr2):	Started server2.example.com
                     Clone Set: openstack-keystone-clone [openstack-keystone]
                    ...
                        mysqld_start_0 on server1.example.com 'not running' (7): call=199, status=complete, last-rc-change='Mon Sep 29 16:59:37 2014', queued=10ms, exec=2001ms
                        rabbitmq-server_start_0 on server2.example.com 'OCF_PENDING' (196): call=190, status=complete, last-rc-change='Mon Sep 29 11:25:48 2014', queued=4ms, exec=2001ms
                        mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
                        mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms
        
                    PCSD Status:
                    [root@server1 ~]# pcs resource cleanup mysqld
                    Resource: mysqld successfully cleaned up
                    [root@server1 ~]# pcs status | grep -A 3 mysqld
                     Clone Set: mysqld-clone [mysqld]
                         Started: [ server1.example.com ]
                         Stopped: [ server2.example.com server3.example.com ]
                     ip-1.2.3.4	(ocf::heartbeat:IPaddr2):	Started server2.example.com
                    ...
                        mysqld_start_0 on server2.example.com 'not running' (7): call=206, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=11ms, exec=2001ms
                        mysqld_start_0 on server3.example.com 'not running' (7): call=187, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=10ms, exec=2002ms
        
                    PCSD Status:
                    [root@server1 ~]# clustercheck
                    HTTP/1.1 200 OK
                    Content-Type: text/plain
                    Connection: close
                    Content-Length: 32
        
                    Galera cluster node is synced.
                    [root@server1 ~]#
        
        
    18. Repeat for all remaining cluster nodes, one at a time.
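
    Once every node is back under pacemaker control, it is worth a final check that the cluster has regained quorum. A minimal verification from any node could look like the following (output omitted): expect mysqld-clone to show Started on all nodes, clustercheck to return 200 OK, wsrep_cluster_size to equal the number of controllers, and wsrep_cluster_status to be Primary.

       [root@server1 ~]# pcs status | grep -A 2 mysql
       [root@server1 ~]# clustercheck
       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"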
