Regaining a Quorum in a Galera Cluster in OpenStack on Red Hat Enterprise Linux OpenStack Platform 5 to 9


Galera maintains its own quorum, separate from pacemaker. There are situations where the Galera quorum can be lost, such as when the entire cluster is rebooted. In that case you have to manually restart services to regain quorum; this procedure is also called bootstrapping Galera.
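
For reference, a Galera node reports whether it currently belongs to the primary (quorate) component through the wsrep_cluster_status status variable. A minimal check, assuming the local mysqld is still responding, looks like this:

    [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"

A value of Primary means the node is part of the quorate component; non-Primary indicates that quorum has been lost.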

Note: In RHEL OSP 6 and later, the galera resource agent automatically detects loss of quorum and recovers by applying the bootstrapping procedure below. However, this can fail if the agent cannot get information from all the galera nodes. In that case the cluster can still be recovered manually, provided the node with the most recent data can be determined.

Overview

Here is a high-level outline of the steps to re-establish the Galera quorum.

  1. Determine the loss of quorum.
  2. Determine systems with last activity.
  3. Temporarily stop pacemaker control of the database.
  4. Start the database on the first node.
  5. Start the database on remaining nodes.
  6. Re-enable pacemaker control of the database.
  7. Switch to the pacemaker-controlled database on each node.

Prerequisite

Detailed Procedures

  1. Determine loss of quorum.

    1. A good first sign is that pacemaker does not have the mysqld-clone service running on a majority of nodes. Invoke the following on any node.

       [root@server3 ~]# pcs status | grep -A 2 mysql
        Clone Set: mysqld-clone [mysqld]
            Started: [ server1.example.com ]
            Stopped: [ server2.example.com server3.example.com ]
       ...  
           mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
           mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms
      
    2. Check the /var/log/mariadb/mariadb.log file on each system for errors. (Note: if the database is under the control of the Pacemaker Cluster Resource Manager, use the /var/log/mysqld.log file instead.)

       140929 11:25:40 [ERROR] WSREP: Local state seqno (1399488) is greater than group seqno (10068): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
       140929 11:25:40 [ERROR] Aborting
      
    3. The clustercheck command reports whether a system is in sync with the cluster.

       [root@server2 ~]# clustercheck
       HTTP/1.1 503 Service Unavailable
       Content-Type: text/plain
       Connection: close
       Content-Length: 36
      
       Galera cluster node is not synced.
      

    Note: From OSP 6 onwards, when pacemaker's resource agent notices that a galera node has lost quorum, it automatically stops that node. Depending on how the resource is configured, you may need to clean up the failure so that pacemaker can restart the galera cluster and regain quorum automatically.
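
    The cleanup referred to here is the same command used later in this article; for example, from any controller node (assuming the resource is named galera, as in OSP 6 and later deployments):

       [root@server1 ~]# pcs resource cleanup galera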

  2. Determine systems with last activity.

    3. Determine which system or systems have the highest valid sequence number (seqno) for the latest UUID.

    1. Orderly shutdown. If the cluster shut down cleanly, the /var/lib/mysql/grastate.dat file will contain a positive number for the seqno. Note which system or systems have the greatest seqno. However, if any system has a -1 value, the shutdown was not clean and another method of determining the seqno is needed.

       [root@server2 ~]# cat /var/lib/mysql/grastate.dat
       # GALERA saved state
       version: 2.1
       uuid:    b048715d-4369-11e4-b7ef-af1999a6c989
       seqno:   -1
       cert_index:
      
    2. Disorderly shutdown. The seqno is recorded in /var/log/mariadb/mariadb.log on RHOSP 5 or /var/log/mysqld.log on RHOSP 6 and later. Search for lines containing Found saved state, ignoring any -1 values. The last value on each line is in the form UUID:seqno.

       [root@server1 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140923 17:49:19 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2229
       140924 15:37:13 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:2248
       140929 11:24:26 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:10060
       [root@server1 ~]#
      
       [root@server2 ~]# tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140926 14:58:16 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:171535
       140929 11:24:28 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:1399488
       [root@server2 ~]#
      
       [root@server3 ~]# tail -n 2000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
       140923 17:36:57 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:36
       140923 17:43:18 [Note] WSREP: Found saved state: b048715d-4369-11e4-b7ef-af1999a6c989:785
       [root@server3 ~]#
      
    3. Notice all servers have the same UUID (b048715d-4369-11e4-b7ef-af1999a6c989), but server2 has the largest seqno (1399488).
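
    If you have several controllers, a small helper loop can collect the last recorded state from each node. This is only a sketch: the node names, root ssh access, and the RHOSP 5 log path are assumptions to adjust for your environment.

       # Print the last saved state (UUID:seqno) recorded on each node, skipping -1 entries.
       for node in server1.example.com server2.example.com server3.example.com; do
           echo -n "$node: "
           ssh root@"$node" "grep 'Found saved state' /var/log/mariadb/mariadb.log | grep -v ':-1' | tail -n 1"
       done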

    4. The seqno can also be determined from pacemaker. If pacemaker previously tried to restart the cluster, the value can be retrieved from the CIB, e.g. for node1:

       [root@node1 ~]# crm_attribute -N node1 -l reboot --name galera-last-committed -Q
      
    5. If the last seqno is not present in the CIB, it can be retrieved with MariaDB:

       [root@node1 ~]# mysqld_safe --wsrep-recover
       151002 13:59:35 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
       151002 13:59:35 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
       151002 13:59:35 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.2FkYLQ' --pid-file='/var/lib/mysql/db1-recover.pid'
       151002 13:59:50 mysqld_safe WSREP: Recovered position 4c7ba2a8-566a-11e5-8250-1e939ac17c77:9
       151002 13:59:52 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
      

      From OSP 6 onwards, pacemaker's resource agent automates these steps so that it can bootstrap the cluster. However, if it cannot retrieve the seqno from all the nodes, it will not restart the cluster, because it cannot determine the bootstrap node with certainty. If needed, you can override the resource agent's decision by forcing the bootstrap node manually.

      If you need to force the bootstrap, make sure to select the node with the latest activity, otherwise you will lose data. In this example, since server2 had the highest seqno, you can bootstrap the cluster with the following steps:

      1. Temporarily take the galera resource out of pacemaker's control.

        [root@server ~]# pcs resource unmanage galera
        
      2. Connect to server2 and run the following commands locally to force the node to bootstrap the cluster.

        [root@server2 ~]# crm_attribute -N server2 -l reboot --name galera-bootstrap -v true
        [root@server2 ~]# crm_resource --force-promote -r galera -V
        
      3. Then, instruct pacemaker to re-detect the current state of the galera resource. This cleans up the failcount and purges knowledge of past failures.

        [root@server2 ~]# pcs resource cleanup galera
        
      4. At this point galera is up and pacemaker knows that it is up. Give control of galera back to pacemaker and monitor the status as the remaining resources are restarted.

        [root@server2 ~]# pcs resource manage galera
        [root@server2 ~]# pcs status
        

      To restart the cluster on OSP 5, continue with the following steps:

    6. Temporarily stop pacemaker control of the database.

      [root@server3 ~]# pcs resource disable mysqld-clone
      
    7. Verify no mysqld is running on any cluster member.

      [root@server3 ~]# ps -ef | grep mysqld | grep -v grep
      
    8. Start the database on the first node.

    9. This initiates the Galera cluster. Since server2 had the highest seqno, that is the node to start first.

      [root@server2 ~]# /usr/libexec/mysqld --wsrep-cluster-address='gcomm://' &
      [1] 1910
      [root@server2 ~]# 140929 16:31:00 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
      140929 16:31:00 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
      /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later
      
    10. Verify that this brought this node into sync.

              [root@server2 ~]# clustercheck
              HTTP/1.1 200 OK
              Content-Type: text/plain
              Connection: close
              Content-Length: 32
      
              Galera cluster node is synced.
      
    11. Start the database on the remaining nodes.

    12. On another cluster member, start the database, but use the address of the node that was started first, and then verify that this node reports synced.

              [root@server1 ~]#  /usr/libexec/mysqld --wsrep-cluster-address='gcomm://10.19.139.32' &
              [1] 10603
              [root@server1 ~]# 140929 16:34:17 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
              140929 16:34:17 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
              /usr/libexec/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later
      
              [root@server1 ~]# clustercheck
              HTTP/1.1 200 OK
              Content-Type: text/plain
              Connection: close
              Content-Length: 32
      
              Galera cluster node is synced.
              [root@server1 ~]#
      
    13. It may take time for the node to sync, so re-issue the clustercheck command if it reports the node is not in sync at first.
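
    Sync progress can also be watched from MariaDB itself. The following is a minimal check of two standard Galera status variables, run locally on the joining node (assuming you can connect to the local mysqld):

       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"

    wsrep_local_state_comment reports Synced once the node has caught up, and wsrep_cluster_size shows how many nodes are currently part of the cluster.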

    14. Repeat for all remaining cluster nodes.

    15. Re-enable pacemaker control of the database. This command can be run on any node and pacemaker will attempt to start mysqld on each node.

      [root@server1 ~]# pcs resource enable mysqld-clone
      [root@server1 ~]#
      
    16. Switch to the pacemaker-controlled database on each node.

    17. One node at a time, perform all of the following, so that the node has mysqld running under pacemaker before moving to the next node:

      1. Stop the mysqld service that was started by hand, and confirm it is stopped.

        [root@server1 ~]# kill %%
        [root@server1 ~]#
        [1]+  Done                    /usr/libexec/mysqld --wsrep-cluster-address='gcomm://1.2.3.4'
        [root@server1 ~]#
        
      2. Pacemaker should attempt to start mysqld. It may fail initially, so wait and repeat clean-ups until it is started on the current node. Once pacemaker has it started, check to confirm it is in sync.

                    [root@server1 ~]# pcs status | grep -A 3 mysqld
                     Clone Set: mysqld-clone [mysqld]
                         Stopped: [ server1.example.com server2.example.com server3.example.com ]
                     ip-1.2.3.4	(ocf::heartbeat:IPaddr2):	Started server2.example.com
                     Clone Set: openstack-keystone-clone [openstack-keystone]
                    ...
                        mysqld_start_0 on server1.example.com 'not running' (7): call=199, status=complete, last-rc-change='Mon Sep 29 16:59:37 2014', queued=10ms, exec=2001ms
                        rabbitmq-server_start_0 on server2.example.com 'OCF_PENDING' (196): call=190, status=complete, last-rc-change='Mon Sep 29 11:25:48 2014', queued=4ms, exec=2001ms
                        mysqld_start_0 on server2.example.com 'OCF_PENDING' (196): call=183, status=complete, last-rc-change='Mon Sep 29 11:25:47 2014', queued=4ms, exec=2001ms
                        mysqld_start_0 on server3.example.com 'not running' (7): call=179, status=complete, last-rc-change='Mon Sep 29 11:29:33 2014', queued=4ms, exec=2002ms
        
                    PCSD Status:
                    [root@server1 ~]# pcs resource cleanup mysqld
                    Resource: mysqld successfully cleaned up
                    [root@server1 ~]# pcs status | grep -A 3 mysqld
                     Clone Set: mysqld-clone [mysqld]
                         Started: [ server1.example.com ]
                         Stopped: [ server2.example.com server3.example.com ]
                     ip-1.2.3.4	(ocf::heartbeat:IPaddr2):	Started server2.example.com
                    ...
                        mysqld_start_0 on server2.example.com 'not running' (7): call=206, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=11ms, exec=2001ms
                        mysqld_start_0 on server3.example.com 'not running' (7): call=187, status=complete, last-rc-change='Mon Sep 29 17:09:53 2014', queued=10ms, exec=2002ms
        
                    PCSD Status:
                    [root@server1 ~]# clustercheck
                    HTTP/1.1 200 OK
                    Content-Type: text/plain
                    Connection: close
                    Content-Length: 32
        
                    Galera cluster node is synced.
                    [root@server1 ~]#
        
        
    18. Repeat for all remaining cluster nodes, one at a time.
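
    Once every node is back under pacemaker control, it is worth a final check that the cluster has regained quorum. A minimal verification from any node could look like the following (output omitted): expect mysqld-clone to show Started on all nodes, clustercheck to return 200 OK, wsrep_cluster_size to equal the number of controllers, and wsrep_cluster_status to be Primary.

       [root@server1 ~]# pcs status | grep -A 2 mysql
       [root@server1 ~]# clustercheck
       [root@server1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"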
