Recover from partitioned clustered OVN database

Solution Verified

Environment

  • Red Hat OpenStack Platform 17.0
  • Red Hat OpenStack Platform 17.1

Issue

  • When a controller node is replaced, the OVN database cluster can end up in a partitioned state.

  • This issue is being investigated in RHBZ: 2222543

  • If you hit this issue and confirm it using the diagnostic steps below, please open a case with Red Hat Support citing this Knowledge Base article, and they will guide you to a resolution.

Resolution

Use the information collected in the Diagnostic Steps section below, substituting your own IP addresses and Server IDs for the examples shown here.

  • Manually remove the stale server entry ba93 from cluster b2c9 by running the following on controller-1 or controller-2:
podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound ba93
  • Clear controller-0 and join it to the existing cluster b2c9 by running the following on controller-0:
podman exec ovn_cluster_north_db_server rm /var/run/ovn/ovnnb_db.db /var/lib/ovn/.ovnnb_db.db.~lock~
podman exec ovn_cluster_north_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound tcp:172.17.1.89:6643 tcp:172.17.1.10:6643
systemctl restart tripleo_ovn_cluster_north_db_server
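After the restart, all three controllers should report the same Cluster ID. A minimal sanity check is sketched below; the sample input imitates the "Cluster ID" lines from the ansible status command in the Diagnostic Steps, and in practice you would pipe that command's real output into the same awk pipeline:

```shell
# Count distinct Cluster IDs reported across the controllers.
# A count of 1 means all members agree on the cluster; more than 1
# means the cluster is still partitioned.
status='Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)'
distinct=$(printf '%s\n' "$status" | awk '/^Cluster ID:/ {print $3}' | sort -u | wc -l)
echo "$distinct"
```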

The same steps must then be repeated for the southbound database cluster:

  • On the Undercloud, gather information:
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep -A4 Servers' Controller
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | egrep "Cluster ID|Role|Server ID"' Controller
  • Kick the stale server from the main cluster:
podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <Server ID>
  • Clear the partitioned controller and rejoin the cluster:
podman exec ovn_cluster_south_db_server rm /var/run/ovn/ovnsb_db.db /var/lib/ovn/.ovnsb_db.db.~lock~
podman exec ovn_cluster_south_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound <connection details> <leader connection details>
systemctl restart tripleo_ovn_cluster_south_db_server
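Since the northbound and southbound procedures differ only in container name, database file, and schema name, the whole sequence can be captured in one parameterized sketch. This is a dry run: the echo prefixes print the commands instead of executing them (remove them to run for real, the kick on a healthy member and the remaining three commands on the partitioned controller). The northbound arguments are the examples from this article; the southbound placeholders must be filled in from your own diagnostics:

```shell
# Dry-run sketch of the recovery sequence for either OVN database.
recover() {
  dir=$1; db=$2; schema=$3; stale=$4; self_addr=$5; leader_addr=$6
  c=ovn_cluster_${dir}_db_server
  # On a healthy cluster member: kick the stale entry
  echo podman exec "$c" ovn-appctl -t /var/run/ovn/${db}_db.ctl cluster/kick "$schema" "$stale"
  # On the partitioned controller: clear the local database and rejoin
  echo podman exec "$c" rm /var/run/ovn/${db}_db.db /var/lib/ovn/.${db}_db.db.~lock~
  echo podman exec "$c" ovsdb-tool join-cluster /var/lib/ovn/${db}_db.db "$schema" "$self_addr" "$leader_addr"
  echo systemctl restart tripleo_"$c"
}

recover north ovnnb OVN_Northbound ba93 tcp:172.17.1.89:6643 tcp:172.17.1.10:6643
recover south ovnsb OVN_Southbound '<Server ID>' '<connection details>' '<leader connection details>'
```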

Root Cause

  • When a controller is replaced, there is currently no logic to clean up the old server entry in the cluster or to join the replacement node to the existing cluster.

  • This issue is being investigated in RHBZ: 2222543

Diagnostic Steps

  • On the Undercloud node, run the following commands to see if you are affected:
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | egrep "Cluster ID|Role|Server ID"' Controller

controller-2 | CHANGED | rc=0 >>
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: a7ec (a7ec9d49-2281-4811-89e1-ac50f534ad56)
Role: follower
controller-0 | CHANGED | rc=0 >>
Cluster ID: 0eea (0eea95da-5ff3-4719-9475-7bb479e7e07b)
Server ID: 1143 (1143c224-6a6b-4e0e-9ca1-b24390e44555)
Role: leader
controller-1 | CHANGED | rc=0 >>
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: ce2b (ce2be8c7-5536-4c0f-a3a8-4ee17fc202f8)
Role: leader
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | grep -A4 Servers' Controller

controller-0 | CHANGED | rc=0 >>
Servers:
    1143 (1143 at tcp:172.17.1.89:6643) (self) next_index=3 match_index=3
controller-1 | CHANGED | rc=0 >>
Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 1789 ms ago
    ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 1622099 ms ago
    ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70
controller-2 | CHANGED | rc=0 >>
Servers:
    a7ec (a7ec at tcp:172.17.1.61:6643) (self)
    ba93 (ba93 at tcp:172.17.1.89:6643)
    ce2b (ce2b at tcp:172.17.1.10:6643) last msg 1838 ms ago

What this shows is:

  • controller-0 is in a cluster 0eea by itself and its connection details are tcp:172.17.1.89:6643
  • controller-0's old server entry (ba93) is still present as a stale member of the cluster running on controller-1 and controller-2
  • controller-1 and controller-2 are in a cluster b2c9
  • controller-1 is the leader with the server ID of ce2b and the connection details of tcp:172.17.1.10:6643
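The stale entry can also be spotted mechanically: its "last msg" age is orders of magnitude larger than that of healthy members. A minimal sketch follows; the heredoc reuses the controller-1 output above, and the 60-second threshold is an arbitrary assumption you may need to tune:

```shell
# Print any server entry whose last message is older than 60 seconds
# (60000 ms). Lines without a "last msg" field (e.g. the local server
# itself) are skipped by the pattern.
stale=$(awk '/last msg/ {
    for (i = 1; i <= NF; i++) if ($i == "msg") age = $(i + 1)
    if (age + 0 > 60000) print $1 " stale (" age " ms)"
}' <<'EOF'
    a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 1789 ms ago
    ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 1622099 ms ago
    ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70
EOF
)
echo "$stale"
```

Here ba93 is flagged because its last message is over 27 minutes old, matching the manual reading of the output above.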

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.