Recover from partitioned clustered OVN database
Environment
- Red Hat OpenStack Platform 17.0
- Red Hat OpenStack Platform 17.1
Issue
- When a controller node is replaced, the OVN database cluster can become partitioned.
- This issue is being investigated in RHBZ: 2222543.
- If you hit this issue and confirm it using the Diagnostic Steps, please open a case with Red Hat Support citing this Knowledge Base article, and Support will guide you to a resolution.
Resolution
Use the information collected in the Diagnostic Steps section below, substituting your own values for the example IP addresses and Server IDs shown.
- Manually remove the stale node from cluster b2c9 on controller-1 or controller-2:
podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound ba93
- Clear controller-0 and join it to the existing cluster b2c9 by running the following on controller-0:
podman exec ovn_cluster_north_db_server rm /var/run/ovn/ovnnb_db.db /var/lib/ovn/.ovnnb_db.db.~lock~
podman exec ovn_cluster_north_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound tcp:172.17.1.89:6643 tcp:172.17.1.10:6643
systemctl restart tripleo_ovn_cluster_north_db_server
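If it is not obvious which entry to kick, the Servers listing from the Diagnostic Steps gives it away: a stale server stops acknowledging messages, so its "last msg" age keeps growing while live peers stay in the low thousands of milliseconds. A minimal sketch of that check, run against the sample Servers output from the Diagnostic Steps below (the 60-second threshold is an arbitrary assumption, not anything OVN defines):

```shell
# Sample "Servers" output from controller-1 (copied from the Diagnostic
# Steps in this article). In practice, feed in live cluster/status output.
servers='a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 1789 ms ago
ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 1622099 ms ago
ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70'

# Treat a server whose "last msg" age exceeds 60 s (arbitrary threshold)
# as stale, and print its short Server ID (the first field).
stale=$(printf '%s\n' "$servers" | awk '
  /last msg/ {
    for (i = 1; i <= NF; i++) if ($i == "msg") age = $(i + 1)
    if (age + 0 > 60000) print $1
  }')
echo "stale server: $stale"
```

Here ba93 is reported, matching the Server ID kicked in the command above.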
The same steps need to be repeated on the southbound database cluster as well:
- On the Undercloud, gather information:
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep -A4 Servers' Controller
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | egrep "Cluster ID|Role|Server ID"' Controller
- Kick the stale server from the main cluster:
podman exec ovn_cluster_south_db_server ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <Server ID>
- Clear the partitioned controller and rejoin the cluster:
podman exec ovn_cluster_south_db_server rm /var/run/ovn/ovnsb_db.db /var/lib/ovn/.ovnsb_db.db.~lock~
podman exec ovn_cluster_south_db_server ovsdb-tool join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound <connection details> <leader connection details>
systemctl restart tripleo_ovn_cluster_south_db_server
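After both restarts, re-running the cluster/status commands from the Diagnostic Steps should report one and the same Cluster ID on all three controllers. A minimal sketch of that sanity check against saved output (the sample lines below are hypothetical healthy output, not taken from this article):

```shell
# Hypothetical post-recovery output: one "Cluster ID" line per controller,
# all three reporting the same cluster (b2c9).
status='Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)'

# Count distinct short Cluster IDs; exactly one means the partition is healed.
ids=$(printf '%s\n' "$status" | awk '/Cluster ID:/ {print $3}' | sort -u | grep -c .)
if [ "$ids" -eq 1 ]; then
  echo "cluster healed: single Cluster ID"
else
  echo "still partitioned: $ids distinct Cluster IDs"
fi
```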
Root Cause
- When a controller is replaced, there is currently no logic to clean up the old server entry in the cluster or to join the replacement node to the existing cluster.
- This issue is being investigated in RHBZ: 2222543.
Diagnostic Steps
- On the Undercloud node, run the following commands to check whether you are affected:
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | egrep "Cluster ID|Role|Server ID"' Controller
controller-2 | CHANGED | rc=0 >>
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: a7ec (a7ec9d49-2281-4811-89e1-ac50f534ad56)
Role: follower
controller-0 | CHANGED | rc=0 >>
Cluster ID: 0eea (0eea95da-5ff3-4719-9475-7bb479e7e07b)
Server ID: 1143 (1143c224-6a6b-4e0e-9ca1-b24390e44555)
Role: leader
controller-1 | CHANGED | rc=0 >>
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Server ID: ce2b (ce2be8c7-5536-4c0f-a3a8-4ee17fc202f8)
Role: leader
ansible -i ./overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba 'podman exec ovn_cluster_north_db_server ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | grep -A4 Servers' Controller
controller-0 | CHANGED | rc=0 >>
Servers:
1143 (1143 at tcp:172.17.1.89:6643) (self) next_index=3 match_index=3
controller-1 | CHANGED | rc=0 >>
Servers:
a7ec (a7ec at tcp:172.17.1.61:6643) next_index=71 match_index=70 last msg 1789 ms ago
ba93 (ba93 at tcp:172.17.1.89:6643) next_index=71 match_index=70 last msg 1622099 ms ago
ce2b (ce2b at tcp:172.17.1.10:6643) (self) next_index=64 match_index=70
controller-2 | CHANGED | rc=0 >>
Servers:
a7ec (a7ec at tcp:172.17.1.61:6643) (self)
ba93 (ba93 at tcp:172.17.1.89:6643)
ce2b (ce2b at tcp:172.17.1.10:6643) last msg 1838 ms ago
What this shows is:
- controller-0 is in a cluster (0eea) by itself, and its connection details are tcp:172.17.1.89:6643
- controller-0 is a stale entry (ba93) in the cluster running on controller-1 and controller-2
- controller-1 and controller-2 are in a cluster (b2c9)
- controller-1 is the leader, with the Server ID ce2b and the connection details tcp:172.17.1.10:6643
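The partition is visible directly from the Cluster ID lines: a healthy three-node cluster reports a single ID, while the output above shows two (b2c9 on controller-1 and controller-2, 0eea on controller-0 alone). A minimal sketch that summarizes Cluster ID membership from that saved output:

```shell
# "Cluster ID" lines as reported by controller-2, controller-0 and
# controller-1 in the ansible output above.
status='Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)
Cluster ID: 0eea (0eea95da-5ff3-4719-9475-7bb479e7e07b)
Cluster ID: b2c9 (b2c9d033-f770-4763-998d-f0d7a819fd6c)'

# Count how many controllers report each short Cluster ID.
summary=$(printf '%s\n' "$status" \
  | awk '/Cluster ID:/ {n[$3]++} END {for (id in n) print id, n[id]}' \
  | sort)
echo "$summary"

# More than one group means the cluster is partitioned.
groups=$(printf '%s\n' "$summary" | grep -c .)
echo "distinct clusters: $groups"
```

With the sample data this prints two groups (0eea with one member, b2c9 with two), confirming the partition described above.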
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.