FIPs stopped working after network/controller node replacement

Solution Verified - Updated

Environment

  • Red Hat OpenStack Platform 16 (RHOSP16) - OVS/ML2
  • Red Hat OpenStack Platform 17 (RHOSP17) - OVS/ML2

Issue

  • After a network node replacement, instances behind some specific routers are no longer reachable via their FIP.
  • The router appears to be in the standby state on all of the active nodes.

Resolution

  • Power off the removed node so that it can no longer act as the keepalived master for the router.

Root Cause

  • keepalived moves a VIP address between the qrouter namespaces, placing it on the node it elects as master. When the keepalived VIP appears on a host, the neutron-keepalived-state-change-monitor process detects the change and notifies neutron-l3-agent, which then brings the other interfaces in the qrouter namespace up or down and updates the router's status in the Neutron database.
  • The replaced controller/networker node was left running in the cluster. keepalived on the old node kept serving the router, won the master election, and so the old node ended up hosting it. According to the documentation, only the agents of an old node need to be deleted; its VXLAN endpoint remains in the database, so the remaining OVS agents still try to establish tunnels to the deleted node. If the deleted node is still up, it can therefore remain the active one.

NOTE: TripleO undercloud automated node cleaning is disabled by default, so if you want to be sure that removed nodes are cleaned, you need to enable this option.
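Once the old node is powered off, the stale agents it left behind can be removed so Neutron stops scheduling to it and the remaining OVS agents drop the old VXLAN endpoint. A minimal shell sketch, assuming SSH/CLI access to the overcloud and a placeholder hostname for the replaced node:

```shell
# Sketch: delete the neutron agents still registered for the replaced
# node (the hostname passed in is a placeholder, not from this environment).
cleanup_stale_agents() {
    old_host="$1"
    # list the IDs of every agent still bound to the old host ...
    for agent in $(openstack network agent list --host "$old_host" -f value -c ID); do
        # ... and delete each one so neutron stops scheduling to it
        openstack network agent delete "$agent"
    done
}

# cleanup_stale_agents net-old.redhat.local
```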

Diagnostic Steps

  • The router appears in the standby state on all nodes, for example:
openstack router list
+--------------------------------------+---------+--------+-------+----------------------------------+------+
| ID                                   | Name    | Status | State | Project                          | HA   |
+--------------------------------------+---------+--------+-------+----------------------------------+------+
| 892d9481-579a-46d1-8873-0b48778bb163 | router1 | ACTIVE | UP    | c0d5e23caee74f0c983abff6fa3a025f | True |
+--------------------------------------+---------+--------+-------+----------------------------------+------+
neutron l3-agent-list-hosting-router router1
+--------------------------------------+--------------------+----------------+-------+----------+
| id                                   | host               | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------+----------------+-------+----------+
| 1cf0fafb-2cb4-46c2-8de0-fecebeb8e3cc | net-1.redhat.local | True           | :-)   | standby  |
| 951ef827-2cf8-4655-83e2-362dacbf8ed3 | net-0.redhat.local | True           | :-)   | standby  |
+--------------------------------------+--------------------+----------------+-------+----------+

If there were, for example, no VRRP communication between the nodes, the router would be reported as active on more than one node, not as standby everywhere.

  • Check whether keepalived set the router to backup on all network nodes: in the qrouter namespace, verify that the VIP address is not configured on any of the nodes.
# ip netns
qrouter-892d9481-579a-46d1-8873-0b48778bb163 (id: 2)
qdhcp-c3446526-ab42-403e-ae14-d5c3f1da50e3 (id: 1)
qdhcp-53ff131e-1a51-4b7d-a1b7-a47b8a6ff9f4 (id: 0)
# ip netns exec qrouter-892d9481-579a-46d1-8873-0b48778bb163 bash
# ip a
22: ha-a1ae43e9-c6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:27:2c:81 brd ff:ff:ff:ff:ff:ff
    inet 169.254.194.98/18 brd 169.254.255.255 scope global ha-a1ae43e9-c6
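To repeat this check across nodes quickly, a small helper (illustration only, not part of any Red Hat tooling) can test whether a given VIP appears in `ip addr` output:

```shell
# has_vip: succeed if the given address appears as an "inet" entry in
# `ip addr` output read from stdin (helper for illustration only).
has_vip() {
    grep -q "inet $1/" -
}

# On a live node, using the router id from the example above:
#   ip netns exec qrouter-892d9481-579a-46d1-8873-0b48778bb163 \
#       ip -o addr show | has_vip 169.254.0.96
```

If the helper fails on every node, no node holds the keepalived VIP, matching the all-standby symptom.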
  • To find out which IP address keepalived is managing, check the keepalived.conf file, for example:
cat /var/lib/neutron/ha_confs/892d9481-579a-46d1-8873-0b48778bb163/keepalived.conf
vrrp_instance VR_131 {
    ...
    virtual_ipaddress {
        169.254.0.96/24 dev ha-a1ae43e9-c6
    }
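The extraction can also be scripted; a sketch that prints the first virtual_ipaddress entry from a keepalived.conf file (path pattern as in the example above, with the router ID substituted):

```shell
# get_vip: print the first address listed in the virtual_ipaddress
# block of a keepalived.conf file.
get_vip() {
    awk '/virtual_ipaddress/ {f=1; next}
         f && /}/            {exit}
         f                   {print $1; exit}' "$1"
}

# get_vip /var/lib/neutron/ha_confs/892d9481-579a-46d1-8873-0b48778bb163/keepalived.conf
# -> 169.254.0.96/24
```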
  • Check whether the removed node is still up and still has the qrouter namespace active and running.

NOTE: To determine which namespace and file to check, use the router ID; in the example above it is 892d9481-579a-46d1-8873-0b48778bb163.
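This last check can be wrapped in a small script; a hedged sketch, assuming SSH access to the nodes and a placeholder hostname for the removed node:

```shell
# check_removed_node: report whether the removed node is reachable and,
# if so, whether it still hosts the router's qrouter namespace with a
# running keepalived (the hostname argument is a placeholder, not from
# this environment).
check_removed_node() {
    host="$1"; router_id="$2"
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        if ssh "$host" "ip netns | grep -q qrouter-$router_id && pgrep keepalived >/dev/null"; then
            echo "removed node still hosts the router"
        else
            echo "removed node up, but router not active there"
        fi
    else
        echo "removed node unreachable"
    fi
}

# check_removed_node net-old.redhat.local 892d9481-579a-46d1-8873-0b48778bb163
```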


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.