Configuring Instance High Availability in Red Hat OpenStack Platform on OpenShift

Introduction

The Instance High Availability service checks the health of each compute node by querying its service status from the nova database via the Nova API.

If a compute is "enabled" but has not reported its status for a configurable amount of time (default: 30 seconds), or its state is "down", it is considered for evacuation. Computes that have been manually disabled (status "disabled" in nova) are ignored, on the assumption that an operator disabled them for maintenance purposes.

NOTE: by default, the instanceha script evaluates how many compute nodes are impacted. If more than half of them are detected as down, it does not evacuate any of them, as the failure scenario is assumed to impact the entire aggregate/site. This threshold is configurable (see the configuration options below).
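The staleness and threshold checks described above can be sketched as follows. This is a minimal illustration, not the actual instanceha code; the DELTA and THRESHOLD values mirror the configuration options documented later in this article:

```python
from datetime import datetime, timedelta

DELTA = 30       # seconds without a status update before a host is suspect
THRESHOLD = 50   # % of failed computes above which IHA stands down

def failed_computes(services, now):
    """Return enabled computes that are down or have gone stale."""
    failed = []
    for svc in services:
        if svc["status"] != "enabled":   # manually disabled: assumed maintenance
            continue
        stale = now - svc["updated_at"] > timedelta(seconds=DELTA)
        if svc["state"] == "down" or stale:
            failed.append(svc["host"])
    return failed

def computes_to_evacuate(services, now):
    """Hands-off if more than THRESHOLD % of the computes look failed."""
    if not services:
        return []
    failed = failed_computes(services, now)
    if len(failed) * 100 / len(services) > THRESHOLD:
        return []                        # assume a site-wide outage instead
    return failed
```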

When one or more compute nodes are detected as failed, the instanceha script filters out all the empty computes and, where aggregate tagging is used, those that are not part of aggregates tagged with the specific metadata.

Optionally, instanceha can also detect whether a compute node is capturing a kernel dump and wait for kdump to finish before proceeding with the evacuation. This requires the kdump service on the compute nodes to be configured to emit UDP notifications.

Evacuation workflow
  1. Verifies whether a compute node hosts VMs that should be evacuated (by checking the 'evacuable' metadata on either the image or the flavor)
  2. If the operator has reserved spare compute nodes, tries to enable one for each failed compute. Currently, the case where no spare capacity is left is ignored.
  3. Performs IPMI/Redfish/metal3-based fencing (power OFF). fencing.yaml needs to be populated with the IP/port/user/password details for each compute. See the example below, as the format is slightly different from TripleO’s. If the computes have been deployed using metal3/baremetal-operator, use the ‘bmh’ driver.
  4. Calls nova to mark the host as force_down and disables it explicitly, adding a meaningful message and a timestamp in the "Disabled Reason" field
  5. Evacuates the workloads. Only VMs in the 'ACTIVE', 'ERROR' or 'STOPPED' state are evacuated.
  6. If the operator chose not to re-enable computes after evacuation, the script leaves the host disabled; otherwise it re-enables it.
  7. Once the evacuation is complete, the instanceha script tries to power the compute node back on via Redfish/IPMI.
  8. The script periodically polls each compute that is force_down to make sure its evacuations have moved from ‘done’ to ‘completed’ (this happens after the compute reboots). If so, it removes the force_down flag.
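Steps 1 and 5 of the workflow boil down to two simple filters, sketched here for illustration (the function and field names are hypothetical, not the actual instanceha code; the default tag comes from the EVACUABLE_TAG option described below):

```python
EVACUABLE_TAG = "evacuable"                        # default tag, see config options
EVACUABLE_STATES = {"ACTIVE", "ERROR", "STOPPED"}  # step 5: states eligible for evacuation

def is_evacuable(server, images, flavors):
    """Step 1: the tag may sit on either the image or the flavor."""
    tagged_images = {i["id"] for i in images if EVACUABLE_TAG in i.get("tags", [])}
    tagged_flavors = {f["id"] for f in flavors if EVACUABLE_TAG in f.get("tags", [])}
    return server["image_id"] in tagged_images or server["flavor_id"] in tagged_flavors

def servers_to_evacuate(servers, images, flavors):
    """Combine the tag check of step 1 with the state filter of step 5."""
    return [s["name"] for s in servers
            if is_evacuable(s, images, flavors) and s["status"] in EVACUABLE_STATES]
```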

Installing and configuring Instance HA

The instanceha service is part of the infra-operator, and its CRD is installed by default.

NOTE: Instance HA is not supported on RHOSP hyperconverged infrastructure (HCI) environments. To use Instance HA in your RHOSP HCI environment, you must designate a subset of the Compute nodes with the ComputeInstanceHA role. Red Hat Ceph Storage services must not be hosted on the Compute nodes that host Instance HA.

It can be instantiated by using a yaml file like the following:

[zuul@controller-0 ~]$ cat iha-0.yaml
---
apiVersion: instanceha.openstack.org/v1beta1
kind: InstanceHa
metadata:
  name: instanceha-0
spec:
  caBundleSecretName: combined-ca-bundle
  #networkAttachments: ['internalapi']
  #openStackCloud: "default"
  #openStackConfigMap:
  #openStackConfigSecret:
  fencingSecret: fencing-0
  #instanceHaConfigMap:
  #instanceHaKdumpPort:

Spec options:

// OpenStackCloud is the name of the Cloud to use as per clouds.yaml (will be set to "default" if empty)
OpenStackCloud string `json:"openStackCloud"`

// OpenStackConfigMap is the name of the ConfigMap containing the clouds.yaml
OpenStackConfigMap string `json:"openStackConfigMap"`

// OpenStackConfigSecret is the name of the Secret containing the secure.yaml (admin password)
OpenStackConfigSecret string `json:"openStackConfigSecret"`

// FencingSecret is the name of the Secret containing the fencing details
FencingSecret string `json:"fencingSecret"`

// InstanceHaConfigMap is the name of the ConfigMap containing the InstanceHa config file
InstanceHaConfigMap string `json:"instanceHaConfigMap"`

// InstanceHaKdumpPort is the UDP port used to receive kdump notifications
InstanceHaKdumpPort int32 `json:"instanceHaKdumpPort"`

// NodeSelector to target subset of worker nodes running control plane services (currently only applies to KeystoneAPI and PlacementAPI)
NodeSelector map[string]string `json:"nodeSelector,omitempty"`

// NetworkAttachments is a list of NetworkAttachment resource names to expose
// the services to the given network
NetworkAttachments []string `json:"networkAttachments,omitempty"`

Please adjust the caBundleSecretName if using a non-standard one.
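If you reference a custom ConfigMap via openStackConfigMap, it is expected to carry a clouds.yaml. A minimal sketch follows (the ConfigMap name, endpoint, and credentials are illustrative, adjust them for your environment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-openstack-config   # reference this name in spec.openStackConfigMap
data:
  clouds.yaml: |
    clouds:
      default:
        auth:
          auth_url: https://keystone-public-openstack.apps.example.com
          username: admin
          project_name: admin
          user_domain_name: Default
          project_domain_name: Default
        region_name: regionOne
```

The corresponding admin password lives in the secure.yaml carried by the Secret referenced in openStackConfigSecret.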

Users should also create a fencing secret with the IPMI, Redfish, or baremetal-operator/metal3 details of each compute node (the compute names need to match those known by nova):

[zuul@controller-0 ~]$ cat fencing-0.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: fencing-0
stringData:
  fencing.yaml: |
    FencingConfig:
      compute-0: #[SHORT HOSTNAME OF THE COMPUTE as known by nova]
        agent: redfish #[AGENT=redfish,ipmi,bmh]
        ipaddr: 192.168.111.9 #[IP ADDRESS OF ILO/DRAC/ETC]
        ipport: 8000
        login: admin
        passwd: password
        uuid: 2b399d93-85d4-4aaf-a744-6b9c69b7edb3 #[REDFISH NODE UUID]
      compute-1:
        agent: redfish
        ipaddr: 192.168.111.9
        ipport: 8000
        login: admin
        passwd: password
        uuid: b7d32e6b-edbc-477d-80bf-4cda77ada8cb
      compute-2:
        agent: ipmi
        ipaddr: 192.168.111.9
        ipport: 443
        login: admin
        passwd: password
      compute-3:
        agent: bmh
        host: compute-3 #[NAME OF THE BMH RESOURCE]
        namespace: openstack #[NAMESPACE OF THE BMH RESOURCE]
        token: XXXX #[OPENSHIFT SERVICEACCOUNT TOKEN]

If the compute nodes are provisioned using metal3/baremetal-operator, please make sure to use the “bmh” fencing agent and set the host value to the name of the resource as shown in “oc get bmh”:

[cifmw@haa-09 ~]$ oc get bmh
NAME         	STATE     	CONSUMER          	ONLINE   ERROR   AGE
edpm-compute-0   provisioned   openstack-edpm-ipam   true         	17h
edpm-compute-1   provisioned   openstack-edpm-ipam   true         	17h

You should create a serviceaccount allowed to perform operations on the baremetalhost resources. For more information see: https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/authentication_and_authorization/index .
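As a narrower alternative to granting cluster-admin as in the token example below, the serviceaccount can be scoped to just the BareMetalHost resources. The following is a sketch (the Role, RoleBinding, and serviceaccount names are illustrative; verify the verbs against your cluster):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: instanceha-bmh-fencing
  namespace: openstack            # namespace of the BMH resources
rules:
  - apiGroups: ["metal3.io"]
    resources: ["baremetalhosts"]
    verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: instanceha-bmh-fencing
  namespace: openstack
subjects:
  - kind: ServiceAccount
    name: instanceha-fencer       # hypothetical serviceaccount name
    namespace: openstack
roleRef:
  kind: Role
  name: instanceha-bmh-fencing
  apiGroup: rbac.authorization.k8s.io
```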

Example token generation (please note that this binds a cluster-admin role):

$ kubectl create serviceaccount k8sadmin -n kube-system
$ kubectl create clusterrolebinding k8sadmin --clusterrole=cluster-admin --serviceaccount=kube-system:k8sadmin
$ kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep k8sadmin | awk '{print $1}') | grep token: | awk '{print $2}'

On recent OpenShift versions (4.11+), where token Secrets are no longer created automatically for serviceaccounts, you can instead generate a token with "oc create token k8sadmin -n kube-system".

Apply both the fencing and iha files with:

[zuul@controller-0 ~]$ oc apply -f fencing-0.yaml
[zuul@controller-0 ~]$ oc apply -f iha-0.yaml

After a few seconds the instanceha resource should move to “Setup complete”:

[zuul@controller-0 ~]$ oc get instanceha
NAME       	STATUS   MESSAGE
instanceha-0   True 	Setup complete

And the respective pod should be running:

[zuul@controller-0 ~]$ oc get pods |grep instanceha
instanceha-0-54f865b6dd-w6h4t                                 	1/1 	Running 	0      	10h

If the user has not pre-created a ConfigMap containing the instanceha configuration file, the operator will automatically create one named instanceha-config:

[zuul@controller-0 ~]$ oc get cm instanceha-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    config:
      EVACUABLE_TAG: "evacuable"
      TAGGED_FLAVORS: "true"
      TAGGED_IMAGES: "true"
      SMART_EVACUATION: "false"
      DELTA: "30"
      POLL: "45"
      THRESHOLD: "50"
      WORKERS: "4"
      RESERVED_HOSTS: "false"
      LEAVE_DISABLED: "false"
      CHECK_KDUMP: "false"
      LOGLEVEL: "info"
      DISABLED: "false"
kind: ConfigMap

Configuration Options
  • EVACUABLE_TAG defaults to 'evacuable'. This tag is used to mark the flavors/images/aggregates to be evacuated.
  • TAGGED_FLAVORS defaults to ‘true’. Enables/disables checking for tagged flavors when deciding what to evacuate.
  • TAGGED_IMAGES defaults to ‘true’. Enables/disables checking for tagged images when deciding what to evacuate.
  • SMART_EVACUATION defaults to ‘false’. Controls whether the evacuation is monitored or not.

NOTE: if evacuations fail when using SMART_EVACUATION=true, the compute will be left ‘disabled’ and ‘force_down’. Manual cleanup is expected in such cases.

  • WORKERS defaults to 4. Controls the number of VMs evacuated in parallel when SMART_EVACUATION=true.
  • DELTA defaults to 30 seconds. Controls how long to wait before considering a host impacted, i.e. how quickly IHA reacts.
  • POLL defaults to 45 seconds. Controls how often the nova API is queried for the hypervisors' status.
  • THRESHOLD defaults to 50 (%). Defines how many compute nodes can fail before the failure scenario is treated as a disaster-recovery event and IHA adopts a hands-off approach.
  • RESERVED_HOSTS defaults to ‘false’. Controls whether IHA should check for spare compute nodes to re-enable before evacuating a broken compute. A spare host needs to be 'disabled' with a reason containing the word 'reserved':
openstack compute service set --disable --disable-reason reserved compute-1 nova-compute
  • LEAVE_DISABLED defaults to ‘false’. Controls whether IHA should re-enable a compute after evacuation.
  • CHECK_KDUMP defaults to ‘false’. Controls whether IHA should check if a compute is performing a kdump before fencing/evacuation. This requires the compute nodes to be configured to send notifications (see https://access.redhat.com/solutions/2876971) and InstanceHa to be configured to listen on the internalapi network, or an alternative network/VLAN shared with the computes. You can use the networkAttachments spec parameter to define which network the instanceha pod should be attached to:
[zuul@controller-0 ~]$ cat iha.yaml
apiVersion: instanceha.openstack.org/v1beta1
kind: InstanceHa
metadata:
  name: instanceha-0
spec:
  networkAttachments: ['internalapi']
  caBundleSecretName: combined-ca-bundle

Note: detecting kdumping compute nodes requires reverse DNS lookup, so that the source IP address can be translated into the respective compute (short) hostname as known by nova. For example, if the compute is known to nova as compute-gbr8p6xv-0.ctlplane.example.com and it uses the internalapi network to send kdump notifications, its internalapi IP 172.17.0.100 should resolve to compute-gbr8p6xv-0.internalapi.ocp.openstack.lab or any other similarly named alias.
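The lookup amounts to something like the following sketch (illustrative only; the resolver parameter exists so the example can be exercised without real DNS, in practice the system resolver is used):

```python
import socket

def compute_short_name(source_ip, resolver=socket.gethostbyaddr):
    # Reverse-resolve the kdump packet's source IP and keep only the
    # first DNS label: it must match the short hostname known by nova.
    fqdn, _aliases, _addrs = resolver(source_ip)
    return fqdn.split(".")[0]
```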

Note: you may have to install the fence-agents-kdump package, if not already present on the computes, before generating the kdump initrd image.

Warning: if you set CHECK_KDUMP=true, please ensure POLL is set to 45 seconds or higher to accommodate the internal check_kdump timeout of 30 seconds; otherwise instances may fail to evacuate.

Note: kexec-tools does not support OVS bonds/bridges, so the internalapi (or alternative) network should be either an untagged interface or a tagged VLAN, for example:

[root@compute-gbr8p6xv-0 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1.20
DEVICE=eth1.20
MTU=1496
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=yes
PEERDNS=no
VLAN=yes
PHYSDEV=eth1
BOOTPROTO=static
NETMASK=255.255.255.0
IPADDR=172.17.0.100

Example compute kdump.conf:

[root@compute-gbr8p6xv-0 ~]# cat /etc/kdump.conf
auto_reset_crashkernel yes
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
kdump_post /var/crash/sleep.sh
fence_kdump_nodes 172.17.0.36
fence_kdump_args -p 7410 -f auto -c 0 -i 5

  • DISABLED defaults to ‘false’. Controls whether IHA should evacuate computes after a fault. Setting this to true essentially puts IHA into a “hands off” mode.
  • LOGLEVEL defaults to ‘info’. Controls the verbosity of the logs. Change it to “debug” to increase it.

Performing maintenance with Instance HA

If a compute node needs to be temporarily excluded from instanceha monitoring, for example because it needs to be rebooted, it can simply be disabled using the nova API:

$ openstack compute service set --disable --disable-reason "maintenance" compute nova-compute

Alternatively, Instance HA can be temporarily disabled by setting DISABLED="true" in its config file:

[zuul@controller-0 ~]$ oc edit cm instanceha-config
..
apiVersion: v1
data:
  config.yaml: |
    config:
      ...
      DISABLED: "true"

Save and quit to persist the configuration.

Example failure / evacuation logs

[zuul@controller-0 ~]$ oc logs instanceha-0-54f865b6dd-w6h4t
2024-09-15 23:19:07,065 INFO Nova login successful
2024-09-15 23:21:38,105 WARNING The following computes are down:['compute-0.ctlplane.example.com']
2024-09-15 23:21:39,137 INFO Fencing compute-0.ctlplane.example.com
2024-09-15 23:21:39,137 INFO Fencing host compute-0.ctlplane.example.com off
2024-09-15 23:21:39,407 INFO Power off of compute-0.ctlplane.example.com ok
2024-09-15 23:21:39,824 INFO Nova login successful
2024-09-15 23:21:39,824 INFO Disabling compute-0.ctlplane.example.com before evacuation
2024-09-15 23:21:39,824 INFO Forcing compute-0.ctlplane.example.com down before evacuation
2024-09-15 23:21:40,094 INFO Service nova-compute on host compute-0.ctlplane.example.com is now disabled
2024-09-15 23:21:40,094 INFO Start evacuation of compute-0.ctlplane.example.com
2024-09-15 23:21:41,740 INFO Evacuation successful. Re-enabling compute-0.ctlplane.example.com
2024-09-15 23:21:41,741 INFO Fencing host compute-0.ctlplane.example.com on
2024-09-15 23:21:42,486 INFO Power on of compute-0.ctlplane.example.com ok
2024-09-15 23:21:42,486 INFO Trying to enable compute-0.ctlplane.example.com
2024-09-15 23:21:42,572 INFO Host compute-0.ctlplane.example.com is now enabled
2024-09-15 23:22:43,074 INFO Unsetting force-down on host compute-0.ctlplane.example.com after evacuation
2024-09-15 23:22:43,187 INFO Successfully unset force-down on host compute-0.ctlplane.example.com

Troubleshooting

In the event of slow or unsuccessful evacuations, computes may end up in a forced-down state. You can check the current status via:

sh-5.1$ openstack compute service list --long
+--------------------------------------+----------------+-----------------------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| ID                               	| Binary     	| Host                                	| Zone 	| Status  | State | Updated At             	| Disabled Reason | Forced Down |
+--------------------------------------+----------------+-----------------------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
...
| c9590263-7fbd-4782-85d8-7cf7526ed292 | nova-compute   | compute-fn93pyp7-0.ctlplane.example.com | nova 	| enabled | up	| 2024-10-08T04:59:23.000000 | None        	| False   	|
| 3319da3d-31e8-4877-a9d5-9f407e1356fa | nova-compute   | compute-fn93pyp7-1.ctlplane.example.com | nova 	| enabled | up	| 2024-10-08T04:59:18.000000 | None        	| False   	|
| 1b89c4b9-bf16-45c0-9517-8f652ea5a129 | nova-compute   | compute-fn93pyp7-2.ctlplane.example.com | nova 	| enabled | down  | 2024-10-08T04:59:18.000000 | None        	| True    	|

Instanceha continuously polls for evacuations to finish and, once they do, automatically re-enables the compute nodes by unsetting the “Forced Down” flag.
If this does not happen in a timely fashion, you can clean up the state via:

sh-5.1$ openstack compute service set --up compute-fn93pyp7-2.ctlplane.example.com nova-compute

Uninstalling instanceha

Delete the instanceha CR and, optionally, the ConfigMap containing the config file and the fencing secret:

[zuul@controller-0 ~]$ oc delete instanceha/instanceha-0
[zuul@controller-0 ~]$ oc delete secret/fencing-0
[zuul@controller-0 ~]$ oc delete cm/instanceha-config