Common Issues with Red Hat OpenStack Platform

Solution Verified - Updated

Environment

Red Hat OpenStack Platform 5
Red Hat OpenStack Platform 6
Red Hat OpenStack Platform 7

Issue

  • I have deployed an OpenStack environment. How can I validate this environment in order to:
    • verify my deployment conforms to Red Hat recommended configurations?
    • confirm my deployment has workarounds or solutions applied for known problems?
  • Are there any check lists that I can go through to verify whether this environment will hit any known problems in the future?
  • What are common problems I can avoid in an OpenStack environment?

Resolution

Red Hat OpenStack Platform 7

RabbitMQ File Descriptors Limit

By default, rabbitmq-server is configured with a file descriptor limit of 1024, which is very low for an OpenStack environment. When the server hits this limit, it cannot open new file descriptors to process new messages and will refuse connections from OpenStack services, causing the associated OpenStack tasks to fail. It is recommended that you increase this limit to 65536.

For instructions on how to increase this limit, see this document:
Why do some Openstack services fail to finish tasks showing "message timeout" error in the logs?
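As a sketch (assuming a systemd-managed rabbitmq-server, as on RHEL 7; the exact procedure is in the linked solution), the limit can be raised with a systemd drop-in:

```shell
# Check the current limit as seen by RabbitMQ:
rabbitmqctl status | grep -A4 file_descriptors

# Raise the limit with a systemd drop-in (value per the recommendation above):
mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat > /etc/systemd/system/rabbitmq-server.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF

systemctl daemon-reload
systemctl restart rabbitmq-server
```

On controllers where RabbitMQ is managed by Pacemaker, restart it through pcs rather than systemctl.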

Message Queues Without A Consumer Can Slow Down Operations And Should Be Monitored

Messages in certain RabbitMQ queues may keep growing without being consumed if the consumers go down or crash. The more unconsumed messages accumulate in a queue, the more they slow down operations from OpenStack services.

It is recommended to monitor such queues and, if applicable, set a TTL based on the recommendations in the article Openstack services are slowing down while RabbitMQ messages are growing up without being consumed, to proactively prevent future issues with the OpenStack deployment.
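As a sketch, queues with a backlog but no consumer can be spotted with rabbitmqctl (the TTL value below is illustrative; use the pattern and value recommended in the linked article):

```shell
# List queue name, message backlog, and consumer count; rows with a
# non-zero backlog and zero consumers have no active consumer:
rabbitmqctl list_queues name messages consumers | awk '$2 > 0 && $3 == 0'

# Illustrative TTL policy (10 minutes) applied to all queues:
rabbitmqctl set_policy TTL ".*" '{"message-ttl":600000}' --apply-to queues
```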

Configure Heat/Director to support scaling out compute nodes

RHEL OSP Director uses Heat to deploy overcloud nodes. By default, Heat is configured with a max_resources_per_stack limit of 1000. Because each compute node consumes roughly 30 Heat resources, scaling an overcloud out to more than about 30 compute nodes hits this limit and fails with the error Maximum resources per stack exceeded.

It is recommended to increase this value appropriately if scaling above 30 compute nodes is required, by following the steps in Attempt to scale out compute nodes in RHEL Openstack 7 fails with "Maximum resources per stack exceeded"
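As a sketch, on the undercloud this might look like the following (the value 2000 is illustrative; size it to roughly 30 resources per compute node):

```shell
# Raise Heat's per-stack resource limit on the undercloud:
openstack-config --set /etc/heat/heat.conf DEFAULT max_resources_per_stack 2000

# Restart heat-engine so the new limit takes effect:
systemctl restart openstack-heat-engine
```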

Forged Transmits Should Be Enabled If The Undercloud Node Is A Virtual Machine On VMware ESXi

When the undercloud is a virtual machine running on VMware ESXi, DHCP during introspection succeeds, but it may fail during the deployment stage if Forged Transmits is disabled on the interface used for provisioning on the undercloud VM. Follow the steps in [Why overcloud nodes fail to get DHCP IP from undercloud node during deployment?](https://access.redhat.com/solutions/1980283) for details on how to enable Forged Transmits for an interface in VMware ESXi.

Instances using CPU pinning with an underlying NUMA cell fail to migrate (live or static) to other compute nodes

Instance migration fails when using cpu-pinning from a numa-cell and an instance flavor-property "hw:cpu_policy=dedicated"

OpenStack overcloud scale out gets stuck on Compute and NovaComputeDeployment due to issues with os-collect-config

When attempting to scale out an overcloud, the process can hang on Compute and NovaComputeDeployment. This may be caused by os-collect-config being stuck on the affected nodes, and can be resolved by restarting os-collect-config on those nodes.

For more information, see:
OpenStack overcloud scale out gets stuck on Compute and NovaComputeDeployment due to issues with os-collect-config
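A minimal sketch of the workaround, run on each affected overcloud node:

```shell
# Confirm os-collect-config is stuck, then restart it:
sudo systemctl status os-collect-config
sudo systemctl restart os-collect-config

# Watch the agent resume polling for metadata:
sudo journalctl -u os-collect-config -f
```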

How to limit the number of simultaneous / concurrent builds of overcloud nodes in OSP Director

When deploying or modifying an overcloud and performance issues are seen, the maximum number of concurrent builds can be modified on the undercloud. This is described here:

How to limit the number of simultaneous / concurrent builds of overcloud nodes in OSP Director
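As a sketch (the value 5 is illustrative; see the linked solution for sizing guidance), on the undercloud:

```shell
# Cap the number of overcloud nodes Nova will build at once:
openstack-config --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 5

# Restart the undercloud's nova-compute service (which drives Ironic):
systemctl restart openstack-nova-compute
```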

Red Hat OpenStack Platform 6

RabbitMQ File Descriptors Limit

By default, rabbitmq-server is configured with a file descriptor limit of 1024, which is very low for an OpenStack environment. When the server hits this limit, it cannot open new file descriptors to process new messages and will refuse connections from OpenStack services, causing the associated OpenStack tasks to fail. It is recommended that you increase this limit to 65536.

For instructions on how to increase this limit, see this document:
Why do some Openstack services fail to finish tasks showing "message timeout" error in the logs?

Confirm availability of free entitlements for all nodes before redeploying

All nodes in a deployment need a valid entitlement to be registered using subscription-manager. Each time an environment is deployed using the OpenStack installer/Director, all of the compute and controller nodes are registered with Red Hat Subscription Management. Each registration consumes one of the free entitlements and does not reuse the entitlement used during a previous registration. Once all free entitlements are exhausted, registration fails and, as a result, Puppet also fails because it cannot install the packages required to set up the controller/compute nodes. This results in a failed deployment.

To avoid this situation, it is recommended to verify that enough free entitlements are available for all nodes before starting to redeploy, by following the steps in OSP installer fails after redeployment
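For example, free entitlements can be checked from any registered system before redeploying (output field names may vary by subscription-manager version):

```shell
# Show subscriptions still available to this account:
subscription-manager list --available | grep -E '^(Subscription Name|Available):'

# If nodes from a previous deployment still hold entitlements,
# unregister them (on each old node) to return the subscriptions:
subscription-manager unregister
```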

Heat-API should be configured to spawn workers equal to the number of cores on the controller nodes

By default, heat-api is configured to spawn only one worker on each controller. If there are too many requests to heat-api, the single worker can become overloaded and subsequent requests will randomly time out. It is recommended to configure heat-api to spawn a number of workers equal to the number of cores on the system. Follow the steps in Request to heat-api times out randomly with error "Unable to establish connection to" to address this.
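As a sketch (the option's config section may vary by release; verify against the linked solution), on each controller:

```shell
# Match heat-api workers to the controller's core count:
CORES=$(getconf _NPROCESSORS_ONLN)
openstack-config --set /etc/heat/heat.conf heat_api workers "$CORES"

# Restart heat-api (use pcs instead if it is managed by Pacemaker):
systemctl restart openstack-heat-api
```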

Instances fail to create due to "Unexpected vif_type=binding_failed"

For more information, see:
Instance fails to create, errors in nova.log "NovaException: Unexpected vif_type=binding_failed"

Red Hat OpenStack Platform 5

Orchestration did not expect partial deletion of OS::Neutron::LoadBalancer resources

Prior to this update, Orchestration (Heat) did not expect partial deletion of OS::Neutron::LoadBalancer resources. As a result, it was not possible to completely delete the resource if deletion had previously failed after some members had already been removed. With this update, Orchestration is able to delete partially-deleted OS::Neutron::LoadBalancer resources. Consequently, should resource deletion fail for any reason partway through the process, the resource can still be deleted in a subsequent attempt.

Read more at: Heat stack delete using loadbalancer resource fails

Issue with deleting Cinder volumes

Can't delete Cinder volume

Issue with Cinder volume not attaching correctly

Cinder volume will not attach to running instance

Instances fail to create due to "Unexpected vif_type=binding_failed"

For more information, see:
Instance fails to create, errors in nova.log "NovaException: Unexpected vif_type=binding_failed"

qemu-kvm guests panic due to hung task time-out

qemu-kvm guests panic due to hung task time-out caused by a missing memory barrier in QEMU's AIO code

Live Migrations between two compute hosts with different CPU features results in an error

A configuration option allows administrators to disable the Nova CPU check, leaving it to libvirt to determine CPU compatibility. This works provided that all Nova compute nodes in the cloud are configured with a single baseline CPU model that every host is capable of running.

Attempting to live-migrate an instance (on shared NFS storage) between two compute hosts with different cpu features results in an error


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.