Why do some OpenStack services fail to finish tasks, showing a "message timeout" error in the logs?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux OpenStack Platform 5
  • Red Hat Enterprise Linux OpenStack Platform 6
  • Red Hat Enterprise Linux OpenStack Platform 7
  • Red Hat OpenStack Platform 8

Issue

  • Various OpenStack tasks fail with the below error message in the logs. The affected tasks can be anything that uses the messaging service, such as spawning instances in batches, creating networks, creating volumes, etc. The failure is usually random, e.g. two or three instances fail to spawn out of a batch of 10.
MessagingTimeout: Timed out waiting for a reply to message ID xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  • Restarting the rabbitmq-server service may solve this problem temporarily, but the problem will eventually reappear after some time.

Resolution

First, make sure you have applied the errata described in oslo.messaging holds connections when replies fail.

With the default configuration, this usually happens because rabbitmq-server ships with a default file_descriptor limit of 924, which is very low for an OpenStack environment. To get the existing limit, run the below command.

# rabbitmqctl report | grep -A3 file_descriptor
 {file_descriptors,[{total_limit,924},
                    {total_used,924},
                    {sockets_limit,829},
                    {sockets_used,829}]},
--
 {file_descriptors,[{total_limit,924},
                    {total_used,924},
                    {sockets_limit,829},
                    {sockets_used,829}]},
--
 {file_descriptors,[{total_limit,924},
                    {total_used,924},
                    {sockets_limit,829},
                    {sockets_used,829}]},

When the server hits this limit, it cannot open a new file_descriptor to process more messages and refuses those connections which causes failure for the associated openstack task.
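The effect of hitting the limit can be reproduced outside RabbitMQ. Below is a minimal sketch, assuming bash (the `{fd}` automatic-descriptor syntax is bash-specific): lower the nofile limit in a subshell and open descriptors until the kernel refuses with EMFILE, which is the same condition under which rabbitmq-server starts refusing connections.

```shell
# Demo only: exhaust a deliberately low nofile limit in a subshell,
# mirroring the EMFILE failures rabbitmq-server hits at its ceiling.
count=$(
  (
    ulimit -n 32                      # artificially low soft limit for the demo
    i=0
    while exec {fd}</dev/null; do     # each iteration opens one more descriptor
      i=$((i+1))
    done 2>/dev/null                  # the final, failing open reports EMFILE
    echo "$i"
  )
)
echo "opened $count additional descriptors before hitting the limit"
```

The subshell keeps the lowered limit from affecting the rest of your session; RabbitMQ behaves the same way at its real ceiling, just with many more descriptors in play.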

It is recommended to increase this limit to 65436. Note that the Red Hat deployment tools - rhel-osp-installer for RHEL-OSP6 and OSP-Director for RHEL-OSP7 - set this correctly during deployment, but some older versions do not, leaving rabbitmq with the default file_descriptor limit. This applies to OSP6 deployments done with rhel-osp-installer releases prior to A3 and OSP7 deployments done with the OSP-Director GA version. If you are using third-party deployment tools such as Ansible, or a completely manual OpenStack deployment, make sure this change is integrated into your tooling or applied manually. There are two scenarios to take into consideration.

Single controller deployment - No HA

In this mode, rabbitmq-server is managed via systemctl. The file_descriptor limit can be increased by configuring LimitNOFILE=65436 for rabbitmq-server through a systemd drop-in. Follow the below steps.

  • Create a directory named /etc/systemd/system/rabbitmq-server.service.d/
#  mkdir -p /etc/systemd/system/rabbitmq-server.service.d/
  • Create limits.conf inside the directory with new limit.
# vi /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65436
  • Reload systemd, then restart the rabbitmq service via systemctl.
# systemctl daemon-reload
# systemctl restart rabbitmq-server
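For automation, the file steps above can be scripted instead of using vi. A sketch, using only coreutils: it stages the drop-in under a temporary directory by default so it can be tried safely; on a real controller set DROPIN_DIR=/etc/systemd/system/rabbitmq-server.service.d and follow with a daemon-reload and restart.

```shell
# Write the systemd drop-in non-interactively. DROPIN_DIR defaults to a
# temporary staging directory for safe testing; override it on a real node.
DROPIN_DIR=${DROPIN_DIR:-$(mktemp -d)/rabbitmq-server.service.d}
mkdir -p "$DROPIN_DIR"
printf '[Service]\nLimitNOFILE=65436\n' > "$DROPIN_DIR/limits.conf"
cat "$DROPIN_DIR/limits.conf"
# On the controller: systemctl daemon-reload && systemctl restart rabbitmq-server
```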

Three node HA controller deployment

This deployment has three HA controller nodes where rabbitmq-server is managed by pacemaker. Pacemaker can use either systemd or a pacemaker resource agent to manage rabbitmq; the latter is recommended. rhel-osp-installer versions prior to A3 in RHEL-OSP6 manage the rabbitmq resource using systemd; later versions moved to a pacemaker resource agent. The first task is to find out whether pacemaker is using systemd or a resource agent to manage rabbitmq. This can be determined by inspecting the output of the below commands.

  • In RHEL-OSP6
 pcs resource show rabbitmq-server | grep class
  • In RHEL-OSP7
 pcs resource show rabbitmq | grep class

If the output shows class=systemd, then it's using systemd. If the output shows class=ocf, then it's using resource agent.
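This check can be scripted as well. A sketch: the sample line stands in for real `pcs resource show` output, and the provider/type values in it are illustrative assumptions, not taken from this article.

```shell
# Classify how pacemaker manages rabbitmq from the pcs "class=" field.
# On a controller, replace the sample with:
#   line=$(pcs resource show rabbitmq | grep class)
line=' Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)'
case "$line" in
  *class=systemd*) mode="systemd" ;;
  *class=ocf*)     mode="resource agent" ;;
  *)               mode="unknown" ;;
esac
echo "rabbitmq is managed via: $mode"
```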

Pacemaker uses systemd


If pacemaker uses systemd, follow the below steps on all controller nodes to increase the `file_descriptor` limit.
  • Create a directory named /etc/systemd/system/rabbitmq-server.service.d/
#  mkdir -p /etc/systemd/system/rabbitmq-server.service.d/
  • Create limits.conf inside the directory with new limit.
# vi /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65436
  • Reload systemd on each node, then restart the rabbitmq service via pcs.
# systemctl daemon-reload
# pcs resource restart rabbitmq-server

Pacemaker uses resource agent


If rabbitmq-server is managed by a pacemaker resource agent, follow below steps to increase the file_descriptor limit.
  • Create a file /etc/security/limits.d/rabbitmq-server.conf with the below contents.
# vi /etc/security/limits.d/rabbitmq-server.conf
rabbitmq soft nofile 65436
rabbitmq hard nofile 65436

  • Restart rabbitmq via pcs.

For RHEL-OSP6

pcs resource restart rabbitmq-server-clone

For RHEL-OSP7

pcs resource restart rabbitmq-clone
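As with the systemd case, the limits fragment can be generated non-interactively. A sketch: it stages the file under a temporary directory by default; on a real controller set LIMITS_DIR=/etc/security/limits.d before restarting via pcs as shown above.

```shell
# Write the PAM limits fragment for the rabbitmq user. LIMITS_DIR defaults
# to a temporary staging directory; override it on a real node.
LIMITS_DIR=${LIMITS_DIR:-$(mktemp -d)}
printf '%s\n' \
  'rabbitmq soft nofile 65436' \
  'rabbitmq hard nofile 65436' > "$LIMITS_DIR/rabbitmq-server.conf"
cat "$LIMITS_DIR/rabbitmq-server.conf"
```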

Director based update

The file descriptor limit for RabbitMQ is controlled by the OSP-Director deployment.

The following template parameter is used:

RabbitFDLimit:
  default: 4096
  description: Configures RabbitMQ FD limit
  type: string

During an upgrade or scale-out of the overcloud using director, the file descriptor limit will be reset to its default value of 4096 in OSP 7. Note that the default is 16384 in OSP 8.

Change these values by overriding the default in an environment file, so that Director does not re-apply the default value on the next stack update.

Example:

  • Default director deployment (OSP 7):
[root@overcloud-controller-0 ~]# rabbitmqctl report | grep -A3 file_descriptor
 {file_descriptors,[{total_limit,3996},
                    {total_used,72},
                    {sockets_limit,3594},
                    {sockets_used,70}]},
  • Environment file:
[stack@undercloud ~]$ cat templates/my-rabbit-fd-limits.yaml
parameter_defaults:
  RabbitFDLimit: '65436'
  • Add env file to deployment
$ cd /home/stack
$ openstack overcloud deploy \
  --templates /home/stack/templates/my-overcloud/ \
  --ntp-server 192.168.102.11 \
  --control-scale 1 \
  --compute-scale 1 \
  --neutron-network-type vxlan \
  --neutron-tunnel-types vxlan \
  -e /home/stack/templates/my-rabbit-fd-limits.yaml
(...)
Deploying templates in the directory /home/stack/templates/my-overcloud
Overcloud Endpoint: http://192.168.201.37:5000/v2.0
Overcloud Deployed
  • After overriding the parameter in the environment file and re-running the deploy:
[root@overcloud-controller-0 ~]# rabbitmqctl report | grep -A3 file_descriptor
 {file_descriptors,[{total_limit,65336},
                    {total_used,68},
                    {sockets_limit,58800},
                    {sockets_used,66}]},

Verification

Finally verify that the new file_descriptor limit has taken effect on all controller nodes.

# rabbitmqctl report | grep -A3 file_descriptor
 {file_descriptors,[{total_limit,65336},
                    {total_used,3},
                    {sockets_limit,58800},
                    {sockets_used,1}]},
--
 {file_descriptors,[{total_limit,65336},
                    {total_used,3},
                    {sockets_limit,58800},
                    {sockets_used,1}]},
--
 {file_descriptors,[{total_limit,65336},
                    {total_used,119},
                    {sockets_limit,58800},
                    {sockets_used,117}]},
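On a fleet of controllers this check can be scripted rather than eyeballed. A sketch: the sample string below stands in for the first line of the real `rabbitmqctl report | grep -A3 file_descriptor` output shown above.

```shell
# Extract total_limit from rabbitmqctl report output and assert it is high
# enough. On each controller, replace the sample with real output:
#   report=$(rabbitmqctl report | grep -A3 file_descriptor)
report=' {file_descriptors,[{total_limit,65336},'
limit=$(printf '%s\n' "$report" | sed -n 's/.*{total_limit,\([0-9]*\)}.*/\1/p' | head -n1)
if [ "$limit" -ge 65336 ]; then
  echo "file descriptor limit OK ($limit)"
else
  echo "file descriptor limit too low ($limit)"
fi
```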

Root Cause

This usually happens because rabbitmq-server is configured with the default file_descriptor limit of 924, which is very low for an OpenStack environment.

Important: Another cause is not having applied the errata described in oslo.messaging holds connections when replies fail.

Diagnostic Steps

rabbitmq-server reserves some of the file descriptors for internal purposes (this is an undocumented feature).

The relevant section in the code is
https://github.com/rabbitmq/rabbitmq-server/search?utf8=%E2%9C%93&q=RESERVED_FOR_OTHERS

-define(RESERVED_FOR_OTHERS, 100).
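The reports in this article are consistent with total_limit = nofile - RESERVED_FOR_OTHERS, with sockets_limit looking like trunc(total_limit * 0.9) - 2. The sockets formula is inferred from the figures above rather than quoted from the RabbitMQ source, so treat it as an approximation. A quick arithmetic check:

```shell
# Reproduce the limits reported earlier from a given ulimit -n value.
nofile=65436
total_limit=$(( nofile - 100 ))                 # RESERVED_FOR_OTHERS = 100
sockets_limit=$(( total_limit * 9 / 10 - 2 ))   # trunc(total_limit * 0.9) - 2
echo "total_limit=$total_limit sockets_limit=$sockets_limit"
# With nofile=1024 (a common default) the same arithmetic gives 924 and 829,
# matching the default-limit report shown in the Resolution section.
```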

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.