Satellite 6 or Capsule stops processing pulp tasks


Environment

  • Red Hat Satellite 6 or external Capsule

Issue

  • Satellite tasks such as Actions::Katello::Repository::Sync, Actions::Katello::Host::UploadPackageProfile, Actions::Katello::Host::GenerateApplicability, or Actions::Katello::ContentView::Publish (or Capsule sync, on a Capsule) remain pending for many hours without progress.
  • Dynflow steps are in the "waiting on pulp to start the task" status; optionally they time out after 1 hour with a RestClient::Exceptions::ReadTimeout error.

Resolution

Please note that this workaround applies only in the very specific situation where all of the Diagnostic Steps below match.

  • As a permanent solution, upgrade qpid-cpp packages to 1.36.0-32.el7_9amq or higher.

  • As a workaround, restart pulp_resource_manager service:

     # systemctl restart pulp_resource_manager.service
    
  • As a preventive action, one can disable SSL for the pulp<->qpidd communication. Since this communication happens over the loopback interface only, this step is safe. To do so, apply these configuration changes:

    • In /etc/qpid/qpidd.conf, comment out line require-encryption=yes by adding # at the beginning of the line
    • In /etc/pulp/server.conf, apply this diff ('-' means old content, '+' is new content of the line)
    @@ -266,7 +266,7 @@ oauth_secret: secret
     #     The AMQP URL for event notifications. Defaults to 'qpid://localhost:5672/'.
     
     [messaging]
    -url: ssl://localhost:5671
    +url: tcp://localhost:5672
     transport: qpid
     auth_enabled: false
     cacert: /etc/pki/pulp/qpid/ca.crt
    @@ -311,8 +311,8 @@ event_notifications_enabled: false
     #     going missing incorrectly. Defaults to 30.
     
     [tasks]
    -broker_url: qpid://localhost:5671
    -celery_require_ssl: true
    +broker_url: qpid://localhost:5672
    +#celery_require_ssl: true
     cacert: /etc/pki/pulp/qpid/ca.crt
     keyfile: /etc/pki/pulp/qpid/client.crt
     certfile: /etc/pki/pulp/qpid/client.crt
    
  • Optionally, run these sed commands to set the configuration changes directly from the command line:

    # sed -i "s/^require-encryption=yes/#require-encryption=yes/g" /etc/qpid/qpidd.conf
    # sed -i "s/^celery_require_ssl: true/#celery_require_ssl: true/g" /etc/pulp/server.conf
    # sed -i 's!^url: ssl://localhost:5671!url: tcp://localhost:5672!g' /etc/pulp/server.conf
    # sed -i 's!^broker_url: qpid://localhost:5671!broker_url: qpid://localhost:5672!g' /etc/pulp/server.conf
    
  • Restart all Satellite services to apply the changes (it is recommended that no pulp task be in progress):

    # foreman-maintain service restart
    
  • Note that any execution of satellite-installer (or any Satellite upgrade) will revert these changes, which must then be re-applied.
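Because the installer reverts the workaround, it can help to verify the configuration after each satellite-installer run. A minimal sketch, assuming the settings shown in the diff above; check_workaround is a hypothetical helper, and the sample fragment stands in for /etc/pulp/server.conf:

```shell
# Sketch: detect whether the non-SSL workaround is still applied in a pulp
# server.conf. Hypothetical helper, not a Red Hat-shipped tool.
check_workaround() {
    # $1: path to server.conf; succeeds when no SSL broker settings remain
    ! grep -qE '^(url: ssl://localhost:5671|broker_url: qpid://localhost:5671|celery_require_ssl: true)' "$1"
}

# Demo on a sample config fragment (stands in for /etc/pulp/server.conf):
cat > /tmp/server.conf.sample <<'EOF'
[messaging]
url: tcp://localhost:5672
[tasks]
broker_url: qpid://localhost:5672
#celery_require_ssl: true
EOF

if check_workaround /tmp/server.conf.sample; then
    echo "workaround applied"
else
    echo "workaround reverted - re-apply the sed commands above"
fi
```

Run against the real /etc/pulp/server.conf, a "reverted" result means the sed commands above need to be applied again, followed by a service restart.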

For more KB articles/solutions related to Red Hat Satellite 6.x Pulp 2.0 Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Pulp 2.0-related Issues

Root Cause

There is a bug in python-qpid where concurrent sending of huge messages over SSL can cause either sender to get stuck forever.

Diagnostic Steps

The diagnostic steps below can be performed automatically by the attached test_resource-manager-stuck.sh script.

  1. Check the output of qpid-stat --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 -q | grep resource (or, in a sosreport, the file sos_commands/pulp/qpid-stat_-q_--ssl-certificate_.etc.pki.pulp.qpid.client.crt_-b_amqps_..localhost_5671) for pending messages in the pulp task queues, like:
  reserved_resource_worker-0@satellite.example.com.celery.pidbox       Y                 0     21     21       0   11.8k    11.8k        1     2
  reserved_resource_worker-0@satellite.example.com.dq2            Y                      0   3.09k  3.09k      0   13.3m    13.3m        1     2
  reserved_resource_worker-1@satellite.example.com.celery.pidbox       Y                 0     21     21       0   11.8k    11.8k        1     2
  reserved_resource_worker-1@satellite.example.com.dq2            Y                      0   3.00k  3.00k      0   12.9m    12.9m        1     2
  reserved_resource_worker-2@satellite.example.com.celery.pidbox       Y                 0     21     21       0   11.8k    11.8k        1     2
  reserved_resource_worker-2@satellite.example.com.dq2            Y                      0   3.31k  3.31k      0   12.7m    12.7m        1     2
  reserved_resource_worker-3@satellite.example.com.celery.pidbox       Y                 0     21     21       0   11.8k    11.8k        1     2
  reserved_resource_worker-3@satellite.example.com.dq2            Y                      0   18.6k  18.6k      0   67.7m    67.7m        1     2
  resource_manager                                                 Y                   7.50k  21.5k  14.0k   11.9m   107m    95.4m        1     2
  resource_manager@satellite.example.com.celery.pidbox                 Y                21     21      0    11.8k  11.8k       0         1     2
  resource_manager@satellite.example.com.dq2                      Y                      0      0      0       0      0        0         1     2

If:

  • the second right-most column contains only ones,
  • the resource_manager queue has many messages (7.50k in this example),
  • but no worker queue has any messages,

then the resource_manager pair of processes is most probably stuck.
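The queue-level check above can be scripted. A minimal sketch, assuming the column layout of the example qpid-stat output (third column is the message depth); qpid_stat_sample is an illustrative stand-in for the live command output:

```shell
# Sketch: flag a stuck resource_manager pair from `qpid-stat -q` output.
# Heuristic from the diagnostics above: messages pile up in the
# resource_manager queue while all worker .dq2 queues stay empty.
# qpid_stat_sample stands in for the live command output (illustrative).
qpid_stat_sample() {
cat <<'EOF'
reserved_resource_worker-0@satellite.example.com.dq2  Y  0      3.09k  3.09k  0      13.3m  13.3m  1  2
reserved_resource_worker-1@satellite.example.com.dq2  Y  0      3.00k  3.00k  0      12.9m  12.9m  1  2
resource_manager                                      Y  7.50k  21.5k  14.0k  11.9m  107m   95.4m  1  2
EOF
}

verdict=$(qpid_stat_sample | awk '
    $1 == "resource_manager" && $3 != "0" { rm_backlog = 1 }
    $1 ~ /\.dq2$/ && $3 != "0"            { worker_busy = 1 }
    END {
        if (rm_backlog && !worker_busy)
            print "resource_manager looks stuck"
        else
            print "no backlog detected"
    }')
echo "$verdict"
```

On a live system, the same awk filter can be fed directly from the qpid-stat command shown in step 1.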

  2. Further, check sos_commands/pulp/mongo-reserved_resources (or the output of mongo pulp_database --eval "db.reserved_resources.find().pretty()") and verify that it contains exactly one record, like:
MongoDB shell version v3.4.9
connecting to: mongodb://localhost:27017/pulp_database
MongoDB server version: 3.4.9
{
        "_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "worker_name" : "reserved_resource_worker-2@satellite.example.com",
        "resource_id" : "repository:064b3821-82b6-40f3-8139-8cf9a415f474",
        "_ns" : "reserved_resources"
}

Note, however, that we have also seen an instance of the problem where reserved_resources was empty.
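Counting the records can be done with a simple grep on the mongo output. A minimal sketch; the heredoc stands in for the live mongo pulp_database --eval output (or the sosreport file), and the counting pattern is illustrative:

```shell
# Sketch: count reserved_resources records in the mongo output (or in the
# sosreport file sos_commands/pulp/mongo-reserved_resources). Exactly one
# record, with backlog in resource_manager, fits the stuck pattern above.
sample_output=$(cat <<'EOF'
MongoDB shell version v3.4.9
connecting to: mongodb://localhost:27017/pulp_database
MongoDB server version: 3.4.9
{
        "_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "worker_name" : "reserved_resource_worker-2@satellite.example.com",
        "resource_id" : "repository:064b3821-82b6-40f3-8139-8cf9a415f474",
        "_ns" : "reserved_resources"
}
EOF
)

# Each record carries exactly one "_ns" : "reserved_resources" line.
count=$(printf '%s\n' "$sample_output" | grep -c '"_ns" : "reserved_resources"')
echo "reserved_resources records: $count"
```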

  3. The task with the given _id is usually listed in sos_commands/pulp/pulp-running_tasks (or in the output of mongo pulp_database --eval "db.task_status.find({state:{\$ne: \"finished\"}}).pretty()") - sometimes the command output is truncated and the task id can be missing. If it is present, it is in a waiting state:
{
        "_id" : ObjectId("6058f77902c95c69450cec38"),
        "task_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "exception" : null,
        "task_type" : "pulp.server.managers.repo.sync.sync",
        "tags" : [
                "pulp:repository:064b3821-82b6-40f3-8139-8cf9a415f474",
                "pulp:action:sync"
        ],
        "finish_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "progress_report" : {

        },
        "worker_name" : null,
        "result" : null,
        "error" : null,
        "group_id" : null,
        "state" : "waiting",
        "start_time" : null
}

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.