Satellite 6 or Capsule stops processing Pulp tasks
Environment
- Red Hat Satellite 6 or external Capsule
Issue
- Satellite tasks like `Actions::Katello::Repository::Sync`, `Actions::Katello::Host::UploadPackageProfile`, `Actions::Katello::Host::GenerateApplicability`, or `Actions::Katello::ContentView::Publish` (or Capsule sync, on a Capsule) remain pending for many hours without progress.
- Dynflow steps are in the `waiting on pulp to start the task` status; optionally they time out after 1 hour with a `RestClient::Exceptions::ReadTimeout` error.
Resolution
Please note that this workaround applies only in the very specific situation where all of the Diagnostic Steps below match.
- As a permanent solution, upgrade the `qpid-cpp` packages to `1.36.0-32.el7_9amq` or higher.
- As a workaround, restart the `pulp_resource_manager` service:

```
# systemctl restart pulp_resource_manager.service
```
- As a preventive action, one can disable SSL for the pulp<->qpidd communication. Since this communication happens over the loopback interface only, this remedy step is safe. To do so, apply these configuration changes:
  - In `/etc/qpid/qpidd.conf`, comment out the line `require-encryption=yes` by adding `#` at the beginning of the line.
  - In `/etc/pulp/server.conf`, apply this diff (`-` marks the old content of a line, `+` the new content):

```diff
@@ -266,7 +266,7 @@ oauth_secret: secret
 # The AMQP URL for event notifications. Defaults to 'qpid://localhost:5672/'.
 [messaging]
-url: ssl://localhost:5671
+url: tcp://localhost:5672
 transport: qpid
 auth_enabled: false
 cacert: /etc/pki/pulp/qpid/ca.crt
@@ -311,8 +311,8 @@ event_notifications_enabled: false
 # going missing incorrectly. Defaults to 30.
 [tasks]
-broker_url: qpid://localhost:5671
-celery_require_ssl: true
+broker_url: qpid://localhost:5672
+#celery_require_ssl: true
 cacert: /etc/pki/pulp/qpid/ca.crt
 keyfile: /etc/pki/pulp/qpid/client.crt
 certfile: /etc/pki/pulp/qpid/client.crt
```
- Optionally, run these `sed` commands to apply the configuration changes directly from the command line:

```
# sed -i "s/^require-encryption=yes/#require-encryption=yes/g" /etc/qpid/qpidd.conf
# sed -i "s/^celery_require_ssl: true/#celery_require_ssl: true/g" /etc/pulp/server.conf
# sed -i 's!^url: ssl://localhost:5671!url: tcp://localhost:5672!g' /etc/pulp/server.conf
# sed -i 's!^broker_url: qpid://localhost:5671!broker_url: qpid://localhost:5672!g' /etc/pulp/server.conf
```
- Restart all Satellite services to apply the change (it is recommended that no Pulp task is in progress):

```
# foreman-maintain service restart
```
- Recall that any execution of `satellite-installer` (or any Satellite upgrade) will revert these changes, so they must be re-applied afterwards.
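To preview the effect of the `sed` substitutions above before touching the live configuration, one can run them against a scratch copy first. A minimal sketch, assuming the illustrative three-line fragment below stands in for the relevant lines of `/etc/pulp/server.conf`:

```shell
# Apply the same substitutions to a scratch copy of the affected lines.
# The sample content is an illustrative fragment, not the real server.conf.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
url: ssl://localhost:5671
broker_url: qpid://localhost:5671
celery_require_ssl: true
EOF
sed -i 's!^url: ssl://localhost:5671!url: tcp://localhost:5672!g' "$tmp"
sed -i 's!^broker_url: qpid://localhost:5671!broker_url: qpid://localhost:5672!g' "$tmp"
sed -i 's/^celery_require_ssl: true/#celery_require_ssl: true/g' "$tmp"
result=$(cat "$tmp")
rm -f "$tmp"
echo "$result"
```

Once the scratch output shows the expected `tcp://localhost:5672` URLs and the commented-out `celery_require_ssl`, the same commands can be run against the real files.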
For more KB articles/solutions related to Red Hat Satellite 6.x Pulp 2.0 issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Pulp 2.0-related Issues.
Root Cause
There is a bug in python-qpid where concurrent sending of huge messages over SSL can cause either sender to get stuck forever.
Diagnostic Steps
The diagnostic steps below can be checked automatically by the attached test_resource-manager-stuck.sh script.
- Check the output of `qpid-stat --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 -q | grep resource` (or, in a sosreport, the file `sos_commands/pulp/qpid-stat_-q_--ssl-certificate_.etc.pki.pulp.qpid.client.crt_-b_amqps_..localhost_5671`) for pending messages in the Pulp task queues, like:
```
reserved_resource_worker-0@satellite.example.com.celery.pidbox Y        0    21    21       0  11.8k  11.8k     1     2
reserved_resource_worker-0@satellite.example.com.dq2           Y        0 3.09k 3.09k       0  13.3m  13.3m     1     2
reserved_resource_worker-1@satellite.example.com.celery.pidbox Y        0    21    21       0  11.8k  11.8k     1     2
reserved_resource_worker-1@satellite.example.com.dq2           Y        0 3.00k 3.00k       0  12.9m  12.9m     1     2
reserved_resource_worker-2@satellite.example.com.celery.pidbox Y        0    21    21       0  11.8k  11.8k     1     2
reserved_resource_worker-2@satellite.example.com.dq2           Y        0 3.31k 3.31k       0  12.7m  12.7m     1     2
reserved_resource_worker-3@satellite.example.com.celery.pidbox Y        0    21    21       0  11.8k  11.8k     1     2
reserved_resource_worker-3@satellite.example.com.dq2           Y        0 18.6k 18.6k       0  67.7m  67.7m     1     2
resource_manager                                               Y    7.50k 21.5k 14.0k   11.9m   107m  95.4m     1     2
resource_manager@satellite.example.com.celery.pidbox           Y       21    21     0   11.8k  11.8k      0     1     2
resource_manager@satellite.example.com.dq2                     Y        0     0     0       0      0      0     1     2
```
If:

- the second right-most column contains only ones,
- the `resource_manager` queue has many messages (7.50k here),
- but no worker queue holds any message,

then the resource_manager pair of processes has most probably become stuck.
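This stuck pattern can be spotted mechanically: flag the broker as suspicious when the `resource_manager` queue has a non-zero depth while all worker `.dq2` queues are empty. A minimal sketch over `qpid-stat`-style text, not the attached script; the sample lines mimic the output above, and the assumption is that the third whitespace-separated field is the current queued-message count:

```shell
# Flag the stuck pattern: resource_manager queue non-empty while
# all reserved_resource_worker .dq2 queues are empty.
# Assumption: field 3 is the queued-message count, as in the sample above.
verdict=$(awk '
  $1 == "resource_manager"               { rm_depth = $3 }
  $1 ~ /^reserved_resource_worker.*dq2$/ { worker_depth += $3 }
  END { print ((rm_depth + 0 > 0 && worker_depth + 0 == 0) ? "stuck" : "ok") }
' <<'EOF'
reserved_resource_worker-0@satellite.example.com.dq2 Y 0 3.09k 3.09k 0 13.3m 13.3m 1 2
reserved_resource_worker-1@satellite.example.com.dq2 Y 0 3.00k 3.00k 0 12.9m 12.9m 1 2
resource_manager Y 7.50k 21.5k 14.0k 11.9m 107m 95.4m 1 2
EOF
)
echo "$verdict"
```

Feeding the real `qpid-stat -q` output into the same `awk` filter (instead of the here-document) gives a quick yes/no answer before digging into MongoDB.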
- Further, check in `sos_commands/pulp/mongo-reserved_resources` (or in the output of `mongo pulp_database --eval "db.reserved_resources.find().pretty()"`) that it contains exactly one record, like:
```
MongoDB shell version v3.4.9
connecting to: mongodb://localhost:27017/pulp_database
MongoDB server version: 3.4.9
{
        "_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "worker_name" : "reserved_resource_worker-2@satellite.example.com",
        "resource_id" : "repository:064b3821-82b6-40f3-8139-8cf9a415f474",
        "_ns" : "reserved_resources"
}
```
Note, however, that we have also observed an instance of this problem with reserved_resources being empty.
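This check can also be scripted by counting the records in the reserved_resources output and confirming there is exactly one. A minimal sketch, where the here-document stands in for the real mongo output:

```shell
# Count reserved_resources records by counting top-level "_id" fields.
# The here-document stands in for the real mongo output.
count=$(grep -c '"_id"' <<'EOF'
{
        "_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "worker_name" : "reserved_resource_worker-2@satellite.example.com",
        "resource_id" : "repository:064b3821-82b6-40f3-8139-8cf9a415f474",
        "_ns" : "reserved_resources"
}
EOF
)
echo "$count"
```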
- The task with the given `_id` is usually listed in `sos_commands/pulp/pulp-running_tasks` (or in the output of `mongo pulp_database --eval "db.task_status.find({state:{\$ne: \"finished\"}}).pretty()"`); sometimes the command output is truncated and the task id can be missing. If it is present, the task is in the `waiting` state:
```
{
        "_id" : ObjectId("6058f77902c95c69450cec38"),
        "task_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "exception" : null,
        "task_type" : "pulp.server.managers.repo.sync.sync",
        "tags" : [
                "pulp:repository:064b3821-82b6-40f3-8139-8cf9a415f474",
                "pulp:action:sync"
        ],
        "finish_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "progress_report" : {
        },
        "worker_name" : null,
        "result" : null,
        "error" : null,
        "group_id" : null,
        "state" : "waiting",
        "start_time" : null
}
```
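When scripting this last check, the task id and state can be pulled out of the mongo output with standard text tools. A minimal sketch, where the sample record mimics a fragment of the output above:

```shell
# Extract task_id and state from a task_status record fragment.
# The sample record mimics a fragment of the mongo output above.
record='
        "task_id" : "45e5b2be-8837-4a13-9a27-2a09ef4eafda",
        "state" : "waiting",
'
task_id=$(printf '%s\n' "$record" | grep -o '"task_id" : "[^"]*"' | cut -d'"' -f4)
state=$(printf '%s\n' "$record" | grep -o '"state" : "[^"]*"' | cut -d'"' -f4)
echo "$task_id $state"
```

If the extracted state is `waiting` and the task id matches the `_id` seen in reserved_resources, all diagnostic conditions for this solution are met.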
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.