[Satellite6] Various tasks are pending on "waiting for Pulp to start the task"
Environment
- Red Hat Satellite 6.1 or newer
Issue
- various tasks are stuck, including:
- repository sync from CDN
- capsule sync
- uploading content to a custom repository
- applying an erratum or installing a package from Satellite to a content host
- all such tasks are pending on a Dynflow subtask with
"waiting for Pulp to start the task"
Resolution
The most probable cause is that the Pulp Celery resource_manager process got stuck. See Diagnostic Steps first to confirm this theory. If confirmed, this issue is tracked in several Bugzilla reports (links not included here), with a workaround of restarting the pulp services.
Before applying the workaround, to help Red Hat investigate the root cause, please collect the data below and provide it to Red Hat support:
- upload the two attachments katello-debug.sh and qpid-core-dump.sh to the Satellite, and then run:
mv katello-debug.sh /usr/share/foreman/script/foreman-debug.d/katello-debug.sh #overwrite the file to get enhanced debugging
chmod a+x /usr/share/foreman/script/foreman-debug.d/katello-debug.sh qpid-core-dump.sh
./qpid-core-dump.sh
foreman-debug
- provide the two resulting archives, from both
qpid-core-dump.sh and foreman-debug
The workaround itself: restart the pulp services:
for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done
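After restarting the services, one quick way to confirm the queue is draining is to watch the resource_manager queue depth in the qpid-stat output shown in Diagnostic Steps below. A minimal sketch, assuming a hypothetical parse_depth helper (not part of any shipped Red Hat tooling); the sample line mirrors the example output in Diagnostic Steps:

```shell
# Hedged sketch: extract the resource_manager queue depth from
# `qpid-stat -q` output. parse_depth is an illustrative helper,
# not part of any Red Hat or Pulp tooling.
parse_depth() {
    # The depth is the first numeric column after the Durable flag (Y)
    awk '$1 == "resource_manager" { print $3 }'
}

# Sample line modeled on the Diagnostic Steps output
sample='  resource_manager  Y  489  16.8k  16.4k  23.6m  1.04g  1.02g  1  2'
echo "$sample" | parse_depth   # -> 489

# Against a live broker (run twice, a minute or so apart; a shrinking
# depth means the workers are consuming again):
#   qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt \
#       -b amqps://localhost:5671 | parse_depth
```

If the depth keeps growing after the restart, collect the debug data described above and contact Red Hat support.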
Potential resolution
Upgrade to python-qpid-0.30-11, available via this erratum, and restart the pulp services to apply the change. The erratum is expected to fix several deadlocks (some were already fixed in previous errata).
For more KB articles/solutions related to Red Hat Satellite 6.x Pulp 2.0 Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Pulp 2.0-related Issues
Root Cause
The Pulp Celery resource_manager process, which is responsible for dispatching Pulp jobs among the worker processes, appears to be stuck. As a result, incoming requests to Pulp are not passed on to the pulp workers.
An alternative scenario is that all workers are stuck in the same or a similar way, so the resource_manager cannot dispatch its work to any responding worker.
Diagnostic Steps
-
stuck tasks are pending on a sub-task with the
"waiting for Pulp to start the task" string
-
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671
shows the resource_manager queue depth is non-trivial (say, bigger than 10 - see the left-most number, 489 in the example below):
resource_manager Y 489 16.8k 16.4k 23.6m 1.04g 1.02g 1 2
..
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 resource_manager
shows the resource_manager queue has a subscription / consumer that has acquired a non-trivial number of messages (but fewer than the queue depth) - see the last column, in our case 333:
1 resource_manager qpid.1.2.3.4:5671-1.2.3.4:46742 __main__.py 29643 Y CREDIT 16,689 333
-
note, however, that the above symptoms are not conclusive proof of pulp tasks being stuck - there is no simple check for that. But every time pulp tasks are stuck, the above symptoms are present
-
a slightly better check is:
# qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 resource_manager
Properties:
Name Durable AutoDelete Exclusive FlowStopped FlowStoppedCount Consumers Bindings
======================================================================================================
resource_manager Y N N N 0 1 2
Optional Properties:
Property Value
============================================================================
arguments {u'passive': False, u'exclusive': False, u'arguments': None}
alt-exchange
Statistics:
Statistic Messages Bytes
==============================================
queue-depth 0 0
total-enqueues 21 33,860
total-dequeues 21 33,860
persistent-enqueues 0 0
persistent-dequeues 0 0
transactional-enqueues 0 0
transactional-dequeues 0 0
flow-to-disk-depth 0 0
flow-to-disk-enqueues 0 0
flow-to-disk-dequeues 0 0
acquires 21
releases 0
discards-ttl-expired 0
discards-limit-overflow 0
discards-ring-overflow 0
discards-lvq-replace 0
discards-subscriber-reject 0
discards-purged 0
reroutes 0
#
- if
acquires is smaller than total-enqueues, and acquires is not increasing over time (while total-enqueues is), you have most probably hit the deadlock / stuck pulp tasks.
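This check can be automated by sampling the statistics twice, some minutes apart. A minimal sketch, assuming hypothetical get_stat and stuck_check helpers (illustrative names, not part of any shipped tooling):

```shell
# Hedged sketch: detect the deadlock signature from two snapshots of
# `qpid-stat -q ... resource_manager` statistics. Both helper names
# are illustrative, not part of any Red Hat or Pulp tooling.

get_stat() {
    # Pull one numeric statistic (e.g. total-enqueues, acquires) from
    # the Statistics block; strip thousands separators like "33,860".
    awk -v key="$1" '$1 == key { gsub(",", "", $2); print $2 }'
}

stuck_check() {
    # Args: enqueues_t0 acquires_t0 enqueues_t1 acquires_t1
    # Deadlock signature: acquires < total-enqueues, acquires flat,
    # total-enqueues still growing between the two samples.
    if [ "$4" -lt "$3" ] && [ "$4" -eq "$2" ] && [ "$3" -gt "$1" ]; then
        echo "likely stuck"
    else
        echo "no deadlock signature"
    fi
}

# Live usage (two samples, a few minutes apart):
#   out0=$(qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt \
#              -b amqps://localhost:5671 resource_manager)
#   e0=$(echo "$out0" | get_stat total-enqueues)
#   a0=$(echo "$out0" | get_stat acquires)
#   ...sleep, sample e1/a1 the same way, then:
#   stuck_check "$e0" "$a0" "$e1" "$a1"

# Example with synthetic numbers: enqueues grew, acquires stayed flat
stuck_check 16000 333 16689 333   # -> likely stuck
```

A healthy broker looks like the example output above: acquires equals total-enqueues and both keep moving together.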
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.