[Satellite6] Various tasks are pending on "waiting for Pulp to start the task"
Environment
- Red Hat Satellite 6.1 or newer
Issue
- various tasks are stuck, including:
- repository sync from CDN
- capsule sync
- uploading content to a custom repository
- applying an erratum or installing a package from Satellite to a content host
- all such tasks are pending on a Dynflow subtask with
"waiting for Pulp to start the task"
Resolution
The most probable cause is that the Pulp Celery resource_manager process got stuck. See Diagnostic Steps first to confirm this theory. If confirmed, this issue is tracked in several Bugzilla reports (links not included here), with a workaround of restarting the pulp services.
Before applying the workaround, to help Red Hat investigate the root cause, please collect the data below and provide it to Red Hat support:
- upload the two attachments katello-debug.sh and qpid-core-dump.sh to the Satellite, and then run:
mv katello-debug.sh /usr/share/foreman/script/foreman-debug.d/katello-debug.sh #overwrite the file to get enhanced debugging
chmod a+x /usr/share/foreman/script/foreman-debug.d/katello-debug.sh qpid-core-dump.sh
./qpid-core-dump.sh
foreman-debug
- provide the two resulting archives, from both
qpid-core-dump.sh and foreman-debug
The workaround itself: restart the pulp services:
for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done
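After restarting the services, one quick way to confirm the queue is draining is to watch the resource_manager queue depth in the qpid-stat output shown in Diagnostic Steps below. A minimal sketch, assuming a hypothetical parse_depth helper (not part of any shipped Red Hat tooling); the sample line mirrors the example output in Diagnostic Steps:

```shell
# Hedged sketch: extract the resource_manager queue depth from
# `qpid-stat -q` output. parse_depth is an illustrative helper,
# not part of any Red Hat or Pulp tooling.
parse_depth() {
    # The depth is the first numeric column after the Durable flag (Y)
    awk '$1 == "resource_manager" { print $3 }'
}

# Sample line modeled on the Diagnostic Steps output
sample='  resource_manager  Y  489  16.8k  16.4k  23.6m  1.04g  1.02g  1  2'
echo "$sample" | parse_depth   # -> 489

# Against a live broker (run twice, a minute or so apart; a shrinking
# depth means the workers are consuming again):
#   qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt \
#       -b amqps://localhost:5671 | parse_depth
```

If the depth keeps growing after the restart, collect the debug data described above and contact Red Hat support.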
Potential resolution
Upgrade to python-qpid-0.30-11, available via this erratum, and restart the pulp services to apply the change. The erratum is expected to fix several deadlocks (some were already fixed in previous errata).
For more KB articles/solutions related to Red Hat Satellite 6.x Pulp 2.0 Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Pulp 2.0-related Issues
Root Cause
The Pulp Celery resource_manager process, which is responsible for dispatching Pulp jobs among the worker processes, appears to be stuck. As a result, incoming requests to Pulp are not passed on to the pulp workers.
An alternative scenario is that all workers are stuck in the same or a similar way, so the resource_manager cannot dispatch its work to any responding worker.
Diagnostic Steps
-
stuck tasks are pending on a sub-task with the
"waiting for Pulp to start the task" string
-
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671
shows the resource_manager queue depth is non-trivial (say, bigger than 10 - see the left-most number, 489 in the example below):
resource_manager Y 489 16.8k 16.4k 23.6m 1.04g 1.02g 1 2
..
qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 resource_manager
shows the resource_manager queue has a subscription / consumer that has acquired a non-trivial number of messages (but fewer than the queue depth) - see the last column, in our case 333:
1 resource_manager qpid.1.2.3.4:5671-1.2.3.4:46742 __main__.py 29643 Y CREDIT 16,689 333
-
note, however, that the above symptoms are not conclusive proof of pulp tasks being stuck - there is no simple check for that. But every time pulp tasks are stuck, the above symptoms are present
-
a slightly better check is:
# qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 resource_manager
Properties:
Name Durable AutoDelete Exclusive FlowStopped FlowStoppedCount Consumers Bindings
======================================================================================================
resource_manager Y N N N 0 1 2
Optional Properties:
Property Value
============================================================================
arguments {u'passive': False, u'exclusive': False, u'arguments': None}
alt-exchange
Statistics:
Statistic Messages Bytes
==============================================
queue-depth 0 0
total-enqueues 21 33,860
total-dequeues 21 33,860
persistent-enqueues 0 0
persistent-dequeues 0 0
transactional-enqueues 0 0
transactional-dequeues 0 0
flow-to-disk-depth 0 0
flow-to-disk-enqueues 0 0
flow-to-disk-dequeues 0 0
acquires 21
releases 0
discards-ttl-expired 0
discards-limit-overflow 0
discards-ring-overflow 0
discards-lvq-replace 0
discards-subscriber-reject 0
discards-purged 0
reroutes 0
#
- if
acquires is smaller than total-enqueues, and acquires is not increasing over time (while total-enqueues is), you have most probably hit the deadlock / stuck pulp tasks.
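This check can be automated by sampling the statistics twice, some minutes apart. A minimal sketch, assuming hypothetical get_stat and stuck_check helpers (illustrative names, not part of any shipped tooling):

```shell
# Hedged sketch: detect the deadlock signature from two snapshots of
# `qpid-stat -q ... resource_manager` statistics. Both helper names
# are illustrative, not part of any Red Hat or Pulp tooling.

get_stat() {
    # Pull one numeric statistic (e.g. total-enqueues, acquires) from
    # the Statistics block; strip thousands separators like "33,860".
    awk -v key="$1" '$1 == key { gsub(",", "", $2); print $2 }'
}

stuck_check() {
    # Args: enqueues_t0 acquires_t0 enqueues_t1 acquires_t1
    # Deadlock signature: acquires < total-enqueues, acquires flat,
    # total-enqueues still growing between the two samples.
    if [ "$4" -lt "$3" ] && [ "$4" -eq "$2" ] && [ "$3" -gt "$1" ]; then
        echo "likely stuck"
    else
        echo "no deadlock signature"
    fi
}

# Live usage (two samples, a few minutes apart):
#   out0=$(qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt \
#              -b amqps://localhost:5671 resource_manager)
#   e0=$(echo "$out0" | get_stat total-enqueues)
#   a0=$(echo "$out0" | get_stat acquires)
#   ...sleep, sample e1/a1 the same way, then:
#   stuck_check "$e0" "$a0" "$e1" "$a1"

# Example with synthetic numbers: enqueues grew, acquires stayed flat
stuck_check 16000 333 16689 333   # -> likely stuck
```

A healthy broker looks like the example output above: acquires equals total-enqueues and both keep moving together.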
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.