[Satellite 6] There are 0 pulp_celerybeat processes running

Solution Verified - Updated

Environment

  • Satellite 6.3

Issue

Soon after running katello-service restart or foreman-maintain service restart, the pulp services are reported with status FAIL:

# hammer ping
candlepin:      
    Status:          ok
    Server Response: Duration: 14ms
candlepin_auth: 
    Status:          ok
    Server Response: Duration: 31ms
pulp:           
    Status:          FAIL
    Server Response:
pulp_auth:      
    Status: FAIL
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 16ms
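
To spot the failing services without reading the whole output, the status lines can be filtered. This is a minimal sketch, assuming the hammer ping output layout shown above (a service name line followed by an indented Status line). The function name failed_services is hypothetical.

```shell
# Print the name of every service whose Status line reads FAIL.
# Reads `hammer ping` output on stdin; assumes the layout shown above.
failed_services() {
    awk '
        # A service header: starts in column one and ends with a colon.
        /^[^ \t#].*:[ \t]*$/ { svc = $1; sub(/:$/, "", svc); next }
        # An indented status line belonging to the last header seen.
        $1 == "Status:" && $2 == "FAIL" { print svc }
    '
}

# Usage on a Satellite server:
#   hammer ping | failed_services
```

For the output above, this would list pulp and pulp_auth.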

Resolution

IMPORTANT: Use these resolution steps only when the following message appears in the logs while the pulp processes are actually running, and restarting the pulp services did not help.

There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

WARNING: This procedure removes the data about all unfinished tasks. For example, if a repository sync or an errata installation on a client is in progress when the queues are purged, the task details will be lost.

  • Purge the content of all qpid queues by following the steps in [Satellite6] How to purge a qpid queue content? After adding the certificates as described there, a simple for loop can remove the messages from all qpid queues (note that the path to the drain binary may differ, depending on your python-qpid version):
 # katello-service stop --exclude=qpidd
 # for i in $(qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 | tail -n +4 | awk '{ print $1 }'); do /usr/share/doc/python-qpid-1.35.0/examples/api/drain -b "amqps://localhost:5671" "$i; { node:{type:queue}}"; done
 # katello-service start --exclude=qpidd
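
Before draining, it can help to preview which queues the loop would touch. The following is a dry-run sketch of the loop above: it applies the same tail/awk extraction to qpid-stat -q output and prints the drain commands instead of running them. The function names and the DRAIN variable are illustrative, and the drain path is the one quoted above, which may differ on your system.

```shell
# Path to the drain example script; taken from the step above and
# dependent on the installed python-qpid version.
DRAIN=${DRAIN:-/usr/share/doc/python-qpid-1.35.0/examples/api/drain}

# Extract queue names from `qpid-stat -q` output on stdin, using the same
# header skip (first three lines) and first-column selection as the loop above.
queue_names() {
    tail -n +4 | awk '{ print $1 }'
}

# Print, without executing, the drain command for each queue name on stdin.
print_drain_commands() {
    while read -r queue; do
        [ -n "$queue" ] || continue
        printf '%s -b "amqps://localhost:5671" "%s; { node:{type:queue}}"\n' "$DRAIN" "$queue"
    done
}

# Usage on a Satellite server (prints commands only):
#   qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
#       -b amqps://localhost:5671 | queue_names | print_drain_commands
```

Reviewing the printed commands first makes it easier to confirm that the header skip matches the qpid-stat output on your system before any messages are removed.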

For more KB articles/solutions related to Red Hat Satellite 6.x Pulp 2.0 Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Pulp 2.0-related Issues

Root Cause

The scheduler thread got stuck in the selector code inside the qpid Python package.
It appears to hang when trying to interact with qpid.

Diagnostic Steps

# hammer ping
  • Check whether all workers, including the scheduler worker, are reported by the Pulp API
# curl -v https://$(hostname -f)/pulp/api/v2/status/ | json_reformat
# journalctl -u pulp_celerybeat.service 
satellite.example.com pulp[30277]: pulp.server.async.scheduler:DEBUG: Checking if pulp_workers, pulp_celerybeat, or pulp_resource_manager processes are missing for more than 240 seconds
satellite.example.com pulp[30277]: pulp.server.async.scheduler:ERROR: Worker 'scheduler@satellite.example.com' has gone missing, removing from list of workers
satellite.example.com pulp[30277]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.
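
The Pulp API check and the journal check above can be reduced to simple text matches. This is a minimal sketch, assuming the scheduler worker appears in the status response under a name beginning with scheduler@ (as in the log lines above) and that the error text matches the message quoted in this article. The function names are hypothetical.

```shell
# Succeeds if the text on stdin mentions a scheduler worker (scheduler@<host>).
has_scheduler_worker() {
    grep -q 'scheduler@'
}

# Count the lines on stdin reporting that no pulp_celerybeat is running.
count_celerybeat_errors() {
    grep -c 'There are 0 pulp_celerybeat processes running'
}

# Usage on a Satellite server:
#   curl -s https://$(hostname -f)/pulp/api/v2/status/ | has_scheduler_worker \
#       || echo "scheduler worker not reported by the Pulp API"
#   journalctl -u pulp_celerybeat.service | count_celerybeat_errors
```

A missing scheduler worker together with a non-zero error count matches the symptoms described in this article.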

As a last resort, if the configuration is correct and the pulp processes are running, collect core dumps from the celery worker and celery beat processes:

# for pid in $(ps -awfux | grep '[c]elery' | awk '{ print $2 }'); do gcore $pid; done

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.