Capsule Sync Stuck Due to Network Issues


Environment

  • Red Hat Satellite
    • 6.12
    • 6.13
    • 6.14
    • 6.15
    • 6.16
    • 6.17

Issue

  • When the network between the Satellite and a Capsule is somewhat unreliable, a Capsule sync can get stuck at a random phase.

Resolution

  • To prevent this situation from happening, it is recommended to work with the network or firewall team to make the network between the Satellite and the Capsule more reliable and stable.

Workaround for Red Hat Satellite 6.12 to 6.14:


Please follow the steps below on the **Capsule**:
  1. Run the following SQL update statement to set some timeouts for the connections:

    # su - postgres -c "psql pulpcore -c \"update core_remote set connect_timeout = 90, sock_connect_timeout = 90, sock_read_timeout = 90, total_timeout = 300;\""
    
  2. To unblock already stuck Capsule Sync, restart pulpcore workers:

    # systemctl restart pulpcore-worker@*.service
    

Note:
For Satellite 6.12 and later, the timeouts above persist for future Capsule syncs, but newly created repositories won't have the timeouts set. Therefore, it is recommended to run the SQL update statement again before syncing new repositories to the Capsule.
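Since the SQL update statement has to be re-run before syncing newly created repositories, the two steps above can be combined into a small re-runnable helper script. This is only a convenience sketch, not part of the official workaround; the timeout values are the ones from step 1, and the script assumes it is run as root on the Capsule:

```shell
#!/bin/bash
# Hypothetical helper (not part of the official KB workaround): re-apply the
# Pulp remote timeouts on the Capsule, e.g. before syncing new repositories,
# and restart the workers. Run as root on the Capsule.

# Timeout values from the workaround above, in seconds.
CONNECT_TIMEOUT=90
SOCK_CONNECT_TIMEOUT=90
SOCK_READ_TIMEOUT=90
TOTAL_TIMEOUT=300

SQL="update core_remote set connect_timeout = ${CONNECT_TIMEOUT}, \
sock_connect_timeout = ${SOCK_CONNECT_TIMEOUT}, \
sock_read_timeout = ${SOCK_READ_TIMEOUT}, \
total_timeout = ${TOTAL_TIMEOUT};"

# Only act when actually run as root on a host that has PostgreSQL.
if command -v psql >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  # Apply the timeouts to every remote in the pulpcore database.
  su - postgres -c "psql pulpcore -c \"${SQL}\""
  # Restart the workers to unblock a sync that is already stuck.
  systemctl restart 'pulpcore-worker@*.service'
fi
```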

Workaround for Red Hat Satellite 6.15 to 6.17:

  1. On the Capsule server, apply the workaround for Red Hat Satellite 6.12 to 6.14 as described above.

  2. To ensure that the timeout settings persist for future Capsule syncs, run the following hammer commands on the Satellite:

    # hammer setting set --name sync_connect_timeout_v2 --value 90
    # hammer setting set --name sync_sock_connect_timeout --value 90
    # hammer setting set --name sync_sock_read_timeout --value 90
    # hammer setting set --name sync_total_timeout --value 300
    
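The four hammer commands above can also be driven from a single loop. This is a hypothetical convenience variant, not from the KB; the setting names and values are exactly the ones from step 2, and the script assumes hammer is configured on the Satellite:

```shell
#!/bin/bash
# Hypothetical one-pass variant (assumption, not from the KB): set all four
# sync timeout settings on the Satellite with a single loop.
declare -A TIMEOUTS=(
  [sync_connect_timeout_v2]=90
  [sync_sock_connect_timeout]=90
  [sync_sock_read_timeout]=90
  [sync_total_timeout]=300
)

for name in "${!TIMEOUTS[@]}"; do
  # Skip silently on hosts without hammer (e.g. when testing the script).
  if command -v hammer >/dev/null 2>&1; then
    # Each iteration is equivalent to one of the hammer commands above.
    hammer setting set --name "$name" --value "${TIMEOUTS[$name]}"
  fi
done
```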

For more KB articles and solutions related to Red Hat Satellite 6.x Capsule sync issues, refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Capsule Sync Issues.

Diagnostic Steps

  • The most definitive proof that the bug has been hit is taking coredumps of the pulpcore-worker processes on the Capsule and checking whether the child processes have the backtrace described in Jira SAT-21126.
    Note: A snapshot of the core_progressreport table is captured by sosreport in the sos_commands/pulpcore/core_progressreport file.

  • An optional but very useful check is to run the following PostgreSQL query repeatedly on the Capsule:

    # su - postgres -c "psql pulpcore -c \"SELECT pulp_created,pulp_last_updated,message,state,total,done,task_id FROM core_progressreport WHERE state = 'running' ORDER BY pulp_last_updated ASC;\""
    

    (Optionally, order by task_id instead, whichever reads better.)

    The output will look like:

             pulp_created          |       pulp_last_updated       |        message        |  state  | total | done |               task_id                
    -------------------------------+-------------------------------+-----------------------+---------+-------+------+--------------------------------------
     2024-06-05 09:15:57.978105+02 | 2024-06-05 09:15:57.978131+02 | Associating Content   | running |       |    0 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
     2024-06-05 09:17:09.655129+02 | 2024-06-05 09:17:09.655152+02 | Associating Content   | running |       |    0 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
     2024-06-05 09:17:23.461927+02 | 2024-06-05 09:17:45.145516+02 | Parsed Packages       | running |  6430 | 2504 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
     2024-06-05 09:16:12.13271+02  | 2024-06-05 09:17:45.162221+02 | Parsed Packages       | running | 17599 | 9496 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
     2024-06-05 09:15:57.937152+02 | 2024-06-05 09:17:45.166293+02 | Downloading Artifacts | running |       |    0 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
     2024-06-05 09:17:09.640214+02 | 2024-06-05 09:17:47.397037+02 | Downloading Artifacts | running |       |    0 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
    (6 rows)
    
  • Run the same command again after a minute. If the output is identical, in particular if pulp_last_updated has not changed and the total/done counts are not moving, then the bug has most probably been hit.
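The "run it twice and compare" check can be scripted. The following is only a sketch, not from the KB: it reuses the exact query above, takes two snapshots 60 seconds apart, and reports whether anything changed. It assumes it is run as root on the Capsule:

```shell
#!/bin/bash
# Hypothetical stuck-sync probe (assumption, not from the KB): snapshot the
# running progress reports twice, a minute apart, and compare the output.
QUERY="SELECT pulp_created,pulp_last_updated,message,state,total,done,task_id \
FROM core_progressreport WHERE state = 'running' ORDER BY pulp_last_updated ASC;"

# Capture one snapshot of the running progress reports.
snapshot() {
  su - postgres -c "psql pulpcore -c \"${QUERY}\""
}

# Only act when actually run as root on a host that has PostgreSQL.
if command -v psql >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  first="$(snapshot)"
  sleep 60
  second="$(snapshot)"
  if [ "$first" = "$second" ]; then
    echo "No change in 60s: the Capsule sync may be stuck"
  else
    echo "Progress reports are moving"
  fi
fi
```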

  • Indirect (and not always present) evidence of the bug is log messages like these in /var/log/messages (or in the journal):

    Nov 20 07:04:42 capsule pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [None])
    Nov 20 07:04:42 capsule pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [None])
    Nov 20 07:06:53 capsule pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [Connect call failed ('192.168.x.x', 443)])
    
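One hedged way to look for these backoff messages on the Capsule is to grep the journal for the fixed part of the message. This is an assumption-based convenience, not from the KB, and it simply searches for the "Backing off download_wrapper" string shown in the log excerpts above:

```shell
#!/bin/bash
# Hypothetical check (assumption, not from the KB): search the journal on the
# Capsule for the backoff messages shown above.
PATTERN='Backing off download_wrapper'

if command -v journalctl >/dev/null 2>&1; then
  # Print matching lines, or a note when nothing is found.
  journalctl --no-pager | grep -F "$PATTERN" || echo "no backoff messages found"
fi
```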

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.