Capsule Sync Stuck Due to Network Issues
Environment
- Red Hat Satellite
- 6.12
- 6.13
- 6.14
- 6.15
- 6.16
- 6.17
Issue
- When the network between Satellite and Capsule is unreliable, a Capsule sync gets stuck at a random phase.
Resolution
-
This issue was reported in Jira SAT-21126 and is fixed in the errata RHSA-2025:19721 for Red Hat Satellite 6.18.0.
-
Upgrade the Red Hat Satellite server to version 6.18.z to fix the reported issue.
- To prevent this situation from happening, work with your network or firewall team to make the network between Satellite and Capsule more reliable.
Workaround for Red Hat Satellite 6.12 to 6.14:
Please follow the steps below on the **Capsule**:
-
Run the following SQL update statement to set some timeouts for the connections:
# su - postgres -c "psql pulpcore -c \"update core_remote set connect_timeout = 90, sock_connect_timeout = 90, sock_read_timeout = 90, total_timeout = 300;\""
-
To unblock an already stuck Capsule sync, restart the pulpcore workers:
# systemctl restart pulpcore-worker@*.service
Note:
For Satellite 6.12+, the timeouts above should persist for future Capsule syncs, but repositories added later will not have the timeouts set. Therefore, run the SQL update statement again before syncing new repositories to the Capsule.
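Because the SQL update has to be repeated whenever new repositories are added, the two Capsule steps above can be wrapped in a small helper. The following is only a sketch: the function names and the DRY_RUN variable are hypothetical and not part of Satellite, and the timeout values are simply the ones used in the workaround. By default nothing is executed; the commands are only printed for review.

```shell
# Sketch only: re-apply the Pulp remote timeouts on the Capsule and
# restart the workers. Function names and DRY_RUN are hypothetical.

CONNECT_TIMEOUT=90        # seconds, values taken from the workaround
SOCK_CONNECT_TIMEOUT=90
SOCK_READ_TIMEOUT=90
TOTAL_TIMEOUT=300

# Build the same UPDATE statement that the workaround runs by hand.
build_timeout_sql() {
    printf 'update core_remote set connect_timeout = %s, sock_connect_timeout = %s, sock_read_timeout = %s, total_timeout = %s;' \
        "$CONNECT_TIMEOUT" "$SOCK_CONNECT_TIMEOUT" "$SOCK_READ_TIMEOUT" "$TOTAL_TIMEOUT"
}

# In dry-run mode (the default) print the command, otherwise execute it.
# The printed form is for review only; it does not preserve shell quoting.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'DRY-RUN: %s\n' "$*"
    else
        "$@"
    fi
}

# Apply the timeouts, and restart the workers only if the update succeeded.
apply_timeouts() {
    run su - postgres -c "psql pulpcore -c \"$(build_timeout_sql)\"" &&
        run systemctl restart 'pulpcore-worker@*.service'
}
```

After sourcing the file as root on the Capsule, calling apply_timeouts prints the two commands, and DRY_RUN=0 apply_timeouts runs them. The worker restart interrupts running syncs, so it is best done when a sync is already stuck or none is running.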
Workaround for Red Hat Satellite 6.15 to 6.17:
-
On the Capsule server, apply the workaround for Red Hat Satellite 6.12 to 6.14 as described above.
-
To ensure that the timeout settings persist for the future Capsule syncs, please run the following hammer commands on the Satellite:
# hammer setting set --name sync_connect_timeout_v2 --value 90
# hammer setting set --name sync_sock_connect_timeout --value 90
# hammer setting set --name sync_sock_read_timeout --value 90
# hammer setting set --name sync_total_timeout --value 300
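The four hammer calls differ only in the setting name and value, so they can be generated from a short list. This is a minimal sketch, assuming the setting names and values listed above; the function names are hypothetical. It only prints the commands so they can be reviewed before being run on the Satellite.

```shell
# Sketch only: generate the hammer commands from a name/value list.
# Setting names and values are the ones given in the workaround;
# the function names are hypothetical.

timeout_settings() {
    cat <<'EOF'
sync_connect_timeout_v2 90
sync_sock_connect_timeout 90
sync_sock_read_timeout 90
sync_total_timeout 300
EOF
}

# Print one "hammer setting set" command per setting.
print_hammer_commands() {
    timeout_settings | while read -r name value; do
        printf 'hammer setting set --name %s --value %s\n' "$name" "$value"
    done
}

print_hammer_commands
```

Once the printed commands look right, the output can be piped to sh on the Satellite to apply them.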
For more KB articles/solutions related to Red Hat Satellite 6.x Capsule Sync Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Capsule Sync Issues
Diagnostic Steps
-
The most definitive proof that the bug has been hit is to take coredumps of the pulpcore-worker processes on the Capsule and check whether the child processes have the backtrace described in Jira SAT-21126.
Note: A snapshot of the core_progressreport table is captured by sosreport in the sos_commands/pulpcore/core_progressreport file.
-
An optional but very useful check is to run the following PostgreSQL query repeatedly on the Capsule:
# su - postgres -c "psql pulpcore -c \"SELECT pulp_created,pulp_last_updated,message,state,total,done,task_id FROM core_progressreport WHERE state = 'running' ORDER BY pulp_last_updated ASC;\""
(Optionally order by task_id instead, whichever is easier to read.) The output will look like:
         pulp_created          |       pulp_last_updated       |        message        |  state  | total | done |               task_id
-------------------------------+-------------------------------+-----------------------+---------+-------+------+--------------------------------------
 2024-06-05 09:15:57.978105+02 | 2024-06-05 09:15:57.978131+02 | Associating Content   | running |       |    0 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
 2024-06-05 09:17:09.655129+02 | 2024-06-05 09:17:09.655152+02 | Associating Content   | running |       |    0 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
 2024-06-05 09:17:23.461927+02 | 2024-06-05 09:17:45.145516+02 | Parsed Packages       | running |  6430 | 2504 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
 2024-06-05 09:16:12.13271+02  | 2024-06-05 09:17:45.162221+02 | Parsed Packages       | running | 17599 | 9496 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
 2024-06-05 09:15:57.937152+02 | 2024-06-05 09:17:45.166293+02 | Downloading Artifacts | running |       |    0 | 0fc79718-8d49-46e7-821a-2e6382e71b4b
 2024-06-05 09:17:09.640214+02 | 2024-06-05 09:17:47.397037+02 | Downloading Artifacts | running |       |    0 | 93f8ddb3-d9ec-41c9-9a3e-c82cf2068f7e
(6 rows)
-
Run the same command again after a minute. If the output is the same, especially if pulp_last_updated has not changed and the total/done counts are not moving, then the bug has most probably been hit.
-
Indirect (and not always present) evidence of the bug is log entries like these in /var/log/messages (or in the journal):
Nov 20 07:04:42 capsule pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [None])
Nov 20 07:04:42 capsule pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [None])
Nov 20 07:06:53 capsule pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host satellite.example.com:443 ssl:default [Connect call failed ('192.168.x.x', 443)])
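The repeated-query check and the log check above can be sketched as small shell helpers for the Capsule. The function names and temporary file paths are hypothetical. The query is a variant of the one shown above: it drops pulp_created, orders by task_id and message so that two samples line up, and uses the psql options -A and -t (unaligned, tuples only) so the output can be compared byte for byte. An unchanged snapshot is a strong hint, not definitive proof, that the sync is stuck.

```shell
# Sketch only: helpers for the diagnostic checks. Names are hypothetical.

# Variant of the progress-report query, ordered so two samples line up.
PROGRESS_SQL="SELECT pulp_last_updated,message,state,total,done,task_id FROM core_progressreport WHERE state = 'running' ORDER BY task_id, message;"

# Print the current running progress reports, one row per line.
take_snapshot() {
    su - postgres -c "psql pulpcore -At -c \"$PROGRESS_SQL\""
}

# Succeed (exit 0) when the first snapshot is non-empty and both snapshots
# are identical, i.e. no running progress report changed between samples.
progress_stalled() {
    [ -s "$1" ] && cmp -s "$1" "$2"
}

# Count log lines that show download back-off caused by connection errors.
count_backoff_errors() {
    grep -c 'Backing off download_wrapper.*ClientConnectorError' "$1" || true
}

# Example usage as root on the Capsule:
#   take_snapshot > /tmp/progress.1; sleep 60; take_snapshot > /tmp/progress.2
#   progress_stalled /tmp/progress.1 /tmp/progress.2 && echo "no progress in 60s"
#   count_backoff_errors /var/log/messages
```

An empty first snapshot means no task is in the running state, so progress_stalled deliberately reports failure in that case instead of treating two empty samples as a stuck sync.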