Satellite Unresponsive when Candlepin Filesystem Hits Max Disk Usage, Artemis Queue Blocked after Upgrade to Red Hat Satellite 6
Environment
- Red Hat Satellite 6.8+
Issue
- Clients are timing out when communicating with or registering to the Satellite server
- Httpd processes have increased significantly on the Satellite server
- Satellite was recently upgraded to 6.8. The upgrade went smoothly and everything was working well, but after some time all services started timing out. Restarting the Satellite services restores access briefly (login works and hammer commands succeed), but the server becomes unresponsive again after a few minutes.
- The httpd service is intermittently failing:

    Dec 15 09:59:38 sat6 systemd: httpd.service stop-sigterm timed out. Killing.
    Dec 15 09:59:38 sat6 systemd: httpd.service: main process exited, code=killed, status=9/KILL
    Dec 15 09:59:38 sat6 systemd: Stopped The Apache HTTP Server.
- Candlepin stops working with "Connection reset by peer - SSL_connect" errors, and Satellite's WebUI stops working. The following errors are logged in /var/log/foreman/production.log:

    2020-11-11T10:41:33 [I|app|7b25ab57] Started GET "/rhsm/status/" for IP_ADDR at 2020-11-11 10:41:33 -0600
    2020-11-11T10:41:33 [I|app|7b25ab57] Processing by Katello::Api::Rhsm::CandlepinProxiesController#server_status as JSON
    2020-11-11T10:41:33 [D|kat|7b25ab57] Resource GET request: /candlepin/status
    2020-11-11T10:41:33 [D|kat|7b25ab57] Headers: {}
    2020-11-11T10:41:33 [D|kat|7b25ab57] Body: {}
    2020-11-11T10:41:33 [D|app|7b25ab57] RestClient.get "https://localhost:8443/candlepin/status", "Accept"=>"*/*", "Accept-Encoding"=>"gzip, deflate", "Authorization"=>"OAuth oauth_consumer_key=\"katello\", oauth_nonce=\"LWdta8OWW8w9kzAQi1CrbBiliDehntk5RwqCN5I0I\", oauth_signature=\"dclvfvAG%2Fbw1qEwhixs7VEf5j2s%3D\", oauth_signature_method=\"HMAC-SHA1\", oauth_timestamp=\"1605112893\", oauth_version=\"1.0\"", "User-Agent"=>"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157"
    2020-11-11T11:18:49 [E|kat|7b25ab57] Errno::ECONNRESET: Connection reset by peer - SSL_connect
    2020-11-11T11:18:50 [I|app|7b25ab57] Completed 500 Internal Server Error in 2236198ms (Views: 41.1ms | ActiveRecord: 3.3ms | Allocations: 50666)
    2020-11-11T11:18:50 [D|app|7b25ab57] With body: {"displayMessage":"Connection reset by peer - SSL_connect","errors":["Connection reset by peer - SSL_connect"]}
Resolution
This issue is reported in Bug 1897344 and Bug 1898605, and has been fixed in Red Hat Satellite 6.9.4+ via Bugzilla 1973362.
Solution 1:
- Please upgrade to the latest minor version of Red Hat Satellite 6.9 or above to have the permanent fix applied.
Solution 2 (if upgrading is not an option):
- Increase the disk space for the disk that the /var/lib/candlepin filesystem resides on. If there is a leftover disk on the system, users can extend the current logical volume which contains the /var/lib/candlepin filesystem. Otherwise, an additional virtual/physical disk will need to be added to extend the filesystem. Before extending the filesystem, be sure to stop the Satellite services with this command:

    # foreman-maintain service stop
- And restart the services after the logical volume and filesystem have been extended:

    # foreman-maintain service start
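If free extents or a spare disk are available, the extension itself is ordinary LVM work. The sketch below only prints the commands so they can be reviewed before anything is run; the volume group, logical volume, and size are hypothetical examples, not values from this article (confirm the real names with vgs, lvs, and df):

```shell
# Hypothetical names -- confirm with `vgs`, `lvs` and `df /var/lib/candlepin`
# before running anything on a real Satellite.
VG=vgroot    # volume group holding the Candlepin filesystem (assumption)
LV=varlv     # logical volume mounted at /var (assumption)
EXTRA=50G    # how much space to add (assumption)

# Stop Satellite services first:  foreman-maintain service stop
# The commands below are echoed, not executed, so the sketch is safe to run as-is:
echo "lvextend -L +${EXTRA} /dev/${VG}/${LV}"
echo "xfs_growfs /var"    # XFS shown; an ext4 filesystem would use resize2fs instead
# Restart services afterwards:    foreman-maintain service start
```

Once the printed commands have been checked against the actual volume layout, they can be run directly.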
- If unable to increase the disk space, as a temporary and immediate workaround, users can raise the max-disk-usage value from its default of 90 to 99 by editing the /etc/candlepin/broker.xml file on the Satellite. Locate the following lines:

    <large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
    <paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>
    <addresses>
      <address name="event.default">
        <multicast>
- And insert the following line to set the max-disk-usage variable:

    <large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
    <paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>
    <max-disk-usage>99</max-disk-usage>    <!-- insert this line -->
    <addresses>
      <address name="event.default">
        <multicast>
- Then restart the tomcat service for the change to take effect:

    # systemctl restart tomcat
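The broker.xml edit can also be scripted rather than done by hand. This is a minimal sketch: it keeps a backup, then inserts <max-disk-usage>99</max-disk-usage> after the <paging-directory> line. For demonstration it builds a throwaway copy in /tmp containing the lines the article shows; on a real Satellite, point BROKER at /etc/candlepin/broker.xml and skip the file-creation step:

```shell
# On a real Satellite:  BROKER=/etc/candlepin/broker.xml  (and skip the cat below).
BROKER=${BROKER:-/tmp/broker.xml}

# Throwaway file with the lines the article shows, for demonstration only:
cat > "$BROKER" <<'EOF'
<large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
<paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>
<addresses>
EOF

cp -p "$BROKER" "$BROKER.bak"   # always keep a backup of the original

# Insert the element after <paging-directory>, only if it is not already present:
grep -q '<max-disk-usage>' "$BROKER" || \
    sed -i '/<paging-directory>/a <max-disk-usage>99</max-disk-usage>' "$BROKER"

grep '<max-disk-usage>' "$BROKER"   # verify the insertion before restarting tomcat
```

The grep guard makes the insertion idempotent, so running the script twice does not duplicate the element.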
Please note, this only delays the issue until disk usage reaches 99%; the ultimate resolution is to increase the disk space. The workaround is most effective when the disk is large (e.g. 1 TB), where the extra 9% amounts to roughly 90 GB.
For more KB articles/solutions related to Red Hat Satellite 6.x Candlepin Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Candlepin Issues.
Root Cause
- The disk containing the /var/lib/candlepin filesystem has reached or surpassed the 90% filesystem usage limit. Once this happens, Artemis blocks the event.default address, which makes client registration fail and other usual Satellite communications stop responding normally.
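The blocking condition amounts to a simple comparison: a filesystem's usage percentage from df against Artemis' max-disk-usage default of 90. A minimal sketch of that check (the check_usage helper is illustrative, not part of Satellite):

```shell
THRESHOLD=90    # Artemis' default max-disk-usage, per the article

# Print whether a filesystem's usage has crossed the threshold.
# df -P guarantees one record per line; field 5 is the Use% column.
check_usage() {
    usage=$(df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "BLOCKED: $1 at ${usage}% (>= ${THRESHOLD}%) - Artemis will block producers"
    else
        echo "OK: $1 at ${usage}% (< ${THRESHOLD}%)"
    fi
}

# On a real Satellite you would check the Candlepin filesystem:
#   check_usage /var/lib/candlepin
check_usage /
```

This mirrors what Artemis itself logs as AMQ222210/AMQ222212 when the threshold is crossed.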
Diagnostic Steps
- Check /var/log/candlepin/error.log for repeated log lines indicating a destination address is blocked:

    2020-12-15 08:54:28,490 [thread=Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@c106471)] [=, org=, csid=] WARN org.apache.activemq.artemis.core.server - AMQ222210: Free storage space is at 69.7GB of 697.7GB total. Usage rate is 90.0% which is beyond the configured <max-disk-usage>. System will start blocking producers.
    2020-12-15 08:54:39,383 [thread=Thread-3 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@c106471)] [=, org=, csid=] WARN org.apache.activemq.artemis.core.server - AMQ222212: Disk Full! Blocking message production on address 'event.default'. Clients will report blocked.
    2020-12-15 08:54:49,383 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
    2020-12-15 08:54:59,384 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
    2020-12-15 08:55:09,385 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
- Also, the disk containing the /var/lib/candlepin filesystem will have reached 90% or more:

    Filesystem                1K-blocks      Used Available Use% Mounted on
    /dev/mapper/vgroot-varlv  838707200 759188212  79518988  91% /var
- Run the following curl command to check the response of Candlepin. Expected result: Candlepin stops responding (neither failing nor succeeding) and, after a while, starts failing with a 'parse error: premature EOF' error:

    # curl -s -k https://localhost:8443/candlepin/status | json_reformat
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.