Satellite Unresponsive when Candlepin Filesystem Hits Max Disk Usage, Artemis Queue Blocked after Upgrade to Red Hat Satellite 6

Solution Verified

Environment

  • Red Hat Satellite 6.8+

Issue

  • Clients are timing out when communicating/registering to the Satellite server

  • Httpd processes have increased significantly on the Satellite server

  • Satellite was recently upgraded to 6.8.
    The upgrade went smoothly and everything initially worked well, but after some time all services start timing out.
    Restarting the Satellite services restores the WebUI and hammer commands briefly, but the server becomes unresponsive again after a few minutes.

  • httpd service intermittently failing:

    Dec 15 09:59:38 sat6 systemd: httpd.service stop-sigterm timed out. Killing.
    Dec 15 09:59:38 sat6 systemd: httpd.service: main process exited, code=killed, status=9/KILL
    Dec 15 09:59:38 sat6 systemd: Stopped The Apache HTTP Server.
    
  • Candlepin stops working with a "Connection reset by peer - SSL_connect" error and the Satellite WebUI stops responding. The following errors are logged in /var/log/foreman/production.log:

    2020-11-11T10:41:33 [I|app|7b25ab57] Started GET "/rhsm/status/" for IP_ADDR at 2020-11-11 10:41:33 -0600
    2020-11-11T10:41:33 [I|app|7b25ab57] Processing by Katello::Api::Rhsm::CandlepinProxiesController#server_status as JSON
    2020-11-11T10:41:33 [D|kat|7b25ab57] Resource GET request: /candlepin/status
    2020-11-11T10:41:33 [D|kat|7b25ab57] Headers: {}
    2020-11-11T10:41:33 [D|kat|7b25ab57] Body: {}
    2020-11-11T10:41:33 [D|app|7b25ab57] RestClient.get "https://localhost:8443/candlepin/status", "Accept"=>"*/*", "Accept-Encoding"=>"gzip, deflate", "Authorization"=>"OAuth oauth_consumer_key=\"katello\", oauth_nonce=\"LWdta8OWW8w9kzAQi1CrbBiliDehntk5RwqCN5I0I\", oauth_signature=\"dclvfvAG%2Fbw1qEwhixs7VEf5j2s%3D\", oauth_signature_method=\"HMAC-SHA1\", oauth_timestamp=\"1605112893\", oauth_version=\"1.0\"", "User-Agent"=>"rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157"
    2020-11-11T11:18:49 [E|kat|7b25ab57] Errno::ECONNRESET: Connection reset by peer - SSL_connect
    2020-11-11T11:18:50 [I|app|7b25ab57] Completed 500 Internal Server Error in 2236198ms (Views: 41.1ms | ActiveRecord: 3.3ms | Allocations: 50666)
    2020-11-11T11:18:50 [D|app|7b25ab57] With body: {"displayMessage":"Connection reset by peer - SSL_connect","errors":["Connection reset by peer - SSL_connect"]}
    

Resolution

This issue is reported in Bug 1897344 and Bug 1898605, and has been fixed in Red Hat Satellite 6.9.4+ via Bugzilla 1973362.

Solution 1: Upgrade to Red Hat Satellite 6.9.4 or later, where this issue is fixed.

Solution 2: (if upgrading is not an option)

  • Increase the disk space for the disk on which the /var/lib/candlepin filesystem resides. If there is a leftover disk on the system, extend the logical volume that contains the /var/lib/candlepin filesystem. Otherwise, an additional virtual/physical disk will need to be added to extend the filesystem. Before extending the filesystem, be sure to stop the Satellite services with this command:

    # foreman-maintain service stop
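With the services stopped, the extension itself follows the standard LVM workflow. A hypothetical sketch, printed as a plan rather than executed because the commands are destructive: the volume group and logical volume names (vgroot/varlv) match the df output shown in the Diagnostic Steps below, and /dev/sdb is an assumed spare disk; adjust all three to the local layout before running anything.

```shell
# Hypothetical LVM plan for growing the filesystem that backs /var/lib/candlepin.
# vgroot/varlv follow the df output in the Diagnostic Steps; /dev/sdb is an
# assumed new disk. Printed as a plan only; run the commands manually after review.
plan='
pvcreate /dev/sdb                       # initialize the new disk for LVM
vgextend vgroot /dev/sdb                # add it to the existing volume group
lvextend -r -L +100G /dev/vgroot/varlv  # grow the LV and its filesystem together (-r)
'
echo "$plan"
```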
    
  • And to restart the service after the logical volume and filesystem have been extended:

    # foreman-maintain service start
    
  • If unable to increase the disk space, as a temporary and immediate workaround, raise the max-disk-usage value from its default of 90 to 99. To do so, edit the /etc/candlepin/broker.xml file on the Satellite and locate the following lines:

    <large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
    <paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>

    <addresses>
        <address name="event.default">
            <multicast>
    
  • And insert the following line to include the max-disk-usage variable:

    <large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
    <paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>

    <max-disk-usage>99</max-disk-usage>            <--- Insert this line

    <addresses>
        <address name="event.default">
            <multicast>
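The insertion can also be scripted. A minimal sketch, shown here against a copy of the relevant broker.xml fragment under /tmp so the transformation can be inspected safely; on a real Satellite, back up /etc/candlepin/broker.xml first and run the same sed against it:

```shell
# Demonstrate inserting <max-disk-usage>99</max-disk-usage> after the
# paging-directory line. Operates on a sample fragment, not the live file.
cat > /tmp/broker-fragment.xml <<'EOF'
<large-messages-directory>/var/lib/candlepin/activemq-artemis/largemsgs</large-messages-directory>
<paging-directory>/var/lib/candlepin/activemq-artemis/paging</paging-directory>
<addresses>
EOF
sed -i 's|</paging-directory>$|</paging-directory>\n<max-disk-usage>99</max-disk-usage>|' /tmp/broker-fragment.xml
cat /tmp/broker-fragment.xml
```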
    
  • Then restart the tomcat service for the change to take effect:

    # systemctl restart tomcat
    
  • Please note, this only delays the issue until the disk usage reaches 99%. The ultimate resolution is to increase the disk space. This workaround is most effective when the filesystem is large (e.g. on a 1 TB filesystem, the extra 9% is roughly 90 GB).

 

For more KB articles/solutions related to Red Hat Satellite 6.x Candlepin Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Candlepin Issues.

Root Cause

  • The disk containing the /var/lib/candlepin filesystem has reached or surpassed the 90% usage limit. Once this happens, Artemis blocks the event.default address, which makes client registration fail and causes other routine Satellite communication to stop responding normally.

Diagnostic Steps

  • Check the /var/log/candlepin/error.log for repeated log lines indicating a destination address is blocked:

    2020-12-15 08:54:28,490 [thread=Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@c106471)] [=, org=, csid=] WARN  org.apache.activemq.artemis.core.server - AMQ222210: Free storage space is at 69.7GB of 697.7GB total. Usage rate is 90.0% which is beyond the configured <max-disk-usage>. System will start blocking producers.
    2020-12-15 08:54:39,383 [thread=Thread-3 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@c106471)] [=, org=, csid=] WARN  org.apache.activemq.artemis.core.server - AMQ222212: Disk Full! Blocking message production on address 'event.default'. Clients will report blocked.
    2020-12-15 08:54:49,383 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN  org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
    2020-12-15 08:54:59,384 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN  org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
    2020-12-15 08:55:09,385 [thread=http-bio-127.0.0.1-8443-exec-6] [req=6272524a-c0c4-453a-88de-f0c5b8b6a131, org=Default_Organization, csid=f67af502-71f4-430e-bf19-17c3ee0bf278] WARN  org.apache.activemq.artemis.core.client - AMQ212054: Destination address=event.default is blocked. If the system is configured to block make sure you consume messages on this configuration.
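The AMQ222210 warning above is consistent with the default threshold: with max-disk-usage at its default of 90, Artemis starts blocking producers once free space falls to 10% of the filesystem, which for the 697.7GB filesystem in the log is about 69.8GB (the log shows 69.7GB free, just under the trigger point):

```shell
# Reproduce the arithmetic behind the AMQ222210 message: a 697.7 GB
# filesystem with max-disk-usage=90 blocks producers once free space
# drops to (100-90)% of the total.
awk 'BEGIN { total=697.7; max=90; printf "blocks when free <= %.1f GB\n", total * (100 - max) / 100 }'
```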
    
  • Also, the disk containing the /var/lib/candlepin filesystem will have reached at least 90% usage:

     Filesystem               1K-blocks      Used  Available Use% Mounted on
     /dev/mapper/vgroot-varlv 838707200 759188212   79518988  91% /var
    
  • Run the following curl command to check Candlepin's response.
    Expected result: Candlepin does not respond (neither failure nor success) and, after a while, the command starts failing with a 'parse error: premature EOF' error.

    # curl -s -k https://localhost:8443/candlepin/status | json_reformat
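Because the request can hang indefinitely while the address is blocked, it can help to bound the probe with curl's --max-time option so the check can be scripted. A sketch; the 10-second limit is an arbitrary choice:

```shell
# Bounded variant of the status probe: give up after 10 seconds instead of
# hanging while Artemis is blocking producers.
out=$(curl -s -k --max-time 10 https://localhost:8443/candlepin/status || echo "candlepin did not answer within 10s")
echo "$out"
```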
    