Geo-replication storage health-check issue
Environment
- Red Hat Quay
- 3.7
Issue
- In a three-site geo-replicated Quay environment, one of the three Quay sites went down due to a storage failure. Restarting the Quay pods in the remaining two sites caused a complete Quay shutdown. Is this expected?
- In geo-replicated environments, the GET /health/endtoend endpoint does not check all distributed storage engines; it only checks the preferred storage engine.
Resolution
Workaround 1. The failure of the storage engines in the remaining two DCs after restarting the Quay pods is expected behavior. To fix it, add overrides to the QuayRegistry CRD that disable the initial validation:
spec:
  components:
    - kind: quay
      managed: true
      overrides:
        env:
          - name: IGNORE_VALIDATION
            value: "true"
- Note:
  - The value is a boolean, so it has to be in quotation marks. This forces Quay to start even though there may be issues with specific storage engines during startup.
  - The restart runs the config tool as the first process, which performs a sanity check on the config and ensures that all components Quay hooks into are available; this is what failed all the pods. The override circumvents that check.
  - The overrides field is not to be taken lightly and should be removed from the QuayRegistry CRD as soon as possible.
Workaround 2. Remove the offending storage engine from Quay's config.yaml file. A sample configuration with one storage engine removed looks like the following:
...
DISTRIBUTED_STORAGE_CONFIG:
  default: # storage name
    - RadosGWStorage # storage driver
    - access_key: minioadmin # driver parameters
      bucket_name: quay
      hostname: 10.0.0.1
      is_secure: false
      port: "9000"
      secret_key: minioadmin
      storage_path: /datastorage/registry
  swift: # storage name
    - SwiftStorage # storage driver
    - auth_url: http://10.0.50.50/identity # driver parameters
      auth_version: "3"
      os_options:
DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS:
  - default
  - swift
DISTRIBUTED_STORAGE_PREFERENCE:
  - default
  - swift
...
- To successfully remove a storage engine, remove the storage name, the storage driver, and all parameters related to that driver from the Quay config.yaml file. Also remove the storage name from the DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS and DISTRIBUTED_STORAGE_PREFERENCE fields.
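The removal steps above can be sketched as a small helper that strips one engine from a config loaded as a dictionary. This is an illustration only: the function name and the sample engine names are hypothetical, not part of Quay.

```python
# Sketch: remove one storage engine from a Quay config loaded as a dict.
# The helper name and sample data below are illustrative, not part of Quay.

def remove_storage_engine(config: dict, name: str) -> dict:
    """Strip a storage engine (driver and parameters) from all three fields."""
    config["DISTRIBUTED_STORAGE_CONFIG"].pop(name, None)
    for key in ("DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS",
                "DISTRIBUTED_STORAGE_PREFERENCE"):
        config[key] = [loc for loc in config[key] if loc != name]
    return config

config = {
    "DISTRIBUTED_STORAGE_CONFIG": {
        "default": ["RadosGWStorage", {"bucket_name": "quay"}],
        "swift": ["SwiftStorage", {"auth_version": "3"}],
        "failed_dc": ["RadosGWStorage", {"bucket_name": "quay-dc3"}],
    },
    "DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS": ["default", "swift", "failed_dc"],
    "DISTRIBUTED_STORAGE_PREFERENCE": ["default", "swift", "failed_dc"],
}

# Drop the engine backing the failed DC from all three fields at once.
config = remove_storage_engine(config, "failed_dc")
```

Removing the engine from all three fields in one step avoids the partial-edit case where the driver block is gone but the name still appears in the preference list, which would fail validation on startup.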
- Note:
  - This change must be done on all Quay instances you are running. The Quay pods should come online afterwards.
  - Images that are stored entirely in the failed DC will not be pullable.
- Geo-replication is an asynchronous operation: it happens in batches, and only after an image has been completely pushed to the registry. There is no guarantee that all blobs for all images pushed to the failed DC were transferred to the other storage locations in time. If such an image is encountered, it should be pushed to Quay again.
- After the failed storage engine has been restored, its configuration should be restored on the remaining two Quay instances and Quay should be restarted. The blobs now held only in the remaining two DCs must be enqueued for replication to the restored DC, which can be done with the following script:
$ oc exec -it quay-pod-name -- python -m util.backfillreplication
- Proposed solution: tracked as PROJQUAY-5074, it comprises the following:
- The GET /health/instance endpoint will check each instance's preferred storage engine. For example, in DC A it will check storage A, in DC B storage B, and so on. If the storage in DC B fails, the new health check will automatically fail the Quay pods running in DC B, while the other pods continue to function.
- A change in Quay's logic for the end-to-end health check. If N-1 storage engines fail, the health check should return a 200 with a warning that the service is degraded and a list of the storage engines that are unavailable. Local instances whose storage engines failed will already have been removed from the load-balancing scheme because their instance check failed. Instead of checking only the preferred storage engine, which depends on the instance being hit, the check will now cover all storage engines defined in Quay's config.yaml file.
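The proposed degraded-mode behavior can be sketched as follows. This is a sketch of the logic described for PROJQUAY-5074, not Quay's actual implementation; the function name and engine names are hypothetical.

```python
# Sketch of the proposed end-to-end storage check: return 200 with a
# degraded warning while at least one storage engine is reachable, and
# fail outright only when every engine is down. Hypothetical helper.

def endtoend_storage_status(check_results: dict) -> tuple:
    """check_results maps each engine name to True/False reachability."""
    failed = [name for name, ok in check_results.items() if not ok]
    if len(failed) == len(check_results):
        return 503, {"status": "unavailable", "failed_storages": failed}
    if failed:
        return 200, {"status": "degraded", "failed_storages": failed}
    return 200, {"status": "healthy", "failed_storages": []}

# One of three engines (the DC B storage) is down: the service stays up
# but the response names the unavailable engine.
code, body = endtoend_storage_status({"dc_a": True, "dc_b": False, "dc_c": True})
```

The key design point is that a single unreachable engine no longer takes down every pod; the 200-with-warning response keeps the registry serving while surfacing the failed engines to monitoring.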
Root Cause
- The remaining two DCs fail because a restart was initiated on their Quay pods. The restart runs the config tool as the first process, which performs a sanity check on the configuration and ensures that all components Quay hooks into are available. When that check fails, the Quay pods fail. The check is GET /health/instance, which as of now does not verify whether distributed storage is configured, running, or available. This is also the check used by the kube-probe. If an instance uses distributed storage and that storage fails, Quay also fails.
- GET /health/endtoend does check distributed storage engines, but only the preferred one. In geo-replicated environments, the end-to-end health check should cover all defined storage engines, not just the preferred one.
Diagnostic Steps
- Check whether the Quay backend storage is working and what response you get:
$ curl -X GET -k https://quay.openshift.com/health/endtoend | jq
{
"data": {
"services": {
"auth": true,
"database": true,
"redis": true,
"storage": true
}
},
"status_code": 200
}
When the storage check fails, the same request returns an error page instead:
$ curl -X GET -k https://quay.openshift.com/health/endtoend
<html>
<head>
<title>Internal Server Error</title>
</head>
<body>
<h1><p>Internal Server Error</p></h1>
</body>
</html>
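The two response shapes above can be told apart by attempting JSON parsing first, since a failing storage check returns an HTML 500 page instead of the JSON services map. A sketch; the helper name is illustrative:

```python
import json

# Sketch: classify a /health/endtoend response body. A healthy endpoint
# returns JSON; a failing storage check returns an HTML error page.
def failed_services(body: str) -> list:
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return ["<endpoint returned non-JSON error page>"]
    services = data.get("data", {}).get("services", {})
    return [name for name, ok in services.items() if not ok]

# Sample bodies mirroring the healthy and failing responses above.
healthy = ('{"data": {"services": {"auth": true, "database": true, '
           '"redis": true, "storage": true}}, "status_code": 200}')
broken = "<html><head><title>Internal Server Error</title></head></html>"
```

This is handy in monitoring scripts, where piping the raw body to jq fails outright on the HTML error page.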
- Check Quay debug logs:
gunicorn-web stdout | 2023-03-08 07:51:29,088 [187] [ERROR] [health.services] Storage check failed with exception An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
gunicorn-web stdout | Traceback (most recent call last):
gunicorn-web stdout | File "/quay-registry/health/services.py", line 76, in _check_storage
gunicorn-web stdout | storage.validate(storage.preferred_locations, app.config["HTTPCLIENT"])
gunicorn-web stdout | File "/quay-registry/storage/distributedstorage.py", line 27, in wrapper
gunicorn-web stdout | return storage_func(*args, **kwargs)
gunicorn-web stdout | File "/quay-registry/storage/basestorage.py", line 54, in validate
gunicorn-web stdout | self.put_content("_verify", b"testing 123")
gunicorn-web stdout | File "/quay-registry/storage/cloud.py", line 227, in put_content
gunicorn-web stdout | obj.put(Body=content, **self._upload_params)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/boto3/resources/factory.py", line 580, in do_action
gunicorn-web stdout | response = action(self, *args, **kwargs)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
gunicorn-web stdout | response = getattr(parent.meta.client, operation_name)(*args, **params)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 415, in _api_call
gunicorn-web stdout | return self._make_api_call(operation_name, kwargs)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 745, in _make_api_call
gunicorn-web stdout | raise error_class(parsed_response, operation_name)
gunicorn-web stdout | botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.