Geo-replication storage health-check issue
Environment
- Red Hat Quay
- 3.7
Issue
- In a three-site geo-replicated Quay environment, one of the three Quay sites went down due to a storage failure. Restarting the Quay pods in the remaining two sites caused a complete Quay shutdown. Is this expected?
- In geo-replicated environments, the GET /health/endtoend endpoint does not check all distributed storage engines; it only checks the preferred storage engine.
Resolution
Workaround 1. The failure of the storage engines in the remaining two DCs after restarting the Quay pods is expected behavior. To fix it, add overrides to the QuayRegistry CRD that disable the initial validation:
spec:
  components:
    - kind: quay
      managed: true
      overrides:
        env:
          - name: IGNORE_VALIDATION
            value: "true"
- Note:
  - The value is a boolean, so it has to be in quotation marks. This forces Quay to start even though there may be issues with specific storage engines during startup.
  - The restart runs the config tool as the first process, which performs a sanity check on the config and ensures that all components Quay hooks into are available; this is what failed all the pods. The override circumvents that check.
  - The overrides field is not to be taken lightly and should be removed from the QuayRegistry CRD as soon as possible.
Workaround 2. Remove the offending storage engine from Quay's config.yaml file. A sample configuration with one storage engine removed looks like the following:
...
DISTRIBUTED_STORAGE_CONFIG:
  default: # storage name
    - RadosGWStorage # storage driver
    - access_key: minioadmin # driver parameters
      bucket_name: quay
      hostname: 10.0.0.1
      is_secure: false
      port: "9000"
      secret_key: minioadmin
      storage_path: /datastorage/registry
  swift: # storage name
    - SwiftStorage # storage driver
    - auth_url: http://10.0.50.50/identity # driver parameters
      auth_version: "3"
      os_options:
DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS:
  - default
  - swift
DISTRIBUTED_STORAGE_PREFERENCE:
  - default
  - swift
...
- To successfully remove a storage engine, remove the storage name, the storage driver, and all parameters related to that driver from the Quay config.yaml file. Also remove the storage name from the DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS and DISTRIBUTED_STORAGE_PREFERENCE fields.
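The removal steps above can be sketched as a small helper that strips one engine from a config loaded as a dictionary. This is an illustration only: the function name and the sample engine names are hypothetical, not part of Quay.

```python
# Sketch: remove one storage engine from a Quay config loaded as a dict.
# The helper name and sample data below are illustrative, not part of Quay.

def remove_storage_engine(config: dict, name: str) -> dict:
    """Strip a storage engine (driver and parameters) from all three fields."""
    config["DISTRIBUTED_STORAGE_CONFIG"].pop(name, None)
    for key in ("DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS",
                "DISTRIBUTED_STORAGE_PREFERENCE"):
        config[key] = [loc for loc in config[key] if loc != name]
    return config

config = {
    "DISTRIBUTED_STORAGE_CONFIG": {
        "default": ["RadosGWStorage", {"bucket_name": "quay"}],
        "swift": ["SwiftStorage", {"auth_version": "3"}],
        "failed_dc": ["RadosGWStorage", {"bucket_name": "quay-dc3"}],
    },
    "DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS": ["default", "swift", "failed_dc"],
    "DISTRIBUTED_STORAGE_PREFERENCE": ["default", "swift", "failed_dc"],
}

# Drop the engine backing the failed DC from all three fields at once.
config = remove_storage_engine(config, "failed_dc")
```

Removing the engine from all three fields in one step avoids the partial-edit case where the driver block is gone but the name still appears in the preference list, which would fail validation on startup.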
- Note:
  - This change must be done on all Quay instances you are running. The Quay pods should come online afterwards.
  - Images that are stored entirely in the failed DC will not be pullable.
- Geo-replication is an asynchronous operation: it happens in batches, and only after an image has been completely pushed to the registry. There is no guarantee that all blobs for all images pushed to the failed DC were transferred to the other storage locations in time. If such an image is encountered, it should be pushed to Quay again.
- After the failed storage engine has been restored, its configuration should be restored on the remaining two Quay instances and Quay should be restarted. The blobs now held only in the remaining two DCs must be enqueued for replication to the restored DC, which can be done with the following script:
$ oc exec -it quay-pod-name -- python -m util.backfillreplication
- Proposed solution: tracked as PROJQUAY-5074, it comprises the following:
- The GET /health/instance endpoint will check each instance's preferred storage engine. For example, in DC A it will check storage A, in DC B storage B, and so on. If the storage in DC B fails, the new health check will automatically fail the Quay pods running in DC B, while the other pods continue to function.
- A change in Quay's logic for the end-to-end health check. If N-1 storage engines fail, the health check should return a 200 with a warning that the service is degraded and a list of the storage engines that are unavailable. Local instances whose storage engines failed will already have been removed from the load-balancing scheme because their instance check failed. Instead of checking only the preferred storage engine, which depends on the instance being hit, the check will now cover all storage engines defined in Quay's config.yaml file.
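The proposed degraded-mode behavior can be sketched as follows. This is a sketch of the logic described for PROJQUAY-5074, not Quay's actual implementation; the function name and engine names are hypothetical.

```python
# Sketch of the proposed end-to-end storage check: return 200 with a
# degraded warning while at least one storage engine is reachable, and
# fail outright only when every engine is down. Hypothetical helper.

def endtoend_storage_status(check_results: dict) -> tuple:
    """check_results maps each engine name to True/False reachability."""
    failed = [name for name, ok in check_results.items() if not ok]
    if len(failed) == len(check_results):
        return 503, {"status": "unavailable", "failed_storages": failed}
    if failed:
        return 200, {"status": "degraded", "failed_storages": failed}
    return 200, {"status": "healthy", "failed_storages": []}

# One of three engines (the DC B storage) is down: the service stays up
# but the response names the unavailable engine.
code, body = endtoend_storage_status({"dc_a": True, "dc_b": False, "dc_c": True})
```

The key design point is that a single unreachable engine no longer takes down every pod; the 200-with-warning response keeps the registry serving while surfacing the failed engines to monitoring.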
Root Cause
- The remaining two DCs fail because a restart was initiated on their Quay pods. The restart runs the config tool as the first process, which performs a sanity check on the configuration and ensures that all components Quay hooks into are available. When that check fails, the Quay pods fail. The check is GET /health/instance, which as of now does not verify whether distributed storage is configured, running, or available. This is also the check used by the kube-probe. If an instance uses distributed storage and that storage fails, Quay also fails.
- GET /health/endtoend does check distributed storage engines, but only the preferred one. In geo-replicated environments, the end-to-end health check should cover all defined storage engines, not just the preferred one.
Diagnostic Steps
- Check whether the Quay backend storage is working and what response you get:
$ curl -X GET -k https://quay.openshift.com/health/endtoend | jq
{
"data": {
"services": {
"auth": true,
"database": true,
"redis": true,
"storage": true
}
},
"status_code": 200
}
When the storage check fails, the same request returns an error page instead:
$ curl -X GET -k https://quay.openshift.com/health/endtoend
<html>
<head>
<title>Internal Server Error</title>
</head>
<body>
<h1><p>Internal Server Error</p></h1>
</body>
</html>
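The two response shapes above can be told apart by attempting JSON parsing first, since a failing storage check returns an HTML 500 page instead of the JSON services map. A sketch; the helper name is illustrative:

```python
import json

# Sketch: classify a /health/endtoend response body. A healthy endpoint
# returns JSON; a failing storage check returns an HTML error page.
def failed_services(body: str) -> list:
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return ["<endpoint returned non-JSON error page>"]
    services = data.get("data", {}).get("services", {})
    return [name for name, ok in services.items() if not ok]

# Sample bodies mirroring the healthy and failing responses above.
healthy = ('{"data": {"services": {"auth": true, "database": true, '
           '"redis": true, "storage": true}}, "status_code": 200}')
broken = "<html><head><title>Internal Server Error</title></head></html>"
```

This is handy in monitoring scripts, where piping the raw body to jq fails outright on the HTML error page.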
- Check Quay debug logs:
gunicorn-web stdout | 2023-03-08 07:51:29,088 [187] [ERROR] [health.services] Storage check failed with exception An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
gunicorn-web stdout | Traceback (most recent call last):
gunicorn-web stdout | File "/quay-registry/health/services.py", line 76, in _check_storage
gunicorn-web stdout | storage.validate(storage.preferred_locations, app.config["HTTPCLIENT"])
gunicorn-web stdout | File "/quay-registry/storage/distributedstorage.py", line 27, in wrapper
gunicorn-web stdout | return storage_func(*args, **kwargs)
gunicorn-web stdout | File "/quay-registry/storage/basestorage.py", line 54, in validate
gunicorn-web stdout | self.put_content("_verify", b"testing 123")
gunicorn-web stdout | File "/quay-registry/storage/cloud.py", line 227, in put_content
gunicorn-web stdout | obj.put(Body=content, **self._upload_params)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/boto3/resources/factory.py", line 580, in do_action
gunicorn-web stdout | response = action(self, *args, **kwargs)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
gunicorn-web stdout | response = getattr(parent.meta.client, operation_name)(*args, **params)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 415, in _api_call
gunicorn-web stdout | return self._make_api_call(operation_name, kwargs)
gunicorn-web stdout | File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 745, in _make_api_call
gunicorn-web stdout | raise error_class(parsed_response, operation_name)
gunicorn-web stdout | botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.