Image pushes and pulls to/from Quay fail with a 502 Bad Gateway caused by an SSL certificate error
Environment
- Red Hat Quay (Quay) 3.7
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Data Foundation (ODF)
Issue
Image push to Quay is failing with a 502 Bad Gateway status
$ podman push registry.example.com/project-name/imagename
Getting image source signatures
Copying blob 33e20b7ab3f3 [--------------------------------------] 8.0b / 20.0KiB
Copying blob 4234c9bfd6aa [--------------------------------------] 8.0b / 55.8MiB
Copying blob 9f9118003e6z [--------------------------------------] 8.0b / 183.4MiB
Copying blob j49dc9259670 [--------------------------------------] 8.0b / 216.6MiB
Error: writing blob: initiating layer upload to /v2/project-name/imagename/blobs/uploads/ in registry.example.com: received unexpected HTTP status: 502 Bad Gateway
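A quick way to confirm that the 502 is returned by Quay's own nginx (rather than an intermediate proxy or load balancer) is to grep the registry pod logs for 502 responses while reproducing the push. The pod label `quay-component=quay-app` is the one the Quay operator typically applies; adjust the selector and namespace if your deployment differs:

```shell
# Find a Quay application pod (label is the operator's usual one; adjust if needed)
QUAY_POD=$(oc get pods -l quay-component=quay-app -o jsonpath='{.items[0].metadata.name}')

# Look for nginx access-log lines where the blob upload got a 502
oc logs "$QUAY_POD" | grep -E '"POST /v2/.*/blobs/uploads/ HTTP/1\.[01]" 502 '
```

If matching lines appear, the failure is inside the Quay deployment and the resolution below applies.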
Resolution
Note: The two procedures outlined here are functionally identical; either one should resolve the problem described in this article. Applying both will leave the initial config bundle with two separate certificates for Noobaa. This is not an error in itself; it simply means both certificates will be added to the Quay certificate store on the next container startup.
Using the OpenShift console
1. Download the new certificate chain for the Noobaa endpoint:

   oc exec -it <quay-pod-name> -- openssl s_client -connect s3.openshift-storage.svc.cluster.local:443 -showcerts 2>/dev/null </dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' >> extra_ca_cert_noobaa.crt

   Replace <quay-pod-name> with the name of any Quay pod.

2. Find the custom config bundle secret name that the operator is using to deploy Quay:

   oc get quayregistry name-of-registry -o yaml | grep -i configbundlesecret

3. Open the OpenShift console and locate the namespace where Quay is deployed. Click Workloads -> Secrets on the left side and find the custom config bundle secret. Open the secret and switch it to editing mode by clicking Actions -> Edit.

4. Scroll down to the end of the file and create a new key named extra_ca_cert_noobaa.crt. Paste the content of the extra_ca_cert_noobaa.crt file created earlier into the secret.

5. Save and let the operator reconcile the deployment. If reconciliation does not happen immediately, delete the Quay operator pod and let it restart:

   oc get pods -n openshift-operators
   oc delete pod quay-operator-xxxxx-xxxxxxxx -n openshift-operators

   Replace -n openshift-operators with another namespace if the operator is not installed in its default location.
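After reconciliation, it is worth confirming that the new certificate actually landed inside a restarted Quay pod. The mount path `/conf/stack/extra_ca_certs/` is where Quay typically places extra CA certificates from the config bundle; the pod name is a placeholder:

```shell
# List the extra CA certificates mounted into the Quay pod
oc exec <quay-pod-name> -- ls /conf/stack/extra_ca_certs/

# Inspect the new Noobaa certificate's subject and expiry date
oc exec <quay-pod-name> -- openssl x509 -noout -subject -enddate \
  -in /conf/stack/extra_ca_certs/extra_ca_cert_noobaa.crt
```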
Using the command line interface
1. Grab the new server signer certificate from the cluster store:

   $ oc get secret signing-key -n openshift-service-ca -o json | jq -r '.data."tls.crt"'
   LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1...

2. Check the config bundle secret name by inspecting the QuayRegistry custom resource:

   $ oc get quayregistry quay -o yaml | grep -i configbundle
     configBundleSecret: quay-quay-config-bundle-8nf6x

3. Check all keys in that config bundle secret; there should be a key named extra_ca_cert_service-ca.crt:

   $ oc get secret quay-quay-config-bundle-8nf6x -o json | jq '.data' | cut -d ':' -f1
   {
     "config.yaml"
     "extra_ca_cert_ca-bundle.crt"
     "extra_ca_cert_service-ca.crt"
     "ocp-cluster-wildcard.cert"
   }

4. Patch the secret with the new server signer certificate:

   $ oc patch secret quay-quay-config-bundle-8nf6x --type='json' -p='[{"op":"replace", "path":"/data/extra_ca_cert_service-ca.crt", "value": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1..."}]'

   The operator should automatically reconcile the deployment. If reconciliation does not happen immediately, delete the Quay operator pod and let it restart:

   oc get pods -n openshift-operators
   oc delete pod quay-operator-xxxxx-xxxxxxxx -n openshift-operators

   Replace -n openshift-operators with another namespace if the operator is not installed in its default location.
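Once the Quay pods have restarted with the patched bundle, TLS validation against the Noobaa endpoint can be verified from inside a pod before retrying the push. The pod name is a placeholder, and the CA file path assumes the usual `/conf/stack/extra_ca_certs/` mount used by Quay:

```shell
# Validate the Noobaa endpoint against the bundled CA; a successful chain
# validation prints "Verify return code: 0 (ok)"
oc exec <quay-pod-name> -- sh -c \
  'openssl s_client -connect s3.openshift-storage.svc.cluster.local:443 \
     -CAfile /conf/stack/extra_ca_certs/extra_ca_cert_service-ca.crt \
     </dev/null 2>/dev/null | grep "Verify return code"'

# Then retry the push that previously failed with a 502
podman push registry.example.com/project-name/imagename
```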
Root Cause
The issue is caused by either the Noobaa certificate rotation or the rotation of the cluster-wide service signing root CA. Although the operator should cover that scenario and update the certificates automatically, a known bug currently prevents it from doing so; the bug is tracked in JIRA.
At this time, the only workaround is to manually add the new certificate chain to Quay's deployment after it has rotated.
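To check whether the service signer has in fact rotated recently, the certificate in the `signing-key` secret can be decoded and inspected; its notBefore date is the time of the last rotation:

```shell
# Decode the current service signer certificate and show its validity window
oc get secret signing-key -n openshift-service-ca -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -subject -startdate -enddate
```

If the notBefore date is more recent than the last update of the Quay config bundle secret, the bundle is carrying the stale certificate and needs the manual update described above.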
Diagnostic Steps
We observe the following error in Quay logs (with debugging enabled):
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | File "/usr/local/lib/python3.9/site-packages/botocore/retryhandler.py", line 233, in __call__
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | return self._check_caught_exception(
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | File "/usr/local/lib/python3.9/site-packages/botocore/retryhandler.py", line 376, in _check_caught_exception
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | raise caught_exception
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | File "/usr/local/lib/python3.9/site-packages/botocore/endpoint.py", line 249, in _do_get_response
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | http_response = self._send(request)
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | File "/usr/local/lib/python3.9/site-packages/botocore/endpoint.py", line 321, in _send
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | return self.http_session.send(request)
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | File "/usr/local/lib/python3.9/site-packages/botocore/httpsession.py", line 466, in send
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | raise SSLError(endpoint_url=request.url, error=e)
2023-06-14T20:14:35.568255077Z gunicorn-registry stdout | botocore.exceptions.SSLError: SSL validation failed for https://s3.openshift-storage.svc.cluster.local:443/quay-datastore-2de49366-4ce7-41d1-a4e4-243f047d9eb5 [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)
2023-06-14T20:14:35.569267414Z nginx stdout | 2023/06/14 20:14:35 [error] 94#0: *173 upstream prematurely closed connection while reading response header from upstream, client: 10.155.4.1, server: _, request: "POST /v2/{NAMESPACE}/{REPOSITORY}/blobs/uploads/ HTTP/1.1", upstream: "http://unix:/tmp/gunicorn_registry.sock:/v2/{NAMESPACE}/{REPOSITORY}/blobs/uploads/", host: "QUAY_HOSTNAME"
2023-06-14T20:14:35.572281659Z nginx stdout | 10.155.4.1 (-) - - [14/Jun/2023:20:14:35 +0000] "POST /v2/{NAMESPACE}/{REPOSITORY}/blobs/uploads/ HTTP/1.1" 502 337 "-" "containers/5.22.1 (github.com/containers/image)" (7.269 1403 7.266 : 0.003)
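The CERTIFICATE_VERIFY_FAILED error that botocore reports can be reproduced directly from inside a Quay pod with openssl, without going through a push. Before the fix, validating against the pod's default trust store reports a non-zero verify code such as "self-signed certificate in certificate chain" (the pod name is a placeholder):

```shell
# Reproduce the SSL validation failure Quay's storage driver hits
oc exec <quay-pod-name> -- sh -c \
  'openssl s_client -connect s3.openshift-storage.svc.cluster.local:443 \
     </dev/null 2>/dev/null | grep "Verify return code"'
```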
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.