etcd 3.3 Incident Response
On Thursday, January 31, 2019, Red Hat publicly released a version of etcd that was incorrectly labeled. The bad image was released in the rhel-7-server-extras-rpms RPM channel and the Red Hat Container Catalog. Specifically, an etcd container image labeled with version 3.2.22-24 was released with etcd 3.3.11 inside the container. By Tuesday, February 5, 2019, Red Hat had identified the issue and reverted the etcd container image and RPM channels to a healthy image. Etcd container 3.2.22-18 is now the correct and healthy container image.
OpenShift 3.0 to 3.9 uses RPMs by default for etcd installations. OpenShift 3.10 to 3.11 uses container images for etcd installations. Not all customers will be affected by this issue. If you are running an RPM based etcd with the RHEL 7 Extras channel attached to the RHEL hosts, and you issued a ‘yum update’ for all RPMs installed on your etcd hosts during the six days from January 31 to February 5, you would have pulled the bad RPM. Likewise, if you are running a container based etcd and you upgraded your cluster, scaled up the etcd nodes, or manually installed the latest etcd container image during those same six days, you would have pulled the bad container image. Please refer to the solution knowledge article for the exact commands to tell whether you are running the bad etcd (version 3.3.11). That article also details the procedure you will need to follow to revert to a good etcd configuration.
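The knowledge article remains the authoritative source for the commands to run. As a rough sketch of the decision itself, the snippet below assumes you have already captured the version text reported by the etcd binary (assumed here to look like "etcd Version: 3.3.11") and simply checks whether it falls in the 3.3 series. The function names are hypothetical and are not part of any Red Hat tooling.

```python
import re

# Assumption: the input resembles a version line reported by the etcd
# binary, for example "etcd Version: 3.3.11".
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_etcd_version(text):
    """Return (major, minor, patch) for the first x.y.z found, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_affected(text):
    """True when the reported etcd version belongs to the 3.3 series."""
    version = parse_etcd_version(text)
    return version is not None and version[:2] == (3, 3)
```

Note that the check has to be made against what the etcd binary itself reports. The image label alone is not sufficient, because the bad image carried the label 3.2.22-24 while containing etcd 3.3.11.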
Check usage of etcd 3.3 in OpenShift cluster
At this point, the more common OpenShift deployment will be a container based etcd. Customers who have modified their installation away from the default, and either removed the tag or specified the ‘latest’ tag for the etcd container image, are more likely to be impacted. The etcd image on the Red Hat Container Catalog is labeled with both latest and the specific version. This might lead some users to believe that OpenShift pulls the latest etcd image at all times. That assumption is not correct.
Users that are mirroring the Red Hat Container Catalog should be following the instructions found in the product documentation:
OCP Disconnected Installation - Populate Registry
Specifically, administrators should pay attention to the recommendation to tag/import based on a version. Users who want the OpenShift Ansible installer to pull from this mirrored registry can set the oreg_url variable in the Ansible inventory file:
oreg_url=example.registry.com:port/openshift3/ose-${component}:${version}
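To make the ${component} and ${version} placeholders concrete, here is a minimal sketch of how such a format string expands. It uses Python's string.Template, which happens to share the ${name} syntax. It illustrates the substitution only and is not the installer's actual code; the component name "pod" is an assumption chosen for the example.

```python
from string import Template


def expand_image_format(fmt, component, version):
    """Fill the ${component} and ${version} placeholders in a format string."""
    return Template(fmt).substitute(component=component, version=version)


fmt = "example.registry.com:port/openshift3/ose-${component}:${version}"

# "pod" is a hypothetical component name used only for illustration.
print(expand_image_format(fmt, "pod", "v3.11"))
# prints: example.registry.com:port/openshift3/ose-pod:v3.11
```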
If you install OpenShift out of the box, you will get a version locked etcd. If you set oreg_url to the desired imageConfig.format string (as seen above), the installer derives every image name correctly from that point. For etcd specifically, the installer sets a tag of "3.2.22", and most other components get "v3.11" or whatever is appropriate for their release. However, we allow specific images to be overridden from the Ansible inventory file so that users may use a custom image for one component and a default image for everything else. When that happens, no additional logic is applied and we use exactly what has been provided. In the case of etcd, a customer might override the default from oreg_url with the variables etcd_image and osm_etcd_image. If those values are not version specific, they look like this:
etcd_image=example.registry.com:port/rhel7/etcd
osm_etcd_image=example.registry.com:port/rhel7/etcd
This can be a problematic setting. If the etcd image declared in the override variables is not version specific, OpenShift will pull the latest. What makes matters worse is that etcd is now a static pod. Static pods with no tag or the latest tag will always try to pull their container image when they stop and start or when their host reboots. This is explained here:
OCP Disconnected Installation - Image Pull Policy
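To audit an inventory for this setting, something like the following sketch could be used. It assumes the overrides appear as plain key=value lines, as in the examples above, and flags any etcd_image or osm_etcd_image value that has no tag or uses the latest tag. The function names are hypothetical, and the parsing is deliberately simple rather than a full Ansible inventory parser.

```python
# Inventory variables named in this article that override the etcd image.
ETCD_IMAGE_VARS = ("etcd_image", "osm_etcd_image")


def image_tag(image):
    """Return the tag of an image reference, or None if it has no tag."""
    # Only the last path segment can carry a tag; a colon earlier in the
    # string belongs to the registry port (registry:port/...).
    last_segment = image.rsplit("/", 1)[-1]
    _name, sep, tag = last_segment.partition(":")
    return tag if sep and tag else None


def find_unpinned_etcd_images(inventory_text):
    """Return (variable, value) pairs whose image is untagged or 'latest'."""
    findings = []
    for raw_line in inventory_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in ETCD_IMAGE_VARS:
            continue
        value = value.strip().strip("'\"")
        tag = image_tag(value)
        if tag is None or tag == "latest":
            findings.append((key, value))
    return findings
```

Run against the two override lines shown earlier, this reports both variables, because neither value carries a tag. A value such as example.registry.com:port/rhel7/etcd:3.2.22 (using the tag the installer sets by default) would not be reported.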