EtcdCertSignerControllerDegraded error on etcd operator

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform 4.x.

Issue

  • The following certificate error is returned when executing oc describe co etcd.
    • The same error message shows up multiple times, at least once for every etcd member.
    • Server names and IP addresses have been hidden.
    Message:               EtcdCertSignerControllerDegraded: [SAN for the certificate <etcd_member_name> does not include <IP>: x509: certificate is valid for <IP>, not <IP>, SAN for the certificate [...]
    Reason:                EtcdCertSignerController_Error
    Status:                True
    Type:                  Degraded

Resolution

  • The fix is to remove the affected secrets whose names start with etcd-peer-, etcd-serving- and etcd-serving-metrics- in the openshift-etcd project. List them by executing the command below:
$ oc get secret -n openshift-etcd | grep -E 'etcd-peer-|etcd-serving-'
etcd-peer-ip-10-0-150-219.eu-central-1.compute.internal              kubernetes.io/tls                     2      19s
etcd-peer-ip-10-0-163-28.eu-central-1.compute.internal               kubernetes.io/tls                     2      18s
etcd-peer-ip-10-0-211-121.eu-central-1.compute.internal              kubernetes.io/tls                     2      18s
etcd-serving-metrics-ip-10-0-150-219.eu-central-1.compute.internal   kubernetes.io/tls                     2      19s
etcd-serving-metrics-ip-10-0-163-28.eu-central-1.compute.internal    kubernetes.io/tls                     2      18s
etcd-serving-metrics-ip-10-0-211-121.eu-central-1.compute.internal   kubernetes.io/tls                     2      18s
etcd-serving-ip-10-0-150-219.eu-central-1.compute.internal   kubernetes.io/tls                     2      19s
etcd-serving-ip-10-0-163-28.eu-central-1.compute.internal    kubernetes.io/tls                     2      18s
etcd-serving-ip-10-0-211-121.eu-central-1.compute.internal   kubernetes.io/tls                     2      18s
  • You can check the certificate data by running the following command (example for one certificate from a cluster with IPv4/IPv6 dual stack):
$ oc get secret etcd-serving-ip-10-0-163-28.eu-central-1.compute.internal -n openshift-etcd -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -text | grep -i Alt -A1
           X509v3 Subject Alternative Name:
                DNS:etcd.kube-system.svc, DNS:etcd.kube-system.svc.cluster.local, DNS:etcd.openshift-etcd.svc, DNS:etcd.openshift-etcd.svc.cluster.local, DNS:localhost, DNS:::1, DNS:10.0.163.28, DNS:127.0.0.1, DNS:2d00:9a00:6000:30c::28:343e, DNS:::1, IP Address:0:0:0:0:0:0:0:1, IP Address:10.0.163.28, IP Address:127.0.0.1, IP Address:2D00:9A00:6000:30C::28:343E, IP Address:0:0:0:0:0:0:0:1
  • Remove only the secrets whose certificate Subject Alternative Name does not include the valid IP addresses of the node.

  • To remove them, you can execute oc delete secret <secret_name>.

  • Then update the advertised peer URLs to reflect the new IP addresses (see the etcd documentation at etcd.io).

  • If the master nodes have two NICs configured and the issue still occurs because of the IP address of the second NIC, even after updating the peer URLs, a workaround is to disable the secondary NIC on the nodes in order to proceed with the upgrade.

  • If the steps described do not solve the problem, or you find any other issue, please contact Red Hat Support to investigate the problem further.
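
The certificate check above can be repeated for each secret with a small shell helper. This is only a sketch under assumptions: the function names san_has_ip and check_secret are hypothetical (they are not part of oc or any OpenShift tooling), it assumes that oc, jq (1.6 or later, for @base64d) and openssl are installed, and it assumes that you already know which IP address each node is expected to have.

```shell
# Hypothetical helpers; the names are illustrative only.

# san_has_ip IP
# Reads "openssl x509 -noout -text" output (or just the SAN line) on stdin and
# succeeds only if the list contains "IP Address:<IP>" as a complete entry.
san_has_ip() {
  local ip="$1"
  # Split the comma-separated SAN entries onto separate lines, trim surrounding
  # whitespace, then look for an exact, whole-line, fixed-string match.
  tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -Fxq "IP Address:${ip}"
}

# check_secret SECRET_NAME EXPECTED_IP
# Prints OK if the certificate stored in the secret lists EXPECTED_IP, else STALE.
# Note: a failing oc, jq or openssl call is also reported as STALE, so verify
# the result manually before deleting anything.
check_secret() {
  local secret="$1" ip="$2"
  if oc get secret "$secret" -n openshift-etcd -o json \
      | jq -r '.data."tls.crt" | @base64d' \
      | openssl x509 -noout -text 2>/dev/null \
      | san_has_ip "$ip"; then
    echo "OK      $secret"
  else
    echo "STALE   $secret"
  fi
}
```

For example, check_secret etcd-serving-ip-10-0-163-28.eu-central-1.compute.internal 10.0.163.28 would report whether that secret still covers 10.0.163.28. IPv6 addresses have to be passed in the same form openssl prints them (uppercase, as in the sample output above). Secrets reported as STALE are candidates for oc delete secret <secret_name>.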

Root Cause

The problem has two possible root causes:

  1. The IP addresses of the master nodes are not persistent and changed after a reboot. This can happen, for example, during an upgrade.
    • Make sure that the IP addresses of your masters are persistent and never change, even when the nodes are rebooted. If your cluster is in AWS or Azure, you can check this with your cloud provider.
  2. A second IP address was configured on the master nodes after the cluster was installed.
    • This is a known bug. At the time of writing this KCS article (latest version: 4.7), Red Hat Engineering is working to fix it. Please contact Red Hat Support for more information.

If you have experienced this problem but none of the circumstances described apply, please contact Red Hat Support and report your problem.
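
To see whether the first root cause applies, one option is to record the master node addresses before a reboot or upgrade and compare them afterwards. The following is a minimal sketch, not an official procedure: the function names list_master_ips and changed_nodes and the file names before.txt and after.txt are hypothetical, and it assumes the masters carry the node-role.kubernetes.io/master label.

```shell
# Hypothetical helpers; the names are illustrative only.

# list_master_ips
# Prints one sorted line per master node: "<node>\t<InternalIP>[ <InternalIP>...]".
list_master_ips() {
  oc get nodes -l node-role.kubernetes.io/master \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
    | sort
}

# changed_nodes BEFORE_FILE AFTER_FILE
# Prints the lines of AFTER_FILE that are not present in BEFORE_FILE, that is,
# nodes whose reported addresses changed. Both files must be sorted.
changed_nodes() {
  comm -13 "$1" "$2"
}
```

A possible usage is to run list_master_ips > before.txt ahead of the upgrade, list_master_ips > after.txt once the nodes are back, and then changed_nodes before.txt after.txt; empty output means the reported addresses did not change. For the second root cause, the addresses configured on a host can be inspected with oc debug node/<node_name> -- chroot /host ip -o addr show.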

