"token-exchange-agent" pod on managed cluster is unstable after upgrade to ODF 4.16.0
Environment
- OpenShift Data Foundation 4.16.0
Issue
In a Disaster Recovery setup, after upgrading the cluster from 4.15.z to 4.16.0, the "token-exchange-agent" pod on the managed cluster is unstable. The pod crashes a few times, and a new pod spins up to replace it. Sometimes there is no crash, but a new pod still comes up.
This can interfere with creating a new DRPolicy and with Failover actions.
Resolution
On the ACM hub cluster, run the following commands to get the ManagedClusters used to set up Disaster Recovery and delete the leftover "maintenance" addon from them.
1. Get the names of the ManagedClusters that have disaster recovery configured.
    oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName"
2. On each ManagedCluster that has disaster recovery configured, delete the "maintenance" addon from it.
    oc delete managedclusteraddon maintenance -n <ManagedClusterName>
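When several ManagedClusters are involved, the delete command can be wrapped in a small loop. The following is a minimal sketch, not part of the official procedure: the function name delete_maintenance_addons is hypothetical, and it assumes the cluster names are passed as arguments and that the session is logged in to the ACM hub cluster.

```shell
# Hypothetical helper: delete the leftover "maintenance" addon from every
# ManagedCluster namespace given as an argument (run against the ACM hub).
delete_maintenance_addons() {
  for cluster in "$@"; do
    # --ignore-not-found keeps the loop going when the addon is already gone.
    oc delete managedclusteraddon maintenance -n "$cluster" --ignore-not-found
  done
}
```

One possible invocation feeds it the output of the mirrorpeer query above, for example: delete_maintenance_addons $(oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName" | sort -u)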
Root Cause
In ODF 4.16.0, we removed the "maintenance" addon deployment. However, when upgrading from 4.15.z, the old deployment resources are not cleaned up properly. This causes other component deployments to crash due to resource conflicts.
Diagnostic Steps
After upgrading to ODF 4.16.0, run the following commands on the ACM hub cluster.
1. Get the names of the ManagedClusters that have disaster recovery configured.
    oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName"
2. On each ManagedCluster that has disaster recovery configured, check if the "maintenance" addon is deployed. All clusters that have this "maintenance" addon are impacted.
    oc get managedclusteraddon maintenance -n <ManagedClusterName>
3. For each of the ManagedClusters that are impacted, get the namespace of the unstable "token-exchange-agent" pod.
    oc get managedclusteraddon tokenexchange -n <ManagedClusterName> -ojson | jq .spec.installNamespace
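The addon check on the hub cluster can be scripted across all clusters at once. The sketch below is illustrative only: the function name list_impacted_clusters is hypothetical, and it treats any failure of oc get (including permission or connectivity errors) as "addon not present", so a cluster missing from its output should still be checked by hand if in doubt.

```shell
# Hypothetical helper: print each ManagedCluster (passed as an argument)
# that still has the "maintenance" addon on the ACM hub.
list_impacted_clusters() {
  for cluster in "$@"; do
    # oc get exits non-zero when the addon does not exist in the namespace.
    if oc get managedclusteraddon maintenance -n "$cluster" >/dev/null 2>&1; then
      echo "$cluster"
    fi
  done
}
```

It could be called with the cluster names returned by the mirrorpeer query, for example: list_impacted_clusters $(oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName" | sort -u)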
On each of the impacted ManagedClusters, run the following commands.
1. Get all the "token-exchange-agent" pods in the namespace obtained from Step 3 above.
    oc get pod -l app=token-exchange-agent -n <InstallNamespace>
If there is more than one pod, the cluster is impacted.
2. Check the logs of the "token-exchange-agent" pod for the following error.
    oc logs <TokenExchangeAgentPod> -n <InstallNamespace>
W0621 08:37:09.418144 1 reflector.go:539] pkg/mod/k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229: failed to list *v1alpha1.MirrorPeer: mirrorpeers.multicluster.odf.openshift.io is forbidden: User "system:open-cluster-management:cluster:sagrawal-c1:addon:maintenance:agent:maintenance" cannot list resource "mirrorpeers" in API group "multicluster.odf.openshift.io" at the cluster scope
E0621 08:37:09.418172 1 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229: Failed to watch *v1alpha1.MirrorPeer: failed to list *v1alpha1.MirrorPeer: mirrorpeers.multicluster.odf.openshift.io is forbidden: User "system:open-cluster-management:cluster:sagrawal-c1:addon:maintenance:agent:maintenance" cannot list resource "mirrorpeers" in API group "multicluster.odf.openshift.io" at the cluster scope
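The two managed-cluster checks can also be expressed as small shell helpers. This is a sketch under stated assumptions: both function names are hypothetical, and the grep pattern is taken from the sample log lines above, so it may need adjusting if the message differs in another environment.

```shell
# Hypothetical helper: count the "token-exchange-agent" pods in a namespace.
# Usage: count_agent_pods <InstallNamespace>
count_agent_pods() {
  # --no-headers drops the column header so each output line is one pod.
  oc get pod -l app=token-exchange-agent -n "$1" --no-headers 2>/dev/null | grep -c .
}

# Hypothetical helper: succeed if the pod log contains the "forbidden" error
# raised for the leftover "maintenance" addon agent user.
# Usage: agent_log_shows_maintenance_error <TokenExchangeAgentPod> <InstallNamespace>
agent_log_shows_maintenance_error() {
  oc logs "$1" -n "$2" |
    grep -q 'addon:maintenance:agent:maintenance" cannot list resource "mirrorpeers"'
}
```

A count greater than one from count_agent_pods, or a zero exit status from agent_log_shows_maintenance_error, corresponds to the impacted conditions described in the steps above.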
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.