"token-exchange-agent" pod on managed cluster is unstable after upgrade to ODF 4.16.0

Solution Verified - Updated

Environment

  • OpenShift Data Foundation 4.16.0

Issue

On a Disaster Recovery setup, after upgrading the cluster from 4.15.z to 4.16.0, the "token-exchange-agent" pod on the managed cluster is unstable. The pod crashes a few times, and each time a new pod spins up to replace it. Sometimes there is no crash, but a new pod still comes up.

This affects the creation of new DRPolicy resources and Failover actions, which may not complete successfully.

Resolution

On the ACM hub cluster, run the following commands to get the ManagedClusters used to set up Disaster Recovery and delete the leftover "maintenance" addon from them.

  1. Get names of ManagedClusters that have disaster recovery configured.

    oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName"
    
  2. For each ManagedCluster that has disaster recovery configured, delete the "maintenance" addon from its namespace on the hub cluster.

    oc delete managedclusteraddon maintenance -n <ManagedClusterName>
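
The two steps above can be combined into a small loop. The following is a minimal sketch rather than part of the verified procedure: the function name `delete_maintenance_addon` is hypothetical, and the `--ignore-not-found` flag is an addition so that a cluster where the addon has already been removed does not produce an error.

```shell
# Hypothetical helper: delete the leftover "maintenance" addon for each
# ManagedCluster name passed as an argument. Run on the ACM hub cluster.
delete_maintenance_addon() {
  local cluster
  for cluster in "$@"; do
    echo "Deleting maintenance addon for ${cluster}"
    # --ignore-not-found (an addition to the documented command) keeps the
    # loop going when the addon is already gone.
    oc delete managedclusteraddon maintenance -n "${cluster}" --ignore-not-found
  done
}

# Example usage, feeding in the cluster names from step 1:
#   delete_maintenance_addon $(oc get mirrorpeer -A -ojson \
#     | jq -r ".items[].spec.items[].clusterName" | sort -u)
```

Passing the cluster names as arguments keeps the deletion separate from the lookup, so the list can be reviewed before anything is deleted.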
    

Root Cause

In ODF 4.16.0, we removed the "maintenance" addon deployment. However, when upgrading from 4.15.z, the old deployment resources are not cleaned up properly. This causes other component deployments to crash due to resource conflicts.

Diagnostic Steps

After upgrading to ODF 4.16.0, run the following commands on the ACM hub cluster.

  1. Get names of ManagedClusters that have disaster recovery configured.

    oc get mirrorpeer -A -ojson | jq -r ".items[].spec.items[].clusterName"
    
  2. For each ManagedCluster that has disaster recovery configured, check whether the "maintenance" addon is deployed. All clusters that still have this addon are impacted.

    oc get managedclusteraddon maintenance -n <ManagedClusterName>
    
  3. For each impacted ManagedCluster, get the namespace of the unstable "token-exchange-agent" pod.

    oc get managedclusteraddon tokenexchange -n <ManagedClusterName> -ojson | jq -r .spec.installNamespace
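
Steps 2 and 3 above can be sketched as a single loop over the cluster names from step 1. This is an illustrative sketch under stated assumptions: the function name `report_impacted_clusters` is hypothetical, and it reads the install namespace with `-o jsonpath` instead of `jq`, which is a substitution and not the command used in the steps above.

```shell
# Hypothetical helper: for each ManagedCluster name passed as an argument,
# report whether the "maintenance" addon still exists and, if it does, print
# the namespace where the "tokenexchange" addon is installed.
# Run on the ACM hub cluster.
report_impacted_clusters() {
  local cluster ns
  for cluster in "$@"; do
    # "oc get" exits non-zero when the addon is absent (or cannot be read).
    if oc get managedclusteraddon maintenance -n "${cluster}" >/dev/null 2>&1; then
      ns=$(oc get managedclusteraddon tokenexchange -n "${cluster}" \
             -o jsonpath='{.spec.installNamespace}')
      echo "${cluster}: impacted (token-exchange-agent namespace: ${ns})"
    else
      echo "${cluster}: maintenance addon not found"
    fi
  done
}

# Example usage:
#   report_impacted_clusters $(oc get mirrorpeer -A -ojson \
#     | jq -r ".items[].spec.items[].clusterName" | sort -u)
```

Note that a failed lookup caused by missing permissions is reported the same way as an absent addon, so an unexpected "not found" result is worth confirming by running the command from step 2 directly.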
    

On each of the impacted ManagedClusters, run the following commands.

  1. Get all the "token-exchange-agent" pods in the namespace obtained from Step 3 above.

    oc get pod -l app=token-exchange-agent -n <InstallNamespace>
    

    If there is more than one pod, the cluster is impacted.

  2. Check logs of the "token-exchange-agent" pod for the following error.

    oc logs <TokenExchangeAgentPod> -n <InstallNamespace>
    
        W0621 08:37:09.418144       1 reflector.go:539] pkg/mod/k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229: failed to list *v1alpha1.MirrorPeer: mirrorpeers.multicluster.odf.openshift.io is forbidden: User "system:open-cluster-management:cluster:sagrawal-c1:addon:maintenance:agent:maintenance" cannot list resource "mirrorpeers" in API group "multicluster.odf.openshift.io" at the cluster scope
    
        E0621 08:37:09.418172       1 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229: Failed to watch *v1alpha1.MirrorPeer: failed to list *v1alpha1.MirrorPeer: mirrorpeers.multicluster.odf.openshift.io is forbidden: User "system:open-cluster-management:cluster:sagrawal-c1:addon:maintenance:agent:maintenance" cannot list resource "mirrorpeers" in API group "multicluster.odf.openshift.io" at the cluster scope
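
The two checks on the managed cluster can be sketched together as follows. This is a minimal sketch, not part of the verified procedure: the function name `check_token_exchange_agent` is hypothetical, and the `grep` pattern is an assumption derived from the sample log lines above (the "maintenance" addon user being forbidden from listing "mirrorpeers").

```shell
# Hypothetical helper: count the token-exchange-agent pods in a namespace and
# search their logs for the "forbidden" error shown above.
# Usage: check_token_exchange_agent <InstallNamespace>
# Run on an impacted ManagedCluster.
check_token_exchange_agent() {
  local ns="$1" pods count pod
  pods=$(oc get pod -l app=token-exchange-agent -n "${ns}" -o name)
  # Count non-empty lines; "|| true" covers the zero-pod case, where grep
  # prints 0 but exits non-zero.
  count=$(printf '%s\n' "${pods}" | grep -c . || true)
  echo "token-exchange-agent pods in ${ns}: ${count}"
  if [ "${count}" -gt 1 ]; then
    echo "More than one pod found: the cluster is impacted"
  fi
  # Pattern assumed from the sample log output in this article.
  for pod in ${pods}; do
    if oc logs "${pod}" -n "${ns}" \
        | grep -q 'addon:maintenance:agent:maintenance.*cannot list resource "mirrorpeers"'; then
      echo "${pod}: maintenance addon RBAC error found in logs"
    fi
  done
}

# Example usage:
#   check_token_exchange_agent <InstallNamespace>
```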
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.