Worker machines not joining cluster after installing, or upgrading to, 4.14 OpenShift on Azure Stack Hub platform
Environment
- Red Hat OpenShift Container Platform 4.13 and 4.14 on Azure Stack Hub
Issue
- The problem occurs when installing OpenShift 4.14, or when upgrading from OpenShift 4.13 to 4.14, on Azure Stack Hub platform.
- When upgrading from 4.13 to 4.14, the upgrade will succeed, but the Cloud Controller Manager will report errors in its logs (see the Diagnostic Steps section for details).
- When installing 4.14, worker nodes will fail to be added to the public load balancer. As a result, the worker nodes will not be added to the cluster, and the associated machine objects will not be created.
- For new installations, the OpenShift console will not be available, nor will other workloads with load balancer type services that are exposed from worker nodes.
Resolution
- This is a known bug and Red Hat Engineering is working to solve it: OCPBUGS-20213.
- To properly configure an OpenShift 4.14 cluster for use on Azure Stack Hub, users must update the default virtual machine type to “standard” in the cloud controller manager configuration.
- Workaround (Upgrading): to resolve this issue when upgrading from OpenShift version 4.13 to version 4.14, the ConfigMap `cloud-provider-config` in the `openshift-config` project must be updated before running the upgrade commands. Before running the final upgrade commands, please follow these instructions:
- Run `oc edit -n openshift-config configmap cloud-provider-config`.
- Edit the `config` value under the `data` field so that the embedded cloud configuration contains the string `\"vmType\": \"standard\"`.
- The end result should look similar to this:
apiVersion: v1
data:
config: "{\n\t\"cloud\": \"AzureStackCloud\",\n\t\"tenantId\": \"{CENSORED}\",\n\t\"aadClientId\":
\"\",\n\t\"aadClientSecret\": \"\",\n\t\"aadClientCertPath\": \"\",\n\t\"aadClientCertPassword\":
\"\",\n\t\"useManagedIdentityExtension\": false,\n\t\"userAssignedIdentityID\":
\"\",\n\t\"subscriptionId\": \"{CENSORED}\",\n\t\"resourceManagerEndpoint\":
\"https://{CENSORED}\",\n\t\"resourceGroup\": \"testing-rg\",\n\t\"location\":
\"{CENSORED}\",\n\t\"vnetName\": \"testing-vnet\",\n\t\"vnetResourceGroup\":
\"testing-rg\",\n\t\"subnetName\": \"testing-worker-subnet\",\n\t\"securityGroupName\":
\"testing-nsg\",\n\t\"routeTableName\": \"testing-node-routetable\",\n\t\"primaryAvailabilitySetName\":
\"\",\n\t\"vmType\": \"standard\",\n\t\"primaryScaleSetName\": \"\",\n\t\"cloudProviderBackoff\":
true,\n\t\"cloudProviderBackoffRetries\": 0,\n\t\"cloudProviderBackoffExponent\":
0,\n\t\"cloudProviderBackoffDuration\": 6,\n\t\"cloudProviderBackoffJitter\":
0,\n\t\"cloudProviderRateLimit\": false,\n\t\"cloudProviderRateLimitQPS\": 0,\n\t\"cloudProviderRateLimitBucket\":
0,\n\t\"cloudProviderRateLimitQPSWrite\": 0,\n\t\"cloudProviderRateLimitBucketWrite\":
0,\n\t\"useInstanceMetadata\": false,\n\t\"loadBalancerSku\": \"basic\",\n\t\"excludeMasterFromStandardLB\":
false,\n\t\"disableOutboundSNAT\": null,\n\t\"maximumLoadBalancerRuleCount\":
0\n}\n"
endpoints: '{"name":"HybridEnvironment","managementPortalURL":"","publishSettingsURL":"","serviceManagementEndpoint":"","resourceManagerEndpoint":"https://{CENSORED}","activeDirectoryEndpoint":"https://login.microsoftonline.com/","galleryEndpoint":"https://{CENSORED}/","keyVaultEndpoint":"https://{CENSORED}","managedHSMEndpoint":"","graphEndpoint":"https://graph.windows.net/","serviceBusEndpoint":"","batchManagementEndpoint":"","microsoftGraphEndpoint":"","storageEndpointSuffix":"{CENSORED}","cosmosDBDNSSuffix":"","mariaDBDNSSuffix":"","mySqlDatabaseDNSSuffix":"","postgresqlDatabaseDNSSuffix":"","sqlDatabaseDNSSuffix":"","trafficManagerDNSSuffix":"","keyVaultDNSSuffix":"vault.mtcazs.wwtatc.com","managedHSMDNSSuffix":"","serviceBusEndpointSuffix":"","serviceManagementVMDNSSuffix":"","resourceManagerVMDNSSuffix":"","containerRegistryDNSSuffix":"","tokenAudience":"https://{CENSORED}","apiManagementHostNameSuffix":"","synapseEndpointSuffix":"","datalakeSuffix":"","resourceIdentifiers":{"graph":"","keyVault":"","datalake":"","batch":"","operationalInsights":"","ossRDBMS":"","storage":"","synapse":"","serviceBus":"","sqlDatabase":"","cosmosDB":"","managedHSM":"","microsoftGraph":""}}'
kind: ConfigMap
metadata:
creationTimestamp: "2023-01-01T00:00:00Z"
name: cloud-provider-config
namespace: openshift-config
resourceVersion: "0"
uid: {CENSORED}
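The edit above amounts to setting a single key in the JSON document stored under the ConfigMap's `config` field. As a minimal sketch of that change (the `set_vm_type` helper and the sample string below are illustrative, not part of OpenShift), the transformation looks like this:

```python
import json

def set_vm_type(cloud_conf: str, vm_type: str = "standard") -> str:
    """Return the cloud config JSON with vmType forced to the given value."""
    conf = json.loads(cloud_conf)
    conf["vmType"] = vm_type
    return json.dumps(conf, indent=4)

# Stand-in for the JSON string stored in the ConfigMap's "config" key;
# the real value contains many more fields, as shown above.
sample = '{"cloud": "AzureStackCloud", "vmType": ""}'
print(set_vm_type(sample))
```

Whether the existing value of `vmType` is empty or set to something else, forcing it to `standard` is what prevents the cloud controller manager from treating the nodes as VMSS instances.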
- Workaround (Installing): to resolve this issue when installing new OpenShift 4.14 clusters on Azure Stack Hub, the cloud provider configuration must be updated before the installation is started. Before running the final installation commands, please follow these instructions:
- Run `openshift-install create manifests` to create the necessary installation manifests.
- Edit the `manifests/cloud-provider-config.yaml` file so that the `config` value under the `data` field contains the string `\"vmType\": \"standard\"`.
- The end result should look similar to this:
apiVersion: v1
data:
config: "{\n\t\"cloud\": \"AzureStackCloud\",\n\t\"tenantId\": \"{CENSORED}\",\n\t\"aadClientId\":
\"\",\n\t\"aadClientSecret\": \"\",\n\t\"aadClientCertPath\": \"\",\n\t\"aadClientCertPassword\":
\"\",\n\t\"useManagedIdentityExtension\": false,\n\t\"userAssignedIdentityID\":
\"\",\n\t\"subscriptionId\": \"{CENSORED}\",\n\t\"resourceManagerEndpoint\":
\"https://{CENSORED}\",\n\t\"resourceGroup\": \"testing-rg\",\n\t\"location\":
\"{CENSORED}\",\n\t\"vnetName\": \"testing-vnet\",\n\t\"vnetResourceGroup\":
\"testing-rg\",\n\t\"subnetName\": \"testing-worker-subnet\",\n\t\"securityGroupName\":
\"testing-nsg\",\n\t\"routeTableName\": \"testing-node-routetable\",\n\t\"primaryAvailabilitySetName\":
\"\",\n\t\"vmType\": \"standard\",\n\t\"primaryScaleSetName\": \"\",\n\t\"cloudProviderBackoff\":
true,\n\t\"cloudProviderBackoffRetries\": 0,\n\t\"cloudProviderBackoffExponent\":
0,\n\t\"cloudProviderBackoffDuration\": 6,\n\t\"cloudProviderBackoffJitter\":
0,\n\t\"cloudProviderRateLimit\": false,\n\t\"cloudProviderRateLimitQPS\": 0,\n\t\"cloudProviderRateLimitBucket\":
0,\n\t\"cloudProviderRateLimitQPSWrite\": 0,\n\t\"cloudProviderRateLimitBucketWrite\":
0,\n\t\"useInstanceMetadata\": false,\n\t\"loadBalancerSku\": \"basic\",\n\t\"excludeMasterFromStandardLB\":
false,\n\t\"disableOutboundSNAT\": null,\n\t\"maximumLoadBalancerRuleCount\":
0\n}\n"
endpoints: '{"name":"HybridEnvironment","managementPortalURL":"","publishSettingsURL":"","serviceManagementEndpoint":"","resourceManagerEndpoint":"https://{CENSORED}","activeDirectoryEndpoint":"https://login.microsoftonline.com/","galleryEndpoint":"https://{CENSORED}/","keyVaultEndpoint":"https://{CENSORED}","managedHSMEndpoint":"","graphEndpoint":"https://graph.windows.net/","serviceBusEndpoint":"","batchManagementEndpoint":"","microsoftGraphEndpoint":"","storageEndpointSuffix":"{CENSORED}","cosmosDBDNSSuffix":"","mariaDBDNSSuffix":"","mySqlDatabaseDNSSuffix":"","postgresqlDatabaseDNSSuffix":"","sqlDatabaseDNSSuffix":"","trafficManagerDNSSuffix":"","keyVaultDNSSuffix":"vault.mtcazs.wwtatc.com","managedHSMDNSSuffix":"","serviceBusEndpointSuffix":"","serviceManagementVMDNSSuffix":"","resourceManagerVMDNSSuffix":"","containerRegistryDNSSuffix":"","tokenAudience":"https://{CENSORED}","apiManagementHostNameSuffix":"","synapseEndpointSuffix":"","datalakeSuffix":"","resourceIdentifiers":{"graph":"","keyVault":"","datalake":"","batch":"","operationalInsights":"","ossRDBMS":"","storage":"","synapse":"","serviceBus":"","sqlDatabase":"","cosmosDB":"","managedHSM":"","microsoftGraph":""}}'
kind: ConfigMap
metadata:
creationTimestamp: null
name: cloud-provider-config
namespace: openshift-config
- Continue the installation by running `openshift-install create cluster`.
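Because the generated manifest stores the cloud configuration as an escaped JSON string inside YAML, the edit can also be scripted as a text substitution. The sketch below is one possible approach under that assumption; the `force_standard_vm_type` helper and the `fragment` stand-in are illustrative, not part of the installer:

```python
import re

def force_standard_vm_type(manifest_text: str) -> str:
    """Force vmType to "standard" inside the escaped JSON of the manifest text."""
    pattern = re.compile(r'\\"vmType\\": \\"[^"\\]*\\"')
    # A callable replacement avoids re.sub's escape processing on backslashes.
    return pattern.sub(lambda _: r'\"vmType\": \"standard\"', manifest_text)

# Stand-in for a fragment of manifests/cloud-provider-config.yaml.
fragment = r'config: "{\n\t\"cloud\": \"AzureStackCloud\",\n\t\"vmType\": \"\"}"'
print(force_standard_vm_type(fragment))
```

Editing the file by hand, as the steps above describe, achieves the same result; the script form is only useful when automating many installs.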
Root Cause
A recent change in the Azure cloud controller manager set a new default virtual machine type of Virtual Machine Scale Set (VMSS). This change is benign when using the cloud controller manager on Azure, but when deploying to Azure Stack Hub an incompatibility arises between the default virtual machine type and the load balancer type that Azure Stack Hub provides.
Azure Stack Hub uses “basic” style load balancers instead of “standard” style, as defined by Azure. Virtual machines of the VMSS type are not allowed to be added to “basic” style load balancers. When the changes to the default virtual machine type are combined with the load balancer rules for Azure Stack Hub, this results in a scenario where new nodes are not added to the primary load balancer for the cluster.
Diagnostic Steps
To determine if a cluster is affected by this issue, the logs for the Azure cloud controller manager should be examined. Clusters that are affected by this issue will contain error log lines in their cloud controller managers that are similar to the following:
I1016 10:47:54.294023 1 azure_vmss.go:1497] EnsureHostsInPool skips node testing-worker-0 because VMAS nodes couldn't be added to basic LB with VMSS backends
To find the logs for the cloud controller manager, follow these instructions:
- Find the cloud controller manager pods: run `oc get pods -n openshift-cloud-controller-manager` and look for pod names that begin with `azure-cloud-controller-manager`.
- Find the active cloud controller manager by examining the logs and looking for the leader-election lines. Managers which are not the active controller will have recent log lines similar to `failed to acquire lease openshift-cloud-controller-manager/cloud-controller-manager`.
- Examine the logs of the active manager for lines that contain the substring `EnsureHostsInPool skips node`.
- If the cloud controller manager logs contain such a line, then the cluster is affected.
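When many clusters or pods need checking, the scan above can be automated against log output captured with `oc logs`. The following is a small sketch; the `affected_nodes` helper is hypothetical, and only the log signature itself comes from the article:

```python
import re

# Log signature reported by an affected Azure cloud controller manager.
SKIP_PATTERN = re.compile(r"EnsureHostsInPool skips node (\S+)")

def affected_nodes(log_lines):
    """Return the node names the cloud controller manager skipped."""
    nodes = []
    for line in log_lines:
        match = SKIP_PATTERN.search(line)
        if match:
            nodes.append(match.group(1))
    return nodes

logs = [
    "I1016 10:47:54.294023 1 azure_vmss.go:1497] EnsureHostsInPool skips node "
    "testing-worker-0 because VMAS nodes couldn't be added to basic LB with VMSS backends",
]
print(affected_nodes(logs))  # ['testing-worker-0']
```

An empty result means the signature was not found in the supplied logs, in which case the cluster is likely not affected by this particular bug.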