Disaster Recovery for Ansible Automation Platform on Azure

Ansible Automation Platform (AAP) on Microsoft Azure provides optional additional backup capabilities through a multi-region model for supported Azure regions. This optional feature is enabled on the “Business Continuity” step during the deployment of the managed application. Enabling this option backs up AAP data from the primary Azure region to a secondary Azure region and incurs additional Azure infrastructure costs for storage in the secondary region. If the option was not selected during deployment, customers can open a support request to have the feature enabled on their instance.

The disaster recovery feature activates the replication of storage between a primary region and its assigned paired region. AAP on Azure uses Microsoft-defined regional pairs to implement this solution. For a list of Azure region pairs, see Azure Cross-Region Replication (https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure).

Disaster recovery is not synonymous with high availability: a loss of service and data can occur when the primary region is impacted.

What is the regional support for disaster recovery?

The disaster recovery capability is not supported in all Azure regions. Before deployment, customers should consult the regional support matrix below to verify that their desired region is supported. While nightly backups in the primary region are standard for all instances, a secondary region further reduces risk if a catastrophic event affects the primary Azure region. We are continuously working with Microsoft to expand this capability as they add support for more regions.

Main Region | Multi-Region Disaster Recovery (Y/N) | Paired Backup Region
Australia East | Y | Australia Southeast
Australia Southeast | Y | Australia East
Brazil South | Y | South Central US
Canada Central | Y | Canada East
Canada East | Y | Canada Central
Central India | Y | South India
Central US | Y | East US 2
Chile Central | N | n/a
East Asia | Y | Southeast Asia
East US | Y | West US
East US 2 | Y | Central US
France Central | Y | France South
Germany West Central | Y | Germany North
Indonesia Central | N | n/a
Israel Central | N | n/a
Italy North | N | n/a
Japan East | Y | Japan West
Japan West | Y | Japan East
Korea South | Y | Korea Central
Korea Central | Y | Korea South
Malaysia West | N | n/a
Mexico Central | N | n/a
New Zealand North | N | n/a
North Central US | Y | South Central US
North Europe | Y | West Europe
Norway East | Y | Norway West
Poland Central | N | n/a
Qatar Central | N | n/a
South Africa North | Y | South Africa West
South Central US | Y | North Central US
South India | Y | Central India
Southeast Asia | Y | East Asia
Spain Central | N | n/a
Sweden Central | Y | Sweden South
Switzerland North | Y | Switzerland West
UAE North | Y | UAE Central
UK South | Y | UK West
UK West | Y | UK South
West Central US | Y | West US 2
West Europe | Y | North Europe
West US | Y | East US
West US 2 | Y | West Central US
West US 3 | Y | East US
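
For scripted pre-deployment checks, the support matrix above can be captured as a simple lookup. The following Python sketch is illustrative only: the dictionary is transcribed by hand from the table at the time of writing (reading the truncated East US 2 entry as Central US), and the function names are hypothetical rather than part of any Red Hat or Azure tooling. The matrix in this article remains the authoritative source.

```python
from typing import Optional

# Hand-transcribed from the support matrix above: primary region -> paired
# backup region. None means multi-region disaster recovery is not supported.
PAIRED_REGION: dict[str, Optional[str]] = {
    "Australia East": "Australia Southeast",
    "Australia Southeast": "Australia East",
    "Brazil South": "South Central US",
    "Canada Central": "Canada East",
    "Canada East": "Canada Central",
    "Central India": "South India",
    "Central US": "East US 2",
    "Chile Central": None,
    "East Asia": "Southeast Asia",
    "East US": "West US",
    "East US 2": "Central US",
    "France Central": "France South",
    "Germany West Central": "Germany North",
    "Indonesia Central": None,
    "Israel Central": None,
    "Italy North": None,
    "Japan East": "Japan West",
    "Japan West": "Japan East",
    "Korea South": "Korea Central",
    "Korea Central": "Korea South",
    "Malaysia West": None,
    "Mexico Central": None,
    "New Zealand North": None,
    "North Central US": "South Central US",
    "North Europe": "West Europe",
    "Norway East": "Norway West",
    "Poland Central": None,
    "Qatar Central": None,
    "South Africa North": "South Africa West",
    "South Central US": "North Central US",
    "South India": "Central India",
    "Southeast Asia": "East Asia",
    "Spain Central": None,
    "Sweden Central": "Sweden South",
    "Switzerland North": "Switzerland West",
    "UAE North": "UAE Central",
    "UK South": "UK West",
    "UK West": "UK South",
    "West Central US": "West US 2",
    "West Europe": "North Europe",
    "West US": "East US",
    "West US 2": "West Central US",
    "West US 3": "East US",
}


def paired_backup_region(primary: str) -> Optional[str]:
    """Return the paired backup region, or None if DR is not supported."""
    if primary not in PAIRED_REGION:
        raise KeyError(f"{primary!r} is not listed in the support matrix")
    return PAIRED_REGION[primary]


def supports_multi_region_dr(primary: str) -> bool:
    """True when the matrix lists a paired backup region for this region."""
    return paired_backup_region(primary) is not None
```

Note that the pairing is not always symmetric: West US 3 backs up to East US, while East US backs up to West US.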

How does disaster recovery work?

A nightly backup of the managed application is placed on Azure storage for replication. During a recovery, this backup is loaded into a new deployment of Ansible Automation Platform in a non-impacted region. The amount of time required to recover an instance depends on the amount of data being recovered and the availability of Azure resources.

How does my application recover from an event?

The following steps should be taken if your managed application's region is experiencing a service-impacting event:

  1. Deploy a new instance of the managed application to a region of your choice. We recommend you use the region pair of your primary region. You must deploy the second instance of the managed application using the same Azure subscription as your primary instance.
    • Note: To ensure smooth data migration, do not set up any network configurations (such as VNet peering) until the data migration is successfully completed and verified. Once the SRE team confirms a successful recovery, you may proceed with network setup.
  2. Contact Red Hat customer support indicating your managed application's region has failed and your managed application needs to be recovered. Provide the following information:
    • Name of the instance impacted
    • Name of the new instance
    • Azure Subscription ID
    • Contact information for rapid collaboration
  3. Red Hat Site Reliability Engineers (SRE) will prioritize the recovery operation. The time required for a full recovery depends on the availability of Azure resources and the amount of data to recover.
  4. A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. Any issues with the new instance will be prioritized and addressed promptly.
  5. Perform additional configuration. The new deployment may need additional configuration to be fully functional. You may need to adjust your DNS if you use a custom domain, reconfigure automation mesh nodes, and adjust firewall rules in your network. Some Azure infrastructure configuration options are not preserved when an instance is recovered, so a request to the SRE team is necessary to implement them on the new deployment. Examples of these options include:
    • Azure Private DNS resolver configured on the Virtual Network.
    • Deployment of an Event Hub for exporting logs.
    • Any special adjustment to the Azure infrastructure not captured by the database or an application level configuration.
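
The details requested in step 2 can be collected ahead of time so that a support case can be opened quickly during an outage. The sketch below is only one way to do that: the `RecoveryRequest` class and its field names are hypothetical, not a Red Hat API, and the check that the subscription ID parses as a UUID assumes the usual GUID format of Azure subscription IDs.

```python
import uuid
from dataclasses import dataclass, fields


@dataclass
class RecoveryRequest:
    """Information to provide to Red Hat support (hypothetical helper)."""

    impacted_instance: str   # name of the instance impacted
    new_instance: str        # name of the newly deployed instance
    subscription_id: str     # Azure Subscription ID shared by both instances
    contact: str             # contact information for rapid collaboration

    def validate(self) -> None:
        # Every field listed in step 2 must be filled in.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"{f.name} must not be empty")
        # Assumption: subscription IDs are GUIDs; uuid.UUID raises
        # ValueError for strings that do not parse as one.
        uuid.UUID(self.subscription_id)

    def summary(self) -> str:
        """Render the request as plain text for pasting into a support case."""
        self.validate()
        return "\n".join(
            [
                f"Impacted instance: {self.impacted_instance}",
                f"New instance: {self.new_instance}",
                f"Azure Subscription ID: {self.subscription_id}",
                f"Contact: {self.contact}",
            ]
        )
```

Calling `summary()` on a request with a blank field or a malformed subscription ID raises `ValueError`, which catches incomplete requests before they are sent.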

The following estimates can help set expectations if a disaster recovery event occurs.

Task Description | Who? | Estimated Time
Contact Red Hat customer support and raise a Sev 1 case if the event is happening during a product outage of the original region. | Customer | See Premium Support SLAs
*Deploy a new instance of the managed application. | Customer | ~1.5 hours
Red Hat Site Reliability Engineers will initiate the recovery operation. | SRE | ~2 hours
**A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. | Red Hat Support | When the recovery operation is complete, within premium support SLA

*Customers will need to perform “post deployment network configuration” steps on the new environment, such as VNet peering and routing rule definitions. Expect this to take about as long as it did during the initial implementation of Ansible Automation Platform on Microsoft Azure. For details, refer to the documentation on customer responsibilities for Ansible Automation Platform on Microsoft Azure.

**Disaster recovery estimates differ with the data volume and the network and traffic configuration of each customer's environment, and depend on the following variables:

  • The database size of the site being recovered (including job history, inventory, and other AAP data).
  • The number of collections stored in the Private Automation Hub.
  • The number of execution environments (EEs) stored in Private Automation Hub.
  • Whether network routing between regions must be reconfigured, depending on how traffic is redirected.
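
As a rough planning aid, the fixed estimates from the table can be added up. The Python sketch below assumes the steps run one after another and treats support response time and post-deployment network configuration as caller-supplied values, because this article gives no fixed figure for either. The function and constant names are illustrative, and the result is a floor, not a commitment.

```python
from datetime import timedelta

# Estimates taken from the table above.
DEPLOY_NEW_INSTANCE = timedelta(hours=1.5)   # customer deploys a new instance
SRE_RECOVERY = timedelta(hours=2)            # SRE recovery operation


def estimated_recovery_floor(
    support_response: timedelta = timedelta(0),
    network_configuration: timedelta = timedelta(0),
) -> timedelta:
    """Sum the sequential steps; the optional arguments are the caller's estimates."""
    return (
        DEPLOY_NEW_INSTANCE
        + SRE_RECOVERY
        + support_response
        + network_configuration
    )
```

With no extra allowances, the floor is 3.5 hours. Actual recovery time grows with database size, the number of collections and execution environments, and any network reconfiguration.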

How can disaster recovery be tested?

A disaster recovery test can be scheduled by submitting a support request asking for one. Customers will need to follow the instructions above to perform the test, which requires deploying a new instance of the managed application. Disaster recovery testing is non-destructive to the original instance. Azure will charge for the infrastructure costs related to the test instance. If the customer fails to remove the test instance within 48 hours (see Azure Marketplace policy for detailed information), the customer will also be charged for the software subscription.

If the disaster recovery instance is tested by peering it with the customer's network, using the same network configuration as the primary instance will prevent use of the primary instance. To avoid issues, automation mesh nodes should not be connected to both deployments at the same time.

The instance used for the disaster recovery test can be deleted when the testing is complete without impact to the original instance.

There is a limit of one disaster recovery test every six months.
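
The two time limits that apply to testing (the 48-hour removal window and the six-month interval between tests) can be tracked with a few lines of Python. This sketch makes two assumptions that the article does not state: the 48 hours are counted from the moment the test instance is deployed, and "six months" means six calendar months. The function names are hypothetical.

```python
import calendar
from datetime import date, datetime, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def earliest_next_test(last_test: date) -> date:
    """Earliest date for another test, assuming a six-calendar-month interval."""
    return add_months(last_test, 6)


def removal_deadline(deployed_at: datetime) -> datetime:
    """Latest removal time before subscription charges, assuming the
    48-hour window starts at deployment."""
    return deployed_at + timedelta(hours=48)
```

For example, under these assumptions a test instance deployed at 09:00 on 15 January would need to be removed by 09:00 on 17 January, and the next test could be requested from 15 July.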

