Planning A Disaster Recovery Strategy On Microsoft Azure - Designing Geo-replication Strategy


Previously we talked about Working with Data Backup in Azure. In this article, we will discuss how Azure PaaS services handle disaster recovery and failover. We'll examine how to deploy an existing Azure web application and Azure SQL Database across multiple Azure regions.

Failure Mode Analysis for Azure Applications (FMA)

Failure Mode Analysis (FMA) is a method of building resilience into a system by identifying the places where it can fail. Consider the following scenario: our solution comprises a web application with an Azure SQL database to hold user data. JSON documents will be stored in Azure Cosmos DB, and images will be stored in Azure Blob storage. We use failure mode analysis to identify the potential failure points in each of these components.

General Process to conduct an FMA

First, all of the system's components must be identified. In our scenario, we have a web application, a web API, an Azure SQL database, and a Cosmos DB database. For each component, we must identify potential failure modes and determine whether they can be remedied; for example, the web application may become unresponsive. Each failure mode should also be rated by its overall risk: what will happen, and what will the consequences be, when that component breaks? Finally, for each failure mode, we should determine how the application would behave and how it would recover.
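The process above can be sketched as a small inventory that ranks failure modes by risk. The component names, failure descriptions, and the 1–3 likelihood/impact ratings below are illustrative assumptions, not values from the article or an official catalog:

```python
# Hypothetical FMA inventory for the example solution described above.
failure_modes = [
    {"component": "Web app", "failure": "App Service instance crashes", "likelihood": 2, "impact": 3},
    {"component": "Azure SQL Database", "failure": "Connection failures", "likelihood": 3, "impact": 3},
    {"component": "Cosmos DB", "failure": "Throttled requests (HTTP 429)", "likelihood": 3, "impact": 2},
    {"component": "Blob storage", "failure": "Transient write failures", "likelihood": 2, "impact": 2},
]

def risk_score(mode: dict) -> int:
    """Rate a failure mode's overall risk as likelihood x impact."""
    return mode["likelihood"] * mode["impact"]

# Rank the failure modes so the riskiest ones get a recovery plan first.
ranked = sorted(failure_modes, key=risk_score, reverse=True)
```

Ranking by a simple likelihood-times-impact score is one common convention; teams often substitute their own risk matrix.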

App Service Shutdown

An App Service app can shut down for two reasons: it was unloaded because it was idle, or the app crashed, resulting in an abrupt shutdown. In either case, we should ask what the recovery plan and diagnostics are.

Recovery and Diagnostics

If the application was unloaded while idle, it is restarted automatically on the next request. However, to cover app crashes and App Service virtual machines becoming unavailable, we can enable the Always On setting to prevent the application from being unloaded while idle; App Service then restarts the application automatically. To ensure that we can trace all issues and handle them promptly, we should also enable diagnostics logging for the web app in App Service.

There could be a problem with the SQL database connection. For queries, we should read from a secondary replica; the database must be configured for active geo-replication in this scenario. In the application code, we should also catch two types of exceptions: InvalidOperationException and SqlException.
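The read-fallback pattern can be sketched as follows. `SqlConnectionError` and the two query callables are hypothetical stand-ins for a real driver's connection error and for connections to the primary database and its geo-replicated secondary:

```python
class SqlConnectionError(Exception):
    """Stand-in for a driver-level connection failure (illustrative)."""

def query_with_replica_fallback(primary_query, secondary_query, sql):
    """Run a read-only query against the primary; on a connection
    failure, fall back to the active geo-replicated secondary."""
    try:
        return primary_query(sql)
    except SqlConnectionError:
        # Safe only for reads: the geo-replicated secondary is read-only.
        return secondary_query(sql)
```

Writes cannot be redirected this way; they require a failover that promotes the secondary to primary.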

Writes to the storage account may also fail. To recover from transient failures, we should retry the operation; the storage SDK handles this automatically when we configure a retry policy on the storage client. If N retry attempts fail, we should fall back gracefully, for example by saving the data in a local cache.
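A minimal sketch of retry-with-backoff plus the graceful local-cache fallback, assuming a hypothetical `TransientStorageError` in place of the SDK's real transient-error types:

```python
import time

class TransientStorageError(Exception):
    """Stand-in for a transient storage failure (illustrative)."""

local_cache = {}  # graceful-fallback store for data we could not write

def write_with_retry(write_fn, key, data, max_attempts=3, base_delay=0.01):
    """Retry a storage write with exponential backoff; if every
    attempt fails, keep the data in a local cache for later replay."""
    for attempt in range(max_attempts):
        try:
            return write_fn(key, data)
        except TransientStorageError:
            time.sleep(base_delay * (2 ** attempt))
    local_cache[key] = data  # graceful fallback after N failed retries
    return None
```

In production the Azure Storage SDK's built-in retry policy would replace the hand-rolled loop; only the fallback step is application code.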

When it comes to Azure Cosmos DB read and write problems, we should also use metrics to figure out what went wrong. Two types of exceptions should be caught:

  •  HttpRequestException
  •  DocumentClientException

We should also replicate the Cosmos DB database across two or more regions. The SDK automatically retries failed requests, but we should still check the HTTP status: when Cosmos DB throttles the client, it returns HTTP 429 errors. When possible, persist the document to a backup queue and process the queue later. Also, log all errors on the client side.
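The throttling path can be sketched like this; `cosmos_write` is a hypothetical callable returning an HTTP status code, standing in for the real SDK call:

```python
from collections import deque

backup_queue = deque()  # documents to re-process once throttling eases

def write_document(cosmos_write, doc, max_attempts=3):
    """Retry a document write while the service throttles us (HTTP 429);
    after the last attempt, persist the document to a backup queue."""
    for _ in range(max_attempts):
        status = cosmos_write(doc)
        if status in (200, 201):
            return True
        if status != 429:
            raise RuntimeError(f"unexpected HTTP status {status}")
    backup_queue.append(doc)  # a background worker drains this later
    return False
```

A real implementation would also honor the retry-after interval the service returns with a 429 response instead of retrying immediately.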

Web Application in Multiple Azure Regions for High Availability

To achieve high availability, we will deploy the web application across multiple Azure regions. This diagram shows web apps deployed in two separate regions, with the Azure SQL database, Cosmos DB, and the storage account replicated in each. Azure Front Door sits in front of those regions and load balances traffic from end users; when one region is unavailable, Azure Front Door can redirect traffic to the other.


During normal operations, the application is deployed to each region and traffic is routed to the primary region. If the primary region becomes unavailable, traffic is directed to the secondary region. Azure Front Door is used to load balance the traffic, and data stores such as Cosmos DB have geo-replication enabled, so if the primary region is affected, Azure Front Door can fail over to the secondary region.
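Front Door's priority-based failover can be approximated with a small selection function. The endpoint records, names, and health flags below are assumptions for illustration, not Front Door's actual configuration schema:

```python
def pick_endpoint(endpoints):
    """Pick the healthy endpoint with the lowest priority number,
    so traffic goes to the primary until its health probe fails."""
    healthy = [e for e in endpoints if e["healthy"]]
    return min(healthy, key=lambda e: e["priority"]) if healthy else None

# Two regional backends: primary gets priority 1, secondary priority 2.
endpoints = [
    {"name": "primary-eastus", "priority": 1, "healthy": False},
    {"name": "secondary-westus", "priority": 2, "healthy": True},
]
```

With the primary marked unhealthy, the function selects the secondary; once the primary's health probe recovers, traffic returns to it automatically.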

Architectural recommendations

Azure Front Door is configured to send all requests to the primary region unless the endpoint for that region becomes unreachable. It is recommended to use regional pairing and prioritize regions within the same regional pair (for example, North Europe and West Europe). For the SQL database, use active geo-replication so we can fail over to a secondary database if the primary database fails or needs to be taken offline. Azure Cosmos DB supports cross-region geo-replication, and with read-access geo-redundant storage (RA-GRS) we can read Azure Storage data from a secondary location.

With geo-redundant storage, the data is copied to a secondary region. Thanks to all of these steps, our system will be more resilient, and handling failover will be much easier.


We discussed how failure mode analysis is performed for Azure applications, and how web applications are deployed across multiple Azure regions for high availability.