Planning A Disaster Recovery Strategy On Microsoft Azure - Defining Recovery Requirements


Welcome to the series on designing a disaster recovery strategy on Microsoft Azure. This initial module covers the prerequisites: how to define recovery requirements. In the following modules, we will talk about how to deploy resiliency strategies. After that, we will look at how to work with data backup on the Azure cloud. Then we will look at failure analysis for Azure applications. Then we will focus on creating a replication plan, and finally, we will wrap up the entire course. In this module, we will talk about designing and implementing disaster recovery for Azure applications. RTO, RPO, and RLO are some of the terms we will go over. We will also look at how disaster recovery works for some of the most popular Azure PaaS services, such as Azure App Service, Azure SQL Database, Cosmos DB, and storage accounts. Then we will look at how to use the Traffic Manager service.

Defining Recovery Requirements

Resiliency checklist for specific Azure services

We will focus on terms like resiliency and disaster recovery, as well as explore several popular Azure services. We will begin with an explanation of two key terms: disaster recovery and resiliency.

Disaster recovery

Disaster recovery describes the processes used to restore the solution's availability after an outage. It refers to the process of returning systems and data to a previously acceptable state after a partial or complete failure caused by natural or technical events. Consider the possibility that someone removed a table in a database or that our web API stopped working for no apparent reason. From here, we must do everything possible to restore the solution's availability. As a result, we must not only determine what occurred but also make the solution available to end users again.


Resiliency

Resiliency is the ability of a system to recover from failures and continue to function after a failure has occurred. It is important to keep in mind that each technology has its own set of failure modes when creating and deploying application solutions.

There are some important questions to ask related to resiliency and disaster recovery:

  • How much do we invest in making our application highly available?
  • How much does potential downtime cost our business?
  • What are our customers' availability requirements?

How to implement resiliency?

Determining subscription and service requirements

Determining subscription and service requirements entails a number of key steps. Certain resources, such as the number of resource groups, cores, and storage accounts, are limited in every Azure subscription. If our application requirements exceed the limits of a subscription, we have the option of creating another Azure subscription and provisioning sufficient resources there.

Apply resilience strategies

One resilience strategy is to retry transient failures. Transient failures can be caused by a temporary lack of network connectivity, a dropped database connection, or a timeout when a service is busy, and are usually fixed by retrying the request. Another strategy is to use asynchronous operations as often as feasible. While the caller waits for the process to finish, synchronous operations can monopolize resources and block other operations. Whenever feasible, design each part of your application to allow for asynchronous operations.
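
The retry strategy can be sketched in a few lines of Python. This is a minimal illustration, not code from any Azure SDK: `TransientError`, `retry_transient`, and `flaky_query` are hypothetical names, and the attempt count and delays are arbitrary assumptions. It retries only failures marked as transient and waits longer between each attempt (exponential backoff with a little random jitter).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def retry_transient(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on every attempt; jitter spreads out simultaneous retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a hypothetical query that fails twice, then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("database connection dropped")
    return "42 rows"


# Sleeping is stubbed out here so the example finishes instantly.
result = retry_transient(flaky_query, sleep=lambda seconds: None)
```

Non-transient errors are deliberately not caught, so a genuine bug or a bad request fails immediately instead of being retried.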

Plan for usage patterns

Identify differences in requirements during critical and non-critical periods. Are there any times when the system must be up and running? For example, a tax filing application can't fail during filing deadlines, and a video streaming service shouldn't lag during live events. In each situation, weigh the cost against the risk.

Identify distinct workloads

Identify the distinct workloads in the solution. Multiple application workloads are common in cloud solutions. A workload is a distinct capability or task that is logically separated from other tasks in terms of business logic and data storage requirements. For example, an e-commerce app may have the following workloads: browse and search a product catalog, create and track orders, and view recommendations. Each workload has different requirements for availability, scalability, data consistency, and disaster recovery. Make your business decisions by balancing the cost against the risk for each workload.

Operate in multiple regions

If your application is deployed to a single region, then in the rare event that the entire region becomes unavailable, your application will be unavailable as well. This may be unacceptable under the terms of your application's SLA. If that is the case, think about deploying your application and its services across multiple regions.

Monitor third-party services

If your application relies on a third-party service, define where and how the service can fail, as well as the impact failures will have on your application.

Apply load balancing

To load balance traffic between regions, a traffic management solution such as Azure Traffic Manager is required. Load balancing distributes your application's requests to healthy service instances by removing unhealthy instances from rotation.
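
As a rough illustration of removing unhealthy instances from rotation, the Python sketch below hands out requests round-robin across only the instances that pass a health probe. It shows the idea, not how Azure Traffic Manager is implemented; the instance names, the `distribute` function, and the `is_healthy` probe are all hypothetical.

```python
from itertools import cycle


def healthy_rotation(instances, is_healthy):
    """Return the instances that pass the health probe, in their original order."""
    return [instance for instance in instances if is_healthy(instance)]


def distribute(requests, instances, is_healthy):
    """Assign each request to the next healthy instance in round-robin order."""
    rotation = healthy_rotation(instances, is_healthy)
    if not rotation:
        raise RuntimeError("no healthy instances available")
    return {request: instance for request, instance in zip(requests, cycle(rotation))}


# Usage: three hypothetical instances, one of which is failing its probe.
instances = ["app-northeurope", "app-westeurope", "app-francecentral"]
unhealthy = {"app-westeurope"}

assignments = distribute(
    ["r1", "r2", "r3"], instances, lambda instance: instance not in unhealthy
)
```

In this run, no request is assigned to `app-westeurope`; it would rejoin the rotation once its probe passes again.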

Identify possible failure points

Identify the system's potential failure points. Determine the types of failures that the application might experience and how the application responds to those failures.

Azure paired regions

An Azure region is an area within a geography that contains one or more data centers. Each Azure region is paired with another region within the same geography to form a regional pair. For example, North Europe and West Europe form a regional pair, so we can deploy our Azure resources in those two distinct regions.

Azure regional pairs

Multiple regional pairs are available in Azure. Here are some more examples: China North and China East, France Central and France South, and Germany Central and Germany Northeast. Each regional pair consists of two regions, and Azure data centers exist in each of those regions. For more details visit:

Planning a Disaster Recovery Strategy on Microsoft Azure

Azure availability zones

Availability Zones are unique physical locations within an Azure region. To ensure resiliency, there is a minimum of three separate zones in all enabled regions. For example, in an Azure region with three distinct availability zones, we can still use two of the zones if one becomes unavailable.

Azure PaaS Services

Azure App Service, Azure SQL Database, Azure Cosmos DB, and Azure Storage accounts are all PaaS services. We will not go into detail about the features that each service provides. Instead, we will talk about how to make these services more resilient and available.

Azure Web App (App services)

Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile backends. It supports a variety of languages and frameworks, including

  • Java
  • PHP

We can scale Azure App Service horizontally by adding instances and vertically by adding resources such as memory or CPU, and run it on a global scale with high availability.

Azure SQL Database

Azure SQL Database is a general-purpose relational database. It is an Azure managed service that allows you to handle both relational and non-relational structures like

  • JSON
  • XML

It also has advanced monitoring and troubleshooting capabilities.

Azure Cosmos DB

Cosmos DB is a globally distributed multi-model database service provided by Azure. It provides data access via a variety of APIs, like

  • MongoDB
  • SQL

It allows for elastic and independent scaling of throughput and storage across various Azure regions across the world.

Azure Storage Account

Microsoft Azure Storage is a cloud storage solution for modern data storage scenarios. Azure Storage includes these data services:

  • Azure blobs
  • Azure files
  • Azure Queues
  • Azure tables

Resiliency checklist for Azure App Service

Now, there is a specific resiliency checklist for each of these services. As an example, we should choose the Standard or Premium tier for Azure App Service. These tiers support staging slots and automated backups. Therefore, if something goes wrong during the deployment, we can revert to a previous version of the application.

It is also a good idea to avoid scaling up or down. Rather, we should choose a tier and instance size that match our performance needs under typical load, then scale out the instances to handle changes in traffic volume. Scaling up and down can trigger an application restart, which can be problematic for end users.

Create separate App Service plans for production and test. Slots on our production deployment should not be used for testing, because all apps in the same App Service plan share the same virtual machine instances. If we put production and test deployments in the same plan, it can have a detrimental impact on the production deployment. For example, load tests might degrade the live production site. Moving test deployments into a separate plan isolates them from the production version.

Diagnostic logging is another crucial aspect of our resiliency checklist. With logging enabled, we can track down bugs in our application and swiftly deliver patches. The resiliency checklist for Azure App Service contains a few additional points that are worth checking, but these are some key examples.

Resiliency checklist for Azure SQL Database

When it comes to the resiliency checklist for Azure SQL Database, there are a few key points to consider. The first is to use the Standard or Premium tier. These tiers provide a longer point-in-time restore period, of up to 35 days.

SQL Database auditing should also be enabled. Auditing is useful for diagnosing malicious attacks and human error. Active geo-replication should also be used to create a readable secondary in a separate region. We can do a manual failover to our secondary database if our primary database fails or needs to be taken offline. Until we fail over, the secondary database stays read-only.

Point-in-time restore should also be used to recover from human error. It returns our database to an earlier point in time. These elements are critical since our application becomes worthless without reliable database access.

Resiliency checklist for Azure Cosmos DB

When it comes to the resiliency checklist for Azure Cosmos DB, two key factors should be considered. First, the database should be replicated across regions. As a result, even if one region is unavailable, we can still access and read data from another. Second, we should enable multi-master (multi-region writes) if needed. In that case, we can write to multiple Azure regions. It means that even if one region is unavailable, data can still be written to another.

Resiliency checklist for Azure Storage Account

The resiliency checklist for Azure storage accounts also has some interesting points. We should use read-access geo-redundant storage (RA-GRS) for application data. This storage replicates data to a secondary region and allows read-only access from that location. If the primary region's storage fails, the application can read the data from the secondary region.

For virtual machine disks, we ought to use managed disks. Because the disks are sufficiently isolated from each other to avoid a single point of failure, this provides better reliability for virtual machines in availability sets. For Queue storage, a read-only replica is of limited use, so we should instead create a backup queue in a storage account in another region. If there is a storage outage, the application can use the backup queue until the primary region becomes available again.
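
The backup queue idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Azure Storage SDK code: `QueueUnavailable`, `enqueue_with_backup`, and the in-memory lists standing in for the two queues are all hypothetical.

```python
class QueueUnavailable(Exception):
    """Hypothetical error raised when a queue's storage account cannot be reached."""


def enqueue_with_backup(message, primary_send, backup_send):
    """Send to the primary queue; fall back to the backup queue during an outage."""
    try:
        primary_send(message)
        return "primary"
    except QueueUnavailable:
        backup_send(message)
        return "backup"


# Usage: plain lists stand in for queues in two different regions.
primary_queue, backup_queue = [], []
state = {"primary_up": False}  # simulate a storage outage in the primary region


def send_to_primary(message):
    if not state["primary_up"]:
        raise QueueUnavailable("primary region storage outage")
    primary_queue.append(message)


used = enqueue_with_backup("order-1001", send_to_primary, backup_queue.append)
```

A design consequence worth noting: whatever consumes the messages has to read from both queues, otherwise items written to the backup queue during the outage are never processed.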

Determine and Document RTO, RPO, and RLO Recovery Requirements

Recovery Time Objective (RTO)

Let us begin with the RTO (recovery time objective). This is the amount of time and the service level within which our business process must be recovered following a disaster to avoid unacceptable consequences associated with a break in continuity.

This means that in the event of a disaster, such as a system-wide infection or a user destroying production data, the RTO is the amount of time we have to recover from the crisis and restore data and applications. The recovery time objective is quite crucial. Consider the case of a banking application that goes down, for example because someone deleted the entire database or only the table with the users' data. Users who are unable to sign in to the application are unable to access vital data. We must do everything possible to return the system to a state where users can access it.

Recovery Point Objective (RPO)

The recovery point objective (RPO) refers to the amount of time that can pass during a disruption before the amount of data lost during that time exceeds the business continuity plan's maximum allowable threshold. For example, if the last good copy of data available during an outage is from 18 hours ago and the RPO for this business is 20 hours, we are still within the bounds of the business continuity plan's RPO. To put it another way, RPO answers this question: given the volume of data lost during that time, up to what point in time could the recovery of the business process tolerably proceed?
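
The 18-hour versus 20-hour example can be checked with simple date arithmetic. The Python sketch below is only an illustration: the function names and the outage timestamp are made up, while the 18-hour and 20-hour figures come from the example above.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy, outage_start):
    """Time between the last good copy and the outage, i.e. the data that is lost."""
    return outage_start - last_good_copy


def within_rpo(last_good_copy, outage_start, rpo):
    """True if the data lost since the last good copy stays within the RPO."""
    return data_loss_window(last_good_copy, outage_start) <= rpo


outage = datetime(2024, 1, 10, 18, 0)        # hypothetical outage time
last_copy = outage - timedelta(hours=18)     # last good copy is 18 hours old
rpo = timedelta(hours=20)                    # RPO of the business

ok = within_rpo(last_copy, outage, rpo)                          # 18 h <= 20 h
too_old = within_rpo(outage - timedelta(hours=26), outage, rpo)  # 26 h > 20 h
```

Here `ok` is true because 18 hours of lost data fits inside the 20-hour RPO, while a 26-hour-old copy would violate it. The same comparison shows why the backup interval has to be shorter than the RPO.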

Recovery Level Objective (RLO)

RLO specifies the level of granularity with which data must be recovered, such as the entire instance, a database or group of databases, selected tables, or the entire system. For example, we must determine whether we need to restore the web application, recover the database structure, or recover the entire system.
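
One way to document the three objectives together is a small record per workload. The Python sketch below is an assumption about how such a record could look, not a format prescribed by Azure; the class names, the workload name, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class RecoveryLevel(Enum):
    """Granularity at which data must be recoverable (RLO)."""
    TABLES = "selected tables"
    DATABASE = "database"
    INSTANCE = "entire instance"
    SYSTEM = "entire system"


@dataclass(frozen=True)
class RecoveryRequirements:
    workload: str
    rto: timedelta      # maximum acceptable time to recover
    rpo: timedelta      # maximum acceptable window of lost data
    rlo: RecoveryLevel  # granularity of recovery


def drill_meets_requirements(requirements, downtime, data_loss):
    """Compare the outcome of a recovery drill against the documented RTO and RPO."""
    return downtime <= requirements.rto and data_loss <= requirements.rpo


# Hypothetical requirements for an order-tracking workload.
orders = RecoveryRequirements(
    workload="create and track orders",
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
    rlo=RecoveryLevel.DATABASE,
)

passed = drill_meets_requirements(
    orders, downtime=timedelta(minutes=40), data_loss=timedelta(minutes=5)
)
```

Keeping one record per workload makes it explicit that different workloads can carry different objectives.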

Backup and Disaster Recovery for Azure Applications

Disaster recovery plan

Many Azure services have built-in resiliency and availability characteristics. The disaster recovery plan is likely to improve if each service is evaluated separately. Azure SQL Database, for example, supports geo-replication. We may still access the data from a second Azure region if the data becomes unavailable in one Azure region. That is quite beneficial. Another example is Azure Cosmos DB, which allows us to enable geo-replication as well as write to several regions. It is also a good idea to have a disaster recovery plan ready for when disaster strikes once our solution is live. Let us look at some of the aspects of such a plan.

  • Evaluate the business impact of application failures
  • Automate the process as much as possible
  • Document the process, especially any manual steps
  • Choose a cross-region recovery architecture
  • Perform regular disaster simulations to validate and improve the plan

First, we must assess the business impact of application failures. For example, we should answer a basic question: what happens if the application stops working? We will almost certainly lose money. We should also try to automate as much of the process as feasible. For example, we should have automated release pipelines in place to allow for quick recovery. We should also document the process, particularly any manual steps, and give the team members explicit instructions on how to recover from a failure. We should also select a cross-region recovery architecture. As already mentioned, we should deploy our solution across multiple regions to enable disaster recovery. Finally, we should run disaster simulations regularly to test and improve the plan.

Multiple Azure regions for high availability

Consider the use of multiple Azure regions to achieve high availability. In this setup, we have web apps in two regions, along with SQL Database, Cosmos DB, and Azure Storage. When the web application in the first region goes down, Traffic Manager can route traffic to the second region, giving our solution maximum availability.
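
The failover described above follows a priority order: use the first region while it is healthy, otherwise move on to the next one. The Python sketch below illustrates that decision only. It is not Traffic Manager configuration or SDK code, and the endpoint names and `pick_endpoint` function are hypothetical.

```python
def pick_endpoint(endpoints_by_priority, is_healthy):
    """Return the first healthy endpoint, checking them in priority order."""
    for endpoint in endpoints_by_priority:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy region available")


# Usage: two hypothetical web app deployments, listed in priority order.
regions = ["webapp-region1", "webapp-region2"]
down = set()


def probe(endpoint):
    return endpoint not in down


before = pick_endpoint(regions, probe)   # first region is healthy
down.add("webapp-region1")               # simulate an outage in the first region
after = pick_endpoint(regions, probe)    # traffic now goes to the second region
```

When the first region recovers and passes its probe again, the same function sends traffic back to it.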

Data corruption and restoration

It is a good idea to keep backups when dealing with data corruption and restoration. Backups defend against the loss of an application component due to data corruption or inadvertent deletion, and the frequency with which the backup procedure runs sets the recovery point objective (RPO). Azure automatically stores Azure Storage and SQL Database data three times in different fault domains in the same region. However, if data is corrupted or deleted in the primary copy, those changes are replicated to the other copies, which is why backups are still needed.

Let us have a look at some high-availability features.

Azure App Service can scale out to 30 virtual machine instances. In the Standard and Premium tiers, it also supports staging slots and automated backups. With deployment slots, we can keep the last known good deployment. If an issue is discovered later, there is an easy way to go back to the last known good version.

Azure Cosmos DB supports replicating data across regions. We can still access data from another region if data in one region is unavailable, because the data was replicated to the second region. Azure Cosmos DB also supports multiple write regions. It means that even if one region is inaccessible, we may still write data to another. After a failover, the client SDK sends write requests to the current write region.

When it comes to high availability with Azure SQL Database, active geo-replication is available, which allows you to create a readable secondary replica. There can be up to four readable secondary replicas. If the primary database fails or needs to be taken offline, we can fail over to a secondary database. Until then, the secondary provides read-only access.

For Azure storage accounts, read-access geo-redundant storage (RA-GRS) provides high availability. It replicates the data to a secondary region and allows read-only access to the data in the secondary region.

If the primary region's storage fails, the application can read data from the secondary region. To ensure durability and high availability, the data in a Microsoft Azure storage account is always replicated.

Wrap Up

In this article, we discussed the resiliency checklists for specific Azure services: Azure App Service, Azure SQL Database, Cosmos DB, and storage accounts. We also explained terms like RTO, RPO, and RLO. In the next article, we will look into the details of data backups on Microsoft Azure.