Architectural Design Goals - Availability

Let us start with some definitions.
 

Down Time

 
The period during which your application is down and not available for use is called "Down Time". Downtime is especially costly when it falls within the application's usage hours (i.e., production hours).
 
An application can become unavailable for many reasons: the hardware can fail, the operating system can fail, or the application itself can become non-responsive.
 

What is "Availability" then?

 
Availability means continuing to provide services to the end-user even while failures are happening; in other words, reducing "Down Time" to approximately zero. Normally, you want your application to be available 24 x 7, which is not a simple goal to achieve.
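To make the relation between "Down Time" and "Availability" concrete, here is a minimal sketch that expresses availability as a percentage of a period. The function names and the 365-day year are assumptions made for illustration; they are not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8760 hours


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Share of the period during which the application was usable, in percent."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return 100.0 * (period_hours - downtime_hours) / period_hours


def allowed_downtime_hours(target_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget that still meets the given availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target must be a percentage between 0 and 100")
    return period_hours * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Under these assumptions, a target of 99.9% leaves a budget of roughly 8.76 hours of downtime per year, which shows how quickly the budget shrinks as the target approaches 24 x 7.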
 

Clustering

 
Clustering means grouping multiple similar objects (servers, in this case). A server in a cluster is also called a "node".
Now, let us see some options to increase the availability of your application.
 
Fail-over Cluster
 
It is a cluster (group) of servers that work together. Requests are normally served by one server, the primary. If that server goes down, another server takes over control and starts serving the requests. This process of taking over control when the primary server is not available is called fail-over. Depending on the level of "availability" you want, you may have multiple fail-over servers, so that if the second server goes down, the third takes control, and so on.
  • Advantages
    High Availability

  • Disadvantages
    Higher cost for the additional servers and a higher maintenance cost.
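The fail-over behaviour described above can be sketched as follows. This is only an illustration under simple assumptions: the `Node` and `FailoverCluster` classes are hypothetical, a node's failure is simulated with a flag, and a real cluster would detect failures through heartbeats or health checks.

```python
from typing import Iterable, List


class Node:
    """Hypothetical server in the cluster; 'healthy' simulates its state."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class FailoverCluster:
    """Tries the primary first, then each standby in the configured order."""

    def __init__(self, nodes: Iterable[Node]) -> None:
        self.nodes: List[Node] = list(nodes)
        if not self.nodes:
            raise ValueError("a cluster needs at least one node")

    def handle(self, request: str) -> str:
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # fail-over: move on to the next node
        raise RuntimeError("all nodes in the cluster are down")


if __name__ == "__main__":
    cluster = FailoverCluster([Node("primary"), Node("standby-1"), Node("standby-2")])
    print(cluster.handle("request-1"))   # served by the primary
    cluster.nodes[0].healthy = False     # simulate a primary failure
    print(cluster.handle("request-2"))   # served by the first standby
```

The order of the node list is the fail-over order, so adding more standby servers only means appending them to the list.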

Load Balancing Cluster
 
Let us suppose we have one person to serve our customers in a bank. If more customers arrive, they will have to wait to get the service. It means customers are not getting "service" within the expected time. In other words, the service was not "available" for them. How can we make sure that all, or at least most, customers get service in the expected time? We can have more persons who share the "load" with the first person.
 
"Load Balancing" follows the same concept. You have multiple (similar) servers and you make them part of a cluster. Each request first comes to a "load balancer", which decides whether it should go to server 1, server 2, or another server. It decides based on different parameters (e.g. the current load on each server). You may add more servers to increase the service "availability".
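As a rough illustration of how a load balancer might decide "based on the current load", the sketch below always picks the server with the fewest requests in progress. The class name and the "least connections" rule are assumptions chosen for this example; real load balancers offer several strategies (round robin, weighted, and others).

```python
from typing import Dict, Iterable


class LeastConnectionsBalancer:
    """Sends each new request to the server with the fewest active requests."""

    def __init__(self, server_names: Iterable[str]) -> None:
        # Active request count per server, all starting at zero.
        self.active: Dict[str, int] = {name: 0 for name in server_names}
        if not self.active:
            raise ValueError("at least one server is required")

    def acquire(self) -> str:
        """Pick a server for a new request and count the request as active."""
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name: str) -> None:
        """Mark one request on the given server as finished."""
        if self.active[name] > 0:
            self.active[name] -= 1


if __name__ == "__main__":
    balancer = LeastConnectionsBalancer(["server-1", "server-2"])
    first = balancer.acquire()    # both idle, so the first server is chosen
    second = balancer.acquire()   # server-1 is busy, so server-2 is chosen
    balancer.release(first)       # server-1 finishes its request
    third = balancer.acquire()    # server-1 is the least loaded again
    print(first, second, third)
```

In the bank analogy, `acquire` is the customer being sent to the least busy counter, and `release` is that customer leaving the counter.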
 

Use of RAID (Redundant Array of Independent Disks)

 
In a RAID setup that uses redundancy (for example, mirroring), the same data is stored on multiple disks. If one disk fails, another disk is used without affecting the "availability" of the data. A failed disk can then be replaced with a new one without impacting the other disks in the array.
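The idea of keeping the same data on several disks can be modelled in a few lines. The sketch below is a toy model of mirroring only, with hypothetical `Disk` and `MirroredArray` classes and in-memory dictionaries standing in for disks; it is not how a real RAID controller is implemented.

```python
from typing import Dict, Iterable, List


class Disk:
    """Toy disk: a dictionary of blocks plus a simulated failure flag."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.failed = False
        self.blocks: Dict[int, bytes] = {}


class MirroredArray:
    """Writes every block to all working disks and reads from any of them."""

    def __init__(self, disks: Iterable[Disk]) -> None:
        self.disks: List[Disk] = list(disks)

    def _working(self) -> List[Disk]:
        return [disk for disk in self.disks if not disk.failed]

    def write(self, block_id: int, data: bytes) -> None:
        working = self._working()
        if not working:
            raise OSError("no working disk left in the array")
        for disk in working:
            disk.blocks[block_id] = data  # same data on every working disk

    def read(self, block_id: int) -> bytes:
        working = self._working()
        if not working:
            raise OSError("no working disk left in the array")
        return working[0].blocks[block_id]

    def replace(self, index: int, new_disk: Disk) -> None:
        """Swap in a new disk and rebuild it from a surviving mirror."""
        survivors = [d for i, d in enumerate(self.disks)
                     if i != index and not d.failed]
        if not survivors:
            raise OSError("no surviving disk to rebuild from")
        new_disk.blocks = dict(survivors[0].blocks)
        self.disks[index] = new_disk
```

With two disks in the array, marking one as failed still lets `read` return the data from the other, and `replace` copies the surviving data onto the new disk.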
 

Notifications and Monitoring

 
Keep an eye on your application and server performance. The system should raise proper alerts, so that you are aware of any critical event on the server that may lead to downtime. Perform regular maintenance on the server to reduce the chance of a crash, and review the server and application logs to understand the usage trends. There are paid tools (for example, SolarWinds) and open-source tools that let you register different types of events and notify you when they occur.
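A very small version of such an alerting rule could look like the sketch below. The metric names and the limits are hypothetical values chosen for illustration; a monitoring tool would collect the readings and deliver the notifications for you.

```python
from typing import Dict, List

# Hypothetical limits; tune them to your own servers.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 90.0,
    "memory_percent": 90.0,
    "disk_used_percent": 85.0,
}


def check_metrics(metrics: Dict[str, float],
                  thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return one alert message per metric that is missing or over its limit."""
    alerts: List[str] = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A silent metric is itself a warning sign.
            alerts.append(f"{name}: no data received")
        elif value >= limit:
            alerts.append(f"{name}: {value} is at or above the limit of {limit}")
    return alerts


if __name__ == "__main__":
    reading = {"cpu_percent": 50.0, "memory_percent": 40.0, "disk_used_percent": 91.0}
    for alert in check_metrics(reading):
        print("ALERT", alert)
```

Treating a missing metric as an alert is a deliberate choice here: a server that stops reporting is often the first visible sign of a problem.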
 

Better Customer Support

 
You should have a strong customer support service that can guide your end-users during downtime and inform them when services are up again.
 
This team should work as a bridge between the customers and the technical team. Escalating an issue to the right team quickly and responding to the stakeholders efficiently can reduce the downtime. For this purpose, companies maintain a "service desk" team. This team maintains the details of all applications and the relevant technical teams, along with information on how and to whom an issue should be escalated according to its severity and type. The medium of communication (email, message, call, etc.) is also decided in advance. The team should be well trained in these aspects.
 

Recovery Plan/DR (Disaster Recovery) Plan

 
In some situations, you may lose your server (hardware failure, hard disk failure, etc.). For such cases, you should have a proper recovery plan, with SOPs (Standard Operating Procedures) defined for each situation. We call such policies or SOPs a DR plan. Simulations should be run in advance to check how the plan will work if a disaster really happens. Such disasters may not happen frequently, but when they do, they can cause a big loss if recovery is not completed in time. Also, if the whole application can't be made available at once, whatever part can be restored should be restored first. The data backup strategy is a very important part of this plan.
 

Use of Cloud-Based Services

 
To take care of most of the above points, you can think of using a "Cloud" based environment. Cloud-based service providers such as Microsoft, Google, and Amazon have many data centers throughout the globe. You may choose the most appropriate data centers and plan according to your availability "needs".
 

Conclusion

 
We can see that ensuring the "availability" of your application is mostly an "architectural" decision. There can be more options based on your specific architecture or application type. The above discussion is not specific to "web" servers; it can be extended to any type of server (e.g. DB servers, app servers). You also need to decide what to do when your application depends on some external resource and that resource is not available. You should have an alternate plan to handle this impact.
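One possible alternate plan for an unavailable external resource is to fall back to the last value that was fetched successfully. The sketch below assumes a hypothetical `fetch` function that raises `ConnectionError` or `TimeoutError` when the resource cannot be reached; whether slightly stale data is acceptable depends on your application.

```python
from typing import Callable, Dict, Optional


class CachedFallback:
    """Wraps a call to an external resource and remembers the last good answer."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self.fetch = fetch                    # hypothetical call to the external resource
        self.last_good: Dict[str, str] = {}   # last successful value per key

    def get(self, key: str) -> Optional[str]:
        try:
            value = self.fetch(key)
        except (ConnectionError, TimeoutError):
            # Resource unavailable: serve the last known value, or None if
            # nothing was ever fetched for this key.
            return self.last_good.get(key)
        self.last_good[key] = value
        return value
```

The caller still has to handle the `None` case, for example by showing a "temporarily unavailable" message for that one feature instead of failing the whole application.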

