Introduction
Architectural decision for database designing is a very important activity when we are building an Enterprise Grade application. It is very important to choose the right type of Database and have it hosted on the right platform. A couple of years earlier the nature and type of data that was getting generated were similar in most of the cases and the options available to store and handle these data were limited. It was easier for Data Architects then as they just had to choose the right Database software vendor and most of the time the data used to be stored in Relational Databases.
In today’s age, data generated is huge and the nature of data varies a lot. The modern application can be a Conversational Interface or a Bot receiving unstructured conversational data from a user or an IOT device interface receiving lots of real-time data or can be a simple Web Application receiving relational data. This poses a potential challenge for the Data Architects to examine the nature of data and take the right decision to choose the right type of data that can fit in best for the application that is being architected.
While choosing the right database, a lot of considerations have to be accounted for. A few of them like the nature of data, how the data is to be consumed, data security, GDPR requirements, high availability and many more. However, in this article, we will discuss choosing a database based on nature and type of data.
The database offerings native to a database vendor are not covered here, although it is an important aspect to be considered. The discussion in this article comes from a data-agnostic perspective on how to choose the right database to store the data in the right way. We will try to see when to choose a Relational SQL database and when to go for a No SQL database in this article.
Relational Databases
Let us try to understand what Relational Databases are. They are by far the oldest and most used databases adopted widely by enterprises.
In these databases, data is stored in tables as rows and columns. There is a clear non-ambiguous relationship that exists between the data stored in different tables. Each row having data in a table is called a record and is identified by a primary key and the data relationship is exhibited using foreign keys. Normalization is applied to the data while storing it.
In the example below, we have a table having Employee Id, Name and his Role and a second table that has Access Id, Employee Id, and Access. The second table has a foreign key relationship with the first table. Each record in the second table has a relationship with the first table using the EmployeeID key. In simple words, all the three accesses listed out in the second table belong to Employee Id 1003 listed in the first table. This is an example of how a relational database looks.
The very advantage of this kind of database is the ease with which it can be queried. As it has been there for ages there are a lot of standards and capabilities that have evolved around it with a proven rate of adoption at enterprise level thus making it the favorite category of the database for developers and enterprises.
The data can be partitioned and indexed thus increasing the data seek performance. Transactional integrity which can be implemented for any relational database makes the data write safe. All database ACID principles hold good here.
Now what if we have a huge amount of data or highly unstructured data or data schema varies a lot or the data is highly hierarchical in depth. In such cases should the relational database be a choice here? The answer to this is No. The degree of complexity to handle such data with relational database design technique would be relatively very high and the design would not be efficient.
No SQL Databases
In the modern computing paradigm, the data is generated using a wide range of devices and user interfaces. The data can be very huge and the schema may vary like anything. It can be a couple of IoT devices generating huge unstructured data or a simple social application generating data with varying schema. In all such scenarios, No SQL Databases can be used. There can also be a huge number of users that may be accessing the data and performance can be a bottleneck. No SQL Database addresses solve performance issues gracefully as they can scale out easily and can be highly distributed.
ACID principles do not apply or go with No SQL databases. However, this is something to not be worried much about here and it adheres to BASE principles and the CAP theorem.
CAP theorem is applicable to all distributed Databases. The below illustration explains CAP theorem in a nutshell.
BASE principles are derived from CAP theorem. Per BASE, No SQL database system should be basically available always. The system data may change without any user input as per Soft State guidance. Eventual Consistency should be guaranteed always without fail. The terminology BASE can be expanded as per below figure.
There are different types of NO SQL databases. However, in this article, we will explore design guidance for Document Databases and Graph Databases, these two being the most popular choice, and it is quite relevant among developers and enterprises.
Document Databases
This category of No SQL database is the favorite choice for social applications. They can also be extended very easily for real-time systems or applications having an Internet of Things landscape. These databases can accommodate a varying degree of the schema for the data. There is no tightly coupled relationship defined for such databases. These databases can accommodate XML based storage or a JSON based storage. However, JSON is highly popular in this segment of the database. The querying mechanism can be as simple as using REST mechanism to transact with the database. However, they do not support JOIN operations as the relationship between the databases is loosely coupled or cannot be at all.
The below figure represents a document database. A point to note is the varying schema of the data documents it is storing in the database. The flexibility to handle varying schema is well suited in the modern data-generating application and this is what makes it well accepted in the developers’ community nowadays.
Querying a Document database is not that simple as in Relational databases. As the relationship is loosely defined data, it cannot be joined across tables thus making the data retrieval activity complex. Standardization across vendors is again a challenge as these databases are still maturing, unlike relational databases which are there for ages.
Graph Databases
There are many use cases where the data is highly hierarchical. Relationship spans to a high degree of depth. Storing and managing such data is highly complex using relational databases. Such data can be very easily stored using Graph Databases. Social applications can use such databases to define the hierarchy of social data relationship for a person for example, who is a personal friend with, who is his friend of friends, what his friend of a friend of friends like and so on.
A graph database has two components – Node and Relationship. Node is an entity and can be a person, interest, place, animal, etc. A relationship defines how the nodes are related to each other. Relationship explicitly defines the association between entities. The entity is always defined in nodes and the relationship is defined as edges.
The diagram below illustrates a graph database. Circled nodes represent entities – Abhishek, Amit, Dan, Burger, Pizza, and The Royal Hotel. The arrow represents relationships – likes to eat, does not like to eat, follows and available at.
Interpretation from this database can be as below,
- Abhishek follows Amit
- Abhishek likes to eat Burgers available at The Royal Hotel
- Amit likes to eat Pizza available at The Royal Hotel
- Dan follows Abhishek
- Dan likes to eat Burgers available at The Royal Hotel
- Dan does like to eat Burgers and Pizza available at The Royal Hotel
We can see a highly complex relationship between entities could easily be established using Graph Database. Traversing the entities based on the relationship is fairly easy now.
However, there are a few considerations that have to be taken before zeroing on Graph Databases. Data that is flat or has no relationship should not fit here. Data having a hierarchy in width with just a single degree of depth should be tried with a relational database approach first. It is not ideal to store blobs or huge data for a node in the graph database. The data traversing should be starting from a known node else graph database is not a good fit.
Winding-up
There are other databases like Key-Value database and column-store database which is not discussed here. These databases are fairly simple and have very niche usage areas. Choosing a database is a very important consideration in any enterprise architecture. Data is the backbone of all applications. Not having the right database can have a gravely negative impact on application functionality. So one has to choose the database very carefully after studying all associated factors.