Design And Implement A Graph Database In Azure Cosmos DB

Debasis Saha

Introduction

Welcome back to the fourth article in the series An Overview of Cosmos DB – Start from Scratch. In the previous article, we discussed server-side objects in Cosmos DB and the steps to create user-defined functions, stored procedures, and triggers. In this article, we discuss graph databases in Cosmos DB: the graph data model, its common uses, and how to create a graph database using an Azure Cosmos DB account. If you want to read the previous articles in this series, please follow the links below:

This article introduces the basic concepts of graph databases. Graph databases have become very popular in recent years with the rise of social networks. Beyond social networks, however, many other industries and data sets are a good fit for the graph data model. For example, a graph database could model a rail network and would be very efficient at calculating the possible routes between stations. By the end of this article, the reader should be able to:

Describe the main concepts of the graph model
Explain how graph databases are supported in Cosmos DB
Design and create a NoSQL graph database to support business requirements
Use the Gremlin console in the Azure Portal
Perform efficient operations on the graph database

Basic Concepts of Graph Model

In the real world, there are many ways to model a problem domain with data. In a graph database, you model objects such as people as vertices (or nodes). The relationships between those objects become edges (or arcs). Metadata about the objects and relationships is stored as properties on both the vertices and the edges.

Let’s take a movie database as an example. People and movies are vertices. A person might be an actor or director, and a person vertex stores the name, age, and nationality. A movie vertex has genre, release date, and budget properties. The vertices are joined by their relationships, which are the edges. An example of an edge is a person who has an acting role, or roles, in a movie; the properties of that edge would be the names of the roles acted. Another edge would be a movie that’s directed by a person or people. Once a graph has been created, it can be traversed to answer questions: you move from vertex to vertex along the edges that connect them. This traversal is called graph processing, and it uses a graph query language such as Gremlin or Cypher to gather information from the vertices and edges visited.
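To make the movie example concrete, here is a minimal Python sketch of vertices and edges that carry properties. It is an in-memory illustration only, not how Cosmos DB stores data; the class names, the edge labels (acted_in, directed_by), and all sample values are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str  # category of the vertex, e.g. "person" or "movie"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str  # kind of relationship, e.g. "acted_in"
    out_v: str  # id of the vertex the edge leaves
    in_v: str   # id of the vertex the edge enters
    properties: dict = field(default_factory=dict)


# Hypothetical sample data, not taken from a real filmography.
vertices = {
    "p1": Vertex("p1", "person", {"name": "Actor A", "nationality": "X"}),
    "p2": Vertex("p2", "person", {"name": "Director B"}),
    "m1": Vertex("m1", "movie", {"title": "Movie M", "genre": "drama"}),
}
edges = [
    Edge("acted_in", "p1", "m1", {"roles": ["Role R"]}),
    Edge("directed_by", "m1", "p2"),
]


def out_neighbors(vertex_id, edge_label):
    """Follow outgoing edges with the given label and return the target vertices."""
    return [vertices[e.in_v] for e in edges
            if e.out_v == vertex_id and e.label == edge_label]


# Traverse: person -> movies acted in -> directors of those movies.
movies = out_neighbors("p1", "acted_in")
directors = [d for m in movies for d in out_neighbors(m.id, "directed_by")]
director_names = [d.properties["name"] for d in directors]
print(director_names)
```

The two-hop traversal at the end answers "who directed the movies this person acted in?" purely by following edges, which is the idea a graph query language expresses declaratively.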

Due to its network structure, a graph database is an excellent option for many common use cases, such as:

Social networking
A typical social networking database contains a significant number of contacts, together with the connections that these contacts have with each other, either directly or indirectly. The database might also reference other types of information, such as shared documents or photographs (these can be physically stored outside of the graph database, with the database holding keys or other identifiers to locate the data). The result is typically a complex network of friends (connections between contacts) and likes (connections between contacts and documents/photographs).
Route calculations
We can use a graph database to help solve complex routing problems that would require considerable resources to resolve by using an algorithmic approach. For example, we might quickly determine the shortest path between two vertices in a highly connected graph.
Managing networks
A graph database is an ideal repository for modeling complex telecommunications or data networks. We can use a graph database to help spot weak points in the network: failure analysis becomes a matter of examining what might happen if one or more of the network’s vertices become unavailable.
Generating recommendations
We can use a graph database as a recommendation engine. For example, in a retail system, we can store information about products that are frequently purchased together. We then use the resulting graph to generate "customers who bought XYZ also bought ABC" recommendations when a customer views the details of a product.
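As a rough illustration of the recommendation use case, the following Python sketch treats every pair of products bought in the same order as a weighted edge and ranks a product's neighbors by that weight. The order data and the also_bought helper are hypothetical; a real system would keep these edges in the graph database rather than in memory.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; each set lists the products in one basket.
orders = [
    {"XYZ", "ABC", "DEF"},
    {"XYZ", "ABC"},
    {"XYZ", "GHI"},
    {"ABC", "DEF"},
]

# Every pair bought together is an undirected edge whose weight is
# the number of orders containing both products.
co_purchases = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_purchases[(a, b)] += 1


def also_bought(product, top_n=2):
    """Return the products most often bought together with `product`."""
    scores = Counter()
    for (a, b), weight in co_purchases.items():
        if a == product:
            scores[b] += weight
        elif b == product:
            scores[a] += weight
    # Highest weight first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


print(also_bought("XYZ"))
```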

Let’s consider how to answer the question: “How is Kevin Bacon connected to Dina Meyer?” by using a graph of actors, directors, and movies. In a relational database, the data might be modeled as follows:

We could use a complex recursive Common Table Expression (CTE) or multiple nested subqueries to answer the question. Even with appropriate indexes, both options might perform poorly, especially at the fourth and higher degrees of separation. In a graph database, by contrast, the data is modeled as follows:

The biggest difference between graph databases and relational databases is that the connections between vertices are stored as direct links. Looking for related data is therefore a matter of following connections. The performance problems associated with index lookups and index maintenance are avoided because the connections are specified at the point where data is inserted into the graph. The graph is then efficiently “walked”, rather than the database having to work out how to join discrete data sets for each query.
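The idea of walking the graph can be sketched with a breadth-first search over an adjacency list. This is an illustrative Python sketch, not what Cosmos DB does internally, and the intermediate vertices (Movie A, Actor X, Movie B) are invented placeholders that do not reflect any real filmography.

```python
from collections import deque

# Hypothetical undirected graph: each vertex maps to the vertices it is linked to.
graph = {
    "Kevin Bacon": ["Movie A"],
    "Movie A": ["Kevin Bacon", "Actor X"],
    "Actor X": ["Movie A", "Movie B"],
    "Movie B": ["Actor X", "Dina Meyer"],
    "Dina Meyer": ["Movie B"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest list of vertices from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two vertices are not connected


path = shortest_path(graph, "Kevin Bacon", "Dina Meyer")
print(" -> ".join(path))
```

Each step only looks at the links stored on the current vertex, so no join has to be computed while the search moves outward.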

Graph Databases in Cosmos DB

Cosmos DB databases are globally distributed and elastically scalable, with single-digit millisecond latency. When a Cosmos DB account is created with the graph API, the Gremlin graph traversal language is supported at the protocol level, and a Gremlin-aware endpoint is made available. Unlike other graph databases, such as Neo4j or JanusGraph, a Cosmos DB database that has a graph API also allows you to write SQL-like queries to explore the data and to add backend data processing through stored procedures and triggers written in JavaScript. Azure Cosmos DB provides the following features for graph databases:

Automatic Indexing of All Properties
Elastically Scalable
Globally Distributed
Provides Multi-region replication, including multi-master
Provides Tunable Consistency
Multimodel: Graph data is stored as a document.
Queries can use either the Gremlin or the SQL API.

We can use languages and platforms such as .NET, Java, Node.js, Python, and PHP to work with a graph database through both the SQL API and the Gremlin API in our applications. The SQL API can be used to create and delete graph databases and graphs, and to set options such as their throughput and partition keys. Once the database exists, a Gremlin driver such as Gremlin.NET can be used to insert and query graph data more efficiently.
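As a sketch of what an application needs before it can talk to the Gremlin endpoint, the Python helper below assembles the usual connection values. The account, database, and graph names are hypothetical, the helper itself is not part of any SDK, and the endpoint and username formats follow the pattern commonly shown for Cosmos DB Gremlin accounts, so check them against the connection details in your own account.

```python
import os


def gremlin_connection_settings(account, database, graph, key):
    """Assemble the values a Gremlin driver typically needs (hypothetical helper)."""
    return {
        # Gremlin endpoint of the account, reached over secure WebSockets.
        "url": f"wss://{account}.gremlin.cosmos.azure.com:443/",
        # Name of the traversal source used in queries such as g.V().
        "traversal_source": "g",
        # The username identifies the database and the graph (collection).
        "username": f"/dbs/{database}/colls/{graph}",
        # The password is the account key; never hard-code a real key.
        "password": key,
    }


settings = gremlin_connection_settings(
    account="my-account",   # assumption: placeholder account name
    database="sample-db",   # assumption: placeholder database ID
    graph="Persons",        # graph ID used in this article's walkthrough
    key=os.environ.get("COSMOS_KEY", "<primary-key>"),
)
print(settings["url"])
print(settings["username"])
```

Reading the key from an environment variable (COSMOS_KEY is an assumed name) keeps the secret out of source code.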

Create a Graph Database using Azure Portal

Step 1

Sign in to the Azure Portal with your credentials.

Step 2

On the Azure Portal menu, click on Create a Resource.

Step 3

Now click on Database --> Azure Cosmos DB

Step 4

On the Create Azure Cosmos DB Account page, enter the settings for the new Azure Cosmos DB account, including the account name and location, and select Gremlin (graph) as the API.

Step 5

Now, click on Review + Create Button

Step 6

After the settings have been validated, click on the Create button to create the account.

Step 7

Once the deployment has succeeded, click on the Go to resource button.

Step 8

Now, click on the Data Explorer option in the left panel.

Step 9

Now click on the New Graph option to create the database and the graph.

Step 10

Now provide the Database ID and Graph ID and then click on the OK button.

Step 11

Now, the Graph Database has been created.

Step 12

Now, click on the New Vertex button to add new records. To insert a record into the graph database, we provide its data as key-value pairs (properties), along with the label of the vertex.

Step 13

Now, once we execute a Gremlin query in Data Explorer, we can view the results both in graph format and in JSON format.

Azure Cosmos DB supports the Gremlin traversal language. If you have previously written Gremlin queries for another TinkerPop-compatible database such as JanusGraph, many of those queries will run against Cosmos DB with little or no change, although Cosmos DB supports only a subset of the Gremlin steps. Cosmos DB also supports the JSON-based GraphSON data format. The results of a Gremlin query are returned to the client in this format, providing further opportunities for interoperability.
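To give a feel for the JSON that comes back, here is a small Python sketch that flattens a GraphSON-style vertex into a plain dictionary. The sample payload is illustrative: its overall shape (a properties object whose values are lists of id/value entries) is an assumption based on typical Cosmos DB responses, and the property ids are made up.

```python
import json

# Illustrative response for a single vertex; the "prop-N" ids are invented.
raw = """
[{"id": "1", "label": "person", "type": "vertex",
  "properties": {
    "name": [{"id": "prop-1", "value": "Eva"}],
    "age":  [{"id": "prop-2", "value": 44}],
    "city": [{"id": "prop-3", "value": "New Jersey"}]
  }}]
"""


def flatten_vertex(vertex):
    """Turn a GraphSON-style vertex into a flat dict of property values."""
    flat = {"id": vertex["id"], "label": vertex["label"]}
    for key, entries in vertex.get("properties", {}).items():
        values = [entry["value"] for entry in entries]
        # A property may hold several values; unwrap the common single-value case.
        flat[key] = values[0] if len(values) == 1 else values
    return flat


people = [flatten_vertex(v) for v in json.loads(raw)]
print(people[0])
```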

Concept of vertices, labels, edges, and properties

If the domain we are trying to model is made up of a heterogeneous collection of objects that are related to each other in many ways, then a graph approach to storing and processing data is a good fit. Using a complex relational database as the starting point for a graph modeling exercise shows that relational data isn’t always the most efficient approach. When modeling, we should always remember to take a “node first” approach:

Nodes - Distinct conceptual Identity
Labels - Categories of Nodes
Edges - The relationship between the nodes
Properties - The metadata to be stored against nodes and edges.

Labels in a graph allow us to group vertices into sets. The benefit of applying labels is that a query can run against that reduced set of data instead of the whole graph.

Establish the relationship between different graph data

Now, in the Persons graph (which we created earlier), we need to insert the data from the table below:

id	name	age	city
1	Eva	44	New Jersey
2	Albert	36	California
3	Veronica	32	Paris
4	Mohit Sharma	35	New Delhi
5	John	23	London
6	Dip	38	London
7	Robert	40	Paris
8	Eric	38	New Jersey
9	Smith	41	California
10	Stephen	25	New Delhi
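The same rows could also be inserted with Gremlin instead of the New Vertex form. The Python sketch below only builds the query text for a few of the rows; the helper names are hypothetical, the 'person' label is the one used in the queries later in this article, and rendering the id as a string is an assumption. If the graph was created with a partition key, that property would have to be included as well.

```python
rows = [
    (1, "Eva", 44, "New Jersey"),
    (2, "Albert", 36, "California"),
    (3, "Veronica", 32, "Paris"),
]


def gremlin_literal(value):
    """Render a Python value as a Gremlin literal (strings quoted and escaped)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex_query(label, vertex_id, **props):
    """Build a g.addV(...) query with one property() step per key-value pair."""
    parts = [
        f"g.addV({gremlin_literal(label)})",
        f".property('id', {gremlin_literal(str(vertex_id))})",
    ]
    parts += [
        f".property({gremlin_literal(key)}, {gremlin_literal(value)})"
        for key, value in props.items()
    ]
    return "".join(parts)


queries = [
    add_vertex_query("person", vid, name=name, age=age, city=city)
    for vid, name, age, city in rows
]
for query in queries:
    print(query)
```

Building query text by hand is shown here only to make the shape of the statement visible; in application code, the parameter bindings offered by Gremlin drivers are generally the safer choice.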

At this point, all of these records are independent of each other, so we need to establish relationships between them. Suppose the relationship between these records is friendship. We then establish the relationships as follows.

Eva is a Friend of Eric.

Now, select Eva from the result list. In the right-side panel, click on the Edit button under the Targets section. Select Eric from the target list, enter Friends in the Edge Label text box, and save the relationship. The relationship between the two vertices is shown below. We can establish the same type of relationship between other vertices in the same way.

Similarly, the image below shows that Veronica is a friend of Eva and Eva is a friend of Eric.
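The same relationships can be expressed as Gremlin addE steps rather than through the Edit panel. This Python sketch only builds the query text; it assumes the vertices were created with the id values from the table (Eva = 1, Veronica = 3, Eric = 8) and uses the Friends edge label entered in the portal. The helper name is hypothetical.

```python
def add_edge_query(from_id, label, to_id):
    """Build a query that adds an edge from one vertex id to another."""
    return f"g.V('{from_id}').addE('{label}').to(g.V('{to_id}'))"


# (source id, target id) pairs: Veronica -> Eva and Eva -> Eric.
friendships = [("3", "1"), ("1", "8")]

edge_queries = [add_edge_query(src, "Friends", dst) for src, dst in friendships]
for query in edge_queries:
    print(query)
```

Edges in Gremlin are directed, so "Veronica is a friend of Eva" is stored as an edge leaving Veronica and entering Eva; a traversal can still follow it in either direction.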

Now, suppose we establish a relationship between Veronica and all the other records. The graph will then look as shown below:

Basic Overview of Gremlin Query in Graph Database

When we query a graph with Gremlin, we need to consider it as traveling through the data's structure. The correct graph terminology for retrieving data is traversal. The Gremlin query language moves through the data, along the edges that connect the vertices. You should read a Gremlin query as a chain of operations being applied in order from left to right. Using the following graph as an example, you will see how to construct a Gremlin query to traverse it.

A graph is made up of vertices and edges, so the first Gremlin query to consider is:

g.V(); g.E()

The “g” refers to the graph you are traversing. Using Graph Explorer to run the previous code, it returns the whole graph and can be read as “traverse all the vertices, and all the edges in the graph”. There’s no equivalent statement in SQL that shows you all the entities in a database, and how they are all related. A graph gives you the ability to see the whole problem domain in one picture.

In our graph, we have one type of vertex: person. In a relational database, if we want to return just the persons, the SQL statement is as follows:

SELECT * FROM persons

The equivalent traversal in Gremlin is:

g.V().hasLabel('person')

When we run this query in Data Explorer, the output will be as shown below:

If we want to return the vertex with a specific name like Albert, then the query will be:

g.V().hasLabel('person').has('name', 'Albert')

We’ve already seen examples of how Gremlin filters results; the following commands also filter results:

has - specify a key and value pair that the entity must have.
hasLabel - a shortcut for the equivalent has('label', 'value of the label').
hasNot - specify a property key that the entity must not have.
is, not, and, or - Boolean operators to combine conditions.
where - can be used to compare the current position in a traversal when combined with a select(), but can also be used on its own to filter on a condition.
dedup - remove duplicates at the current position in the traversal.
range - return a range of entities, specified as (from, to), where from is inclusive and to is exclusive.
select - allows the graph to be examined from a previous step in a traversal.
simplePath - stops a traversal from reusing a part of the previous path in the traversal.
cyclicPath - keeps only the traversals that reuse a part of their previous path.
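To show what a few of these filter steps do, the Python sketch below imitates them over an in-memory list of vertices. It is not the Gremlin engine: the function names are hypothetical stand-ins that mirror hasLabel, has, hasNot, dedup, and range, and the data is a subset of the table used earlier.

```python
from itertools import islice

people = [
    {"id": "1", "label": "person", "name": "Eva", "city": "New Jersey"},
    {"id": "2", "label": "person", "name": "Albert", "city": "California"},
    {"id": "3", "label": "person", "name": "Veronica", "city": "Paris"},
    {"id": "5", "label": "person", "name": "John", "city": "London"},
    {"id": "6", "label": "person", "name": "Dip", "city": "London"},
]


def has_label(items, label):
    """Keep vertices whose label matches, like hasLabel('person')."""
    return (v for v in items if v["label"] == label)


def has(items, key, value):
    """Keep vertices that carry the given key with the given value."""
    return (v for v in items if v.get(key) == value)


def has_not(items, key):
    """Keep vertices that do not carry the given key at all."""
    return (v for v in items if key not in v)


def dedup(items):
    """Drop repeated values while preserving the order of first appearance."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


def range_(items, low, high):
    """Return items from position low (inclusive) up to high (exclusive)."""
    return islice(items, low, high)


# Steps are chained in the order they would be read in a Gremlin query.
londoners = [v["name"] for v in has(has_label(people, "person"), "city", "London")]
cities = list(dedup(v["city"] for v in has_label(people, "person")))
second_and_third = [v["name"] for v in range_(people, 1, 3)]
without_age = [v["name"] for v in has_not(people, "age")]

print(londoners)
print(cities)
print(second_and_third)
```

Because each helper takes the output of the previous one, the nesting reads like the chain of steps in a traversal, only written inside out.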

Conclusion

In this article, we discussed graph databases in Azure Cosmos DB. We covered the basic concepts of graph databases, how to create a graph database in an Azure Cosmos DB account, how to insert graph data using the Azure Portal, how to establish relationships between vertices, and how to perform basic queries using the Gremlin API in Cosmos DB. Any suggestions, feedback, or questions related to this article are most welcome.