Introduction To Graph Databases

Graph databases are based on the theory of directed graphs and belong to the NoSQL family of databases. Information in a graph database is stored in the form noun - verb - noun; an example is shown below. The nodes are entities such as a person, corporation, organization, or account, and the edges (arrows) are the relationships between them. For example, a person (A) owns (arrow) a car (B). Each noun and verb can carry multiple attributes: in our example, (B) could have a car name and a car brand. Each node can have incoming edges, outgoing edges, or both.

[Figure: Graph Databases]
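
The noun - verb - noun structure described above can be sketched in a few lines of Python. This is only an illustration, not how any particular graph database stores its data: the class names `Node` and `Edge`, the helper functions, and all property values (such as "Alice" and "ExampleMotors") are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A noun: an entity such as a person or a car."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A verb: a directed relationship from one node to another."""
    source: str  # id of the node the arrow leaves
    verb: str
    target: str  # id of the node the arrow points to
    properties: dict = field(default_factory=dict)


# Hypothetical data: person (A) owns car (B); both carry attributes.
nodes = {
    "A": Node("Person", {"name": "Alice"}),
    "B": Node("Car", {"name": "Family hatchback", "brand": "ExampleMotors"}),
}
edges = [Edge("A", "owns", "B", {"since": 2020})]


def outgoing(node_id):
    """Edges whose arrow starts at the given node."""
    return [e for e in edges if e.source == node_id]


def incoming(node_id):
    """Edges whose arrow ends at the given node."""
    return [e for e in edges if e.target == node_id]


def describe(edge):
    """Render an edge in noun - verb - noun form."""
    return f"{nodes[edge.source].label} {edge.verb} {nodes[edge.target].label}"


print(describe(edges[0]))
```

Here node A has one outgoing edge and no incoming edges, while node B has one incoming edge and no outgoing edges.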

Why Graph Databases Are Needed

Graph databases suit a set of specialized use cases in which one needs to find long chains of connections between entities or transactions, such as the following:

  • Detecting money laundering
  • Searching for patterns
  • Mining data from social chatter
  • Pharmaceutical Industry
  • Bioinformatics
  • Public Transport
  • Public Distribution System
  • Logistics Industry
  • Healthcare Industry
  • IoT
  • And many more.

This is not to say that relational databases cannot handle the above use cases. They can, but only with considerable effort. With a graph database, these use cases can be implemented efficiently because connections are followed by traversing edges, so expensive multi-way join operations are not required. In addition, the schema of a graph database is not rigid; it can be ad hoc and can evolve over time. In short, an RDBMS is optimized for aggregation, whereas a graph database is optimized for connections.
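
To make the contrast with joins concrete, the following is a minimal sketch of finding a chain of connections by traversal, using a breadth-first search over an in-memory adjacency list. The account names and the `find_chain` function are hypothetical, and a real graph database would use its own storage and query language rather than this code. In a relational table of transfers, each additional hop in the chain would typically require another self-join.

```python
from collections import deque

# Hypothetical money transfers: each account maps to the accounts it paid.
transfers = {
    "acct_1": ["acct_2"],
    "acct_2": ["acct_3", "acct_4"],
    "acct_3": ["acct_5"],
    "acct_4": [],
    "acct_5": ["acct_1"],
}


def find_chain(graph, start, goal):
    """Return the shortest chain of edges from start to goal, or None."""
    queue = deque([[start]])  # each queue entry is a path found so far
    visited = {start}
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == goal:
            return path
        for neighbour in graph.get(last, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


chain = find_chain(transfers, "acct_1", "acct_5")
print(chain)  # ['acct_1', 'acct_2', 'acct_3', 'acct_5']
```

The search follows edges outward from the starting account and stops as soon as it reaches the target, regardless of how many hops the chain contains.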

Types of Graph Databases

In general, there are two types of graph databases:

  • Resource Description Framework (RDF) Graph Databases
  • Property Graph Databases

RDF graph databases

RDF is a standard model for the interchange of structured and semi-structured data on the Web. It has features that facilitate merging data even when the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all data consumers to be changed. Some of the market leaders in this category are the following:

    Blazegraph - Blazegraph is an ultra-scalable, high-performance graph database. It supports the Blueprints and RDF/SPARQL APIs, handles up to 50 billion edges on a single machine, and has a high-availability, scale-out architecture.

    MapGraph - It is a GPU-accelerated plug-in for Blazegraph via a Java API. It runs on one GPU (open source at SourceForge) or on a cluster of GPUs.

    AllegroGraph - The AllegroGraph database is a high-performance, persistent, semantic graph database. It uses efficient memory management in combination with disk-based storage, enabling it to scale to billions of triples/quads while maintaining superior performance. It follows the W3C RDF standard and supports SPARQL, RDFS++, and Prolog. Its functionality is available from Python, Java, Common Lisp and other languages.

    OpenLink Virtuoso - Supports quads as well as triples, offers Java, Python and JDBC/ODBC access, and is SPARQL-compliant.
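
The data-merging property of RDF mentioned above follows from the fact that RDF data is a set of subject - predicate - object triples. The sketch below illustrates the idea with plain Python tuples and sets; it does not use a real RDF library, and the `ex:` identifiers and the `match` helper are hypothetical. Two sources that describe the same subject with different predicates can be combined with a simple set union, with no schema to reconcile first.

```python
# Two hypothetical sources describing overlapping entities with
# different vocabularies. Each triple is (subject, predicate, object).
source_a = {
    ("ex:alice", "ex:owns", "ex:car1"),
    ("ex:car1", "ex:brand", "ExampleMotors"),
}
source_b = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:owns", "ex:car1"),  # duplicate of a triple in source_a
}

# Merging is a set union: duplicate triples collapse automatically.
merged = source_a | source_b


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


print(len(merged))                   # 3
print(match(merged, s="ex:alice"))
```

A consumer that only asks for `ex:owns` triples keeps working after `ex:worksFor` triples are added, which is the sense in which the schema can evolve without changing every data consumer.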

Property graph databases

The introduction at the start of this article describes property graph databases. Some of the market leaders for these are the following:

    Neo4j - A property graph database developed by Neo Technology with ACID capability. It supports multiple languages through a REST Web API and provides facilities for loading data into the graph. It includes a data browser.

    IBM System G - Also an RDF graph database supporting Java, Python and C++ with ACID capability. Includes a built-in visualizer.

    OrientDB - Has a console-based and Web tool graph editor supporting Scala, Python, Java or any REST language. Can work as an RDF triple store with a plug-in.

Automated conversion of data into RDF or other graph formats is emerging with tools such as Trifacta and Any23. Trifacta effectively introduces a graph on top of Hadoop/MapReduce, and Any23 extracts structured data from Web documents.

Graph Database in an Enterprise Architecture

With data volumes expected to grow into petabytes and zettabytes over the coming years, enterprise architects need to prepare to establish a graph database standard in their organizations. When modeling data, architects need to understand the relationships within the data and choose the best fit: SQL or NoSQL (key-value store, document database, column-oriented database, or graph database). In traditional data management, we prepare logical and physical data models. The logical data model describes business requirements, and the physical data model specifies how the data is to be persisted in a database. In a graph database, the logical model is the physical model. Architects also need to understand the query tools, search tools, and user-facing reporting and dashboard tools available for production support. Performance, reliability, flexibility, security, and scalability will of course be among the determining factors.