Software Abstractions Perspective: Big Data and Graph Databases
Why do we need Graph Databases?
Today, applications and devices generate a flood of data. This high volume of data is typically dense and highly connected, and it does not fall neatly into pre-defined schemas. Facebook’s Open Graph and Twitter’s interest graph are two obvious examples, but there are many other domains where this applies, such as healthcare, e-commerce and sensor data.
Data made up of highly connected entities is not easily modeled using traditional relational schemas; graph data structures make it easier to represent connected data and to perform rapid analysis on these large datasets.
In a graph database, data is stored as nodes and relationships; both nodes and relationships have properties. Instead of capturing relationships between entities in a join table as in a Relational Database, a Graph Database captures the relationships themselves and their properties directly within the stored data.
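To make the node-and-relationship model concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how any particular graph database stores its data; the class names (Node, Relationship, PropertyGraph), the FOLLOWS relationship type and the sample people are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int          # id of the node the relationship starts at
    end: int            # id of the node the relationship points to
    rel_type: str       # e.g. "FOLLOWS" (hypothetical type name)
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: relationships are stored with their start node."""

    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.outgoing = {}   # node_id -> list of Relationship

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_relationship(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing.setdefault(start, []).append(rel)
        return rel

    def neighbors(self, node_id, rel_type=None):
        # Follow the stored relationships directly; no join table is scanned.
        return [
            self.nodes[rel.end]
            for rel in self.outgoing.get(node_id, [])
            if rel_type is None or rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, name="Alice")
graph.add_node(2, name="Bob")
graph.add_node(3, name="Carol")
graph.add_relationship(1, 2, "FOLLOWS", since=2012)
graph.add_relationship(1, 3, "FOLLOWS", since=2013)

followed = [n.properties["name"] for n in graph.neighbors(1, "FOLLOWS")]
print(followed)
```

The point of the sketch is that the relationship, together with its own properties (here, since), lives next to the node it starts from, so finding a node's neighbors is a direct lookup rather than a join.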
To quote Derrick Harris from his GigaOm article:
Graph analysis is among the hottest techniques around for making sense of large datasets, primarily by determining how tightly different data points are related or how similar they are.
Although graphs have recently become more popular because of their applicability in modeling social networks, graph analysis can be widely applied to analyze any kind of relationships.
Modeling the data as graphs allows data scientists to discover localized patterns, i.e. how specific items are related to other items. Even with large datasets, most analytics queries end up acting locally within a graph. All graphs share common patterns. Simple examples include the diamond, butterfly and star patterns, and these simple patterns can be composed into arbitrarily complex patterns.
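As a rough illustration of what a local pattern query looks like, the Python sketch below finds the centers of star patterns in a small undirected graph. The definition of a star used here (a node with at least min_leaves neighbors that have no other connections) is an assumption made for the example, as are the function names and the sample edges.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighboring nodes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def star_centers(adjacency, min_leaves=3):
    """Return nodes that have at least min_leaves neighbors of degree 1."""
    centers = []
    for node, neighbors in adjacency.items():
        # Only the immediate neighborhood of each node is inspected.
        leaves = [n for n in neighbors if len(adjacency[n]) == 1]
        if len(leaves) >= min_leaves:
            centers.append(node)
    return sorted(centers)


edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "x"),  # three leaves around "hub"
    ("hub", "c"), ("c", "d"), ("d", "e"),      # a chain hanging off "hub"
]
adjacency = build_adjacency(edges)
centers = star_centers(adjacency, min_leaves=3)
print(centers)
```

Each check touches only a node and its direct neighbors, which is the sense in which such queries act locally even when the overall graph is large.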
Graph Database: Neo4J
One of the exciting entries in this area is Neo4J, a Java-based open-source graph database from the Swedish company Neo Technology. Neo4J stores graph data directly and scales out horizontally using replication; in addition, it offers ACID transactions and indexes similar to those of a traditional database. Neo4J also has a REST API and its own graph query language, called Cypher.
A good starting point to learn about Neo4J is Robert Scoble’s video interview with Neo Technology’s CEO, Emil Eifrem. To me, the most interesting section is where Emil nails the value of graph databases in general (at the 6:30 mark in the video):
Sophisticated intelligence and reasoning is all about how are things related to one another. … Whenever the value is in the connection between things, that’s when a graph database excels.
You can watch the video here: http://vimeo.com/56040747
Interestingly, he says that the next step for Neo4J is to work on transparent partitioning, to enable the database to automatically co-locate highly connected nodes. There is a lot of good information about graph databases on the Neo4J site, here: http://www.neo4j.org/learn/neo4j