Graph Databases: The New Way to Access Super Fast Social Data
by Emil Eifrem
Emil Eifrem is the founder of the Neo4j graph database project and CEO of Neo Technology, the company behind the world’s leading graph database. Emil is an internationally recognized thought leader in new database technology, having spoken at conferences on three continents.
Until the NOSQL wave hit a few years ago, the least fun part of a project was dealing with its database. Now there are new technologies to keep the adventuresome developer busy. The catch is that most of these post-relational databases, such as MongoDB, Cassandra, and Riak, are designed to handle simple data, while the most interesting applications deal with a complex, connected world.
A new type of database significantly changes the standard direction taken by NOSQL. Graph databases, unlike their NOSQL and relational brethren, are designed for lightning-fast access to complex data found in social networks, recommendation engines and networked systems.
Pancake, for example, which is Mozilla’s next-generation browser project, uses a graph database to store browsing history in the cloud, since the web is just one big graph.
Graph theory dates back to 1735, when Leonhard Euler solved the Seven Bridges of Königsberg problem by devising a topology consisting of nodes and relationships to answer the then-famous question, “Is it possible to trace a walk through the city that crosses every bridge just once?” Graph theory has since found many uses, but only recently has it been applied to storing and managing data.
It turns out that graphs are a very intuitive way to represent relationships between data.
Think back to your earliest whiteboard graphing session. Traditionally, the developer would hand the resulting sketch off to a DBA and, if she was lucky, would receive a database one month later and start coding. This is because the relational model is tabular, and it takes both time and expertise to represent non-tabular data in a tabular format.
Graph databases let you represent related data as it inherently is: as a set of objects connected by a set of relationships, each with its own set of descriptive properties. With a graph database, the developer can start coding immediately, because the data stored in the database directly parallels the whiteboard representation.
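To make the model concrete, here is a minimal sketch in Python of objects, relationships, and their properties. The class names `Node` and `Relationship`, the `KNOWS` type, and the sample people are illustrative assumptions, not the API of any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """An object in the graph, carrying its own descriptive properties."""
    node_id: int
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed connection between two nodes, with its own properties."""
    start: Node
    end: Node
    rel_type: str
    properties: dict[str, Any] = field(default_factory=dict)


# A whiteboard sketch such as "Alice knows Bob since 2009" maps over directly.
alice = Node(1, {"name": "Alice"})
bob = Node(2, {"name": "Bob"})
knows = Relationship(alice, bob, "KNOWS", {"since": 2009})

print(f"{knows.start.properties['name']} -{knows.rel_type}-> {knows.end.properties['name']}")
```

Note that the relationship is a first-class record with its own properties, rather than a foreign key or a join table.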
Development agility is handy, but it wouldn’t amount to anything without nose-bleeding speed. A recent benchmark took a “friends of friends” query (which finds all of the immediately adjacent nodes and then progresses outward one level at a time) and compared its performance on a relational database and on a graph database. With a query depth of three, the graph database ran over 150 times faster. With a query depth of four, the graph database was over 1,000 times faster.
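The shape of a “friends of friends” query can be sketched as a level-by-level traversal. The following Python is only an illustration of that idea, not the benchmark's code: the function name `friends_within_depth`, the dictionary-based graph, and the sample names are all assumptions.

```python
def friends_within_depth(graph, start, max_depth):
    """Return {person: depth} for everyone reachable within max_depth hops.

    `graph` maps each person to a list of their direct friends.
    """
    depth_of = {start: 0}
    frontier = [start]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for person in frontier:
            for friend in graph.get(person, ()):
                if friend not in depth_of:  # first visit is the shortest path
                    depth_of[friend] = depth
                    next_frontier.append(friend)
        frontier = next_frontier  # progress outward one level
    del depth_of[start]  # the starting person is not their own friend
    return depth_of


# Hypothetical sample data: a small, symmetric friendship graph.
friendships = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

print(friends_within_depth(friendships, "ann", 2))
```

Each extra level of depth multiplies the number of records touched, which is why deeper queries are where the difference between storage models tends to show.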
The reason for this vast difference in performance lies in how data and relationships are stored inside the database. Native graph databases use a technique called “index-free adjacency.” In simple terms, this means that each data element points directly to its inbound and outbound relationships, which, in turn, point directly to related nodes, and so on. This technique allows millions of related records to be traversed per second.
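As a rough in-memory analogy for index-free adjacency, the sketch below gives every node direct references to its inbound and outbound relationships, and every relationship direct references to its two nodes. The names `AdjNode`, `AdjRelationship`, and `neighbors` are hypothetical, and a real graph database stores these pointers on disk rather than as Python objects.

```python
class AdjNode:
    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct references to outbound relationships
        self.incoming = []  # direct references to inbound relationships


class AdjRelationship:
    def __init__(self, start, end, rel_type):
        self.start = start  # direct reference to the start node
        self.end = end      # direct reference to the end node
        self.rel_type = rel_type
        # Register the relationship on both of its endpoints.
        start.outgoing.append(self)
        end.incoming.append(self)


def neighbors(node):
    """Yield adjacent nodes by following references; no global index is consulted."""
    for rel in node.outgoing:
        yield rel.end
    for rel in node.incoming:
        yield rel.start


a, b, c = AdjNode("a"), AdjNode("b"), AdjNode("c")
AdjRelationship(a, b, "KNOWS")
AdjRelationship(c, a, "KNOWS")

print([n.name for n in neighbors(a)])
```

Finding the neighbors of `a` touches only the relationships attached to `a`, however many other nodes exist.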
Relational databases, on the other hand, need to carry out a number of steps to determine whether and how things are connected, and then to retrieve related data records. Response times slow down as a relational database grows in volume, which causes problems as a business grows. However, with a graph database, traversal speed remains constant: it depends on the part of the graph a query touches, not on the total amount of data stored. This allows the database to naturally keep up with a business as it grows.
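One way to picture the scaling difference is to count the steps needed for a single hop. The sketch below uses a plain binary search as a stand-in for an index lookup. That is a deliberate simplification and an assumption of this example: real relational engines use B-trees, caching, and query optimizers, and `index_lookup` is a hypothetical helper, not any database's API.

```python
def index_lookup(sorted_keys, key):
    """Binary search standing in for an index lookup.

    Returns (position, steps), where steps counts the comparisons made.
    """
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo, steps


# Cost of locating one related record as the table grows.
# range() is used as a sorted key column; it supports len() and indexing.
for total_records in (1_000, 1_000_000, 1_000_000_000):
    _, steps = index_lookup(range(total_records), total_records - 1)
    print(f"{total_records:>13,} records: {steps} comparisons per hop")

# By contrast, following a stored reference (as in index-free adjacency)
# is a single dereference per hop, whatever the total record count.
```

The lookup cost here grows with the logarithm of the table size and is paid on every hop, so it compounds as a query goes deeper.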