InfoQ: Graph Databases – Book Review and Interview
Graph Databases book authored by Ian Robinson, Jim Webber, and Emil Eifrém, covers the Graph based NoSQL database technology and different options available for storing “Connected Data” in real world applications. Authors discuss topics like data modeling with graph based domain models and predictive analysis using Graph Theory techniques.
InfoQ spoke with co-authors Ian Robinson and Jim Webber about the book, role of Graph Databases in the NoSQL database space, and what’s coming up in the Graph Databases.
InfoQ: Congratulations on the new book. What is the current status of the book?
Ian and Jim: Thanks Srini! The book was written over the course of the last year, with the free rough cut version made available about 8 months into the process. We’re now at the final editing phase with O’Reilly and the completed book will be out on June 10th. We’ll be giving away copies to all the attendees at the GraphConnect conferences this year, as well as at many of our Neo4j community events.
InfoQ: How does a Graph database compare with a relational database and with other NoSQL databases?
Ian and Jim: Alistair Jones coined the term “relational crossroads” to describe the differences. According to Alistair, we’re at a fork in the road. One path leads to the approach adopted by most NOSQL databases, whereby data is heavily denormalized, and we then rely on the application to join it together, typically at high latency, to gain insight. The other path leads to the approach adopted by graph databases, whereby we use the expressive power of the graph to build a flexible, connected model of the problem at hand, which we then query at low latency to gain insight.
The relational database sits in the middle. Much like a graph database, the relational database has a query-centric model. But this model isn’t as powerful as that of a graph database. In particular, it lacks the ability to create arbitrarily extensive, semantically rich and variably connected structures at runtime. To create any kind of extensive structure in a relational database, we have to plan our joins up front. To allow for variation, we end up creating lots of nullable columns. The result: sparse tables, fancy (expensive) joins, and object-relational impedance problems, even when used in otherwise straightforward applications.
InfoQ: What are the use cases where Graph database is a better fit?
Ian and Jim: Like relational databases, the majority of graph databases are general purpose OLTP databases, and can be applied in a wide range of solutions. That said, they really shine when the solution to a problem depends upon us first understanding how things are connected. This is more common than you might think. And in many cases, it’s not just a matter of knowing that things are connected; often, we need to know something about the different relationships in our domain – their names, qualities, weights, and so on.
In short, connectedness is key. Graphs are the best abstraction we have for modeling and querying connectedness; graph databases, in turn, give application developers and data specialists the power to apply this abstraction to their own particular problems. To that end, we’ve seen them used for social networking, recommendations engines, authorization and access control, routing and logistics, product catalogues, datacenter management, careers management, fraud detection, policing, geospatial, and even emigration applications. The key theme that binds all of these solutions together is that they depend on being able to model and query connected data. Simply having keys and values is not enough; nor is having the data sort-of connected through semantically impoverished joins. We need both connectivity and contextual richness to make these solutions work.
InfoQ: Can you discuss the design considerations developers need to keep in mind when using Graph databases?
Ian and Jim: The most important design decisions concern the data model and queries for any particular application. The key here, as we describe in the book, is to drive out your model from the questions you need to ask of your data. By undercovering the questions at the heart of an application goal or end user need, you identify both the things you’re interested in and the ways in which these things are connected. It’s a short step then to turn these questions into an expressive graph model, and thereafter into the queries you execute against that model.
The resulting model and associated queries are really just projections of the questions you want to ask of your data. With Neo4j’s Cypher query language the complementary nature of these projections becomes obvious: the paths you use to create the graph structure are the same paths you use to query it.
As a good first test of your design, you should be able to read what you have written. Most importantly, you should be able to read your questions in what you have written, without having to make any out-of-band assumptions or inferences. A structure such as ‘(Emil)-[:WROTE]->(Graph Databases)<-[:PUBLISHED]-(O’Reilly)’ reads well, and clearly helps us answer the questions, “Which books has Emil written?”, “Who published Graph Databases?” and “For which publishers has Emil written?”.
InfoQ: What are some architecture and design patterns supported by Graph databases?
InfoQ: Do Graph databases, by their nature, have any limitations in terms of scalability?
Ian and Jim: When we talk about scaling we’re talking about three different things: scaling for large datasets, scaling for read performance, and scaling for write performance.
Regarding scaling for large datasets, there are really no inherent limitations. Neo4j currently has an arbitrary upper limit on the size of the graph (on the order of 10^10, which is large enough to support most graphs we see in the real world, including a Neo4j deployment that has more than half of Facebook’s social graph in one Neo4j cluster), but this limit is going away later this year. This removes the majority of concerns people have with regard to scaling graphs for “big” data.
Scaling for reads similarly presents no problem. Today, in Neo4j, this is accomplished using master-slave replication. Read requests are load balanced across the cluster, where they execute against local data whose structure has been optimized for connected queries. Aside from transactional integrity, Neo4j has historically focused on read performance. To enable fast traversals we have a graph-native stack all the way down to disk. Some other graph stores offer a graph interface over a nonnative graph store, such as a column store. While this may help with other scaling characteristics, it tends to increase query latencies, because the joins are performed in library code, rather than at the level of the data model.
Scaling for writes can be accomplished by scaling vertically, but at some point, for very heavy write loads, it requires the ability to distribute the data across multiple machines. This is the real challenge. While distributing the data may help write performance, it can harm read performance. So far, nobody has implemented a graph database that optimizes and combines fast local traversals with slower (over the network) distributed traversals.
Scaling graph data by distributing it across multiple machines is much more difficult than scaling some of the simpler data models, but it is possible. Scaling a key-value, column-family or document store is relatively easy because in each case you’re dealing with discrete aggregates that can be placed in their entirety in one location or another. Scaling a graph is more difficult because by its very nature the data is connected. When distributing a graph, we want to avoid having relationships that span machines as much as possible; this is called the minimum point-cut problem. On top of the problem of balancing the graph so that there are as few machine-spanning relationships as possible, things are made even more difficult because the graph is always changing. What looks like a good distribution one moment may no longer be optimal a few seconds later. This is known to be an NP-hard problem in the general case.
While we can’t claim to have solved the NP-hard general graph partitioning problem, we’ve found some pretty interesting ways of side-stepping it. In fact we have an R&D team busily experimenting on those ideas right now, and at some point they’ll make it into future product versions. On top of that, we have some tricks up our sleeve to increase the write scaling limits with the current technology.
So to answer the question, we’d say that the connected nature of graph data imposes a theoretical challenge to scaling writes by distributing the graph. But there are practical solutions to the distributed graph problem, which we’re exploring.