Making Sense of Connected Data with Neo4j
By Jim Webber
In order to provide better availability, scale and simplicity, the NOSQL movement has pushed data storage towards simpler models with more sophisticated computation infrastructure compared to traditional RDBMS. In fact, aggregate-oriented NOSQL stores utilizing external map/reduce processing are commonplace today.
In contrast, graph databases like Neo4j actually provide a far richer data model than a traditional RDBMS and a search-centric rather than compute-intensive method for processing data. Instead of reifying links between data by processing that data in batches, Neo4j uses graphs as an expressive means of understanding, storing, and rapidly querying rich domain models.
Although the graph data model is the most expressive supported by the NOSQL stores, graphs can be difficult to understand amid the general clamor of the simpler-is-better NOSQL mantra. But what makes this doubly strange is that “simple” NOSQL databases have avowed their capabilities for graph processing too.
This strange duality where non-graphs stores can be used for (limited) graph applications was the subject many insightful mailing list posts from the graphistas in the Neo4j community. Community members have variously discussed the value of using non-graph stores for managing graph data, particularly since prominent online services have made popular graph processing (like Twitter’s FlockDB).
As it happens the use-case for those graphs tends to be relatively shallow – “friend” and “follow” kinds of relationships. In those situations, it might be a passable solution to hold information in your values, document properties, columns, or even rows in a relational database to indicate a shallow relation as we can see in Figure 1:
Figure 1 Using convention to imply links in flat data
At runtime, the application using the data store (remember: that’s code we have to write and maintain) will try to process the implied foreign keys between stored documents to create a logical graph representation effectively reifying the business information out of the flat data source as we see in Figure 2. This means our application code needs to understand how to create a graph representation from those unlinked documents by enforcing some convention (and doing so consistently at scale).
Figure 2 Reifying connected data out of flat data is hard work!
If the graphs are shallow, this approach might work. Twitter’s FlockDB is an existential proof of such. But as relationships between data become deeper, more intricate and valuable, this is an approach that runs out of steam.
Using the documents and foreign keys convention requires implicit graphs to be structured early on in the system lifecycle (at design time), which in turn means a particular topology from limited understanding of the data model of the system at that time is baked into the data store and into the application layer. This implies tight coupling between the code that reifies the graphs and the mechanism through which they’re flattened into and retrieved from the aggregate store. Any structural changes to the graph now require changes both to the stored data and the logic that reifies the data.
Neo4j takes a different approach: it stores graphs natively and so separates application and storage concerns. Simply put, where your application declares interrelated documents, that’s they way they’re stored, searched, and processed in Neo4j even if those relationships are very deep.
In our social example, the logical graph that we reified from the document store can be natively and efficiently persisted in Neo4j while remaining true to the domain that it supports as shown in Figure 3.
Figure 3 Storing connected data natively in Neo4j
It’s often deceptively easy in some use cases to project a graph from an aggregate store in the beginning. In fact it might seem that using an aggregate store – with its comfortable key/value semantics – is more productive.
For example, we might develop an e-commerce application with customers and items they have purchased. In an aggregate-oriented database we could store the identifiers of products our customers had bought inside the customer entity (or create separate purchase entities linked again by convention only). However making sense of that storage plan again requires application-level code that we have to build and maintain.
In Neo4j however we simply add relationships named PURCHASED between customer nodes and the product nodes they’d bought. Since Neo4j is schema-less, adding these relationships doesn’t require migrations, nor would it affect any existing code using the data. Figure 4 shows this contrast: the graph structure is explicit in the graph database, but implicit in a document store.
Figure 4 Comparing natively connected data with flat disconnected data
Even at this stage, the graph shows its flexibility. Imagine that a number of customers bought a product that had to be recalled. In the document case we’d run a query (typically using a map/reduce framework) that grabs the document for each customer and checks whether a customer has the identifier for the defective product in their purchase history. This is a big undertaking if each customer has to be checked (though thankfully because it’s an embarrassingly parallel operation we can throw hardware at the problem). We could also design a clever indexing scheme, provided we can tolerate the write latency and space costs that indexing implies.
With Neo4j, all we need to do is locate the product (by graph traversal or index lookup) and look for incoming PURCHASED relations to determine immediately which customers need to be informed about the product recall. Easy!
Of course system requirements are rarely static and it’s a simple step from a basic e-commerce application to evolve a social aspect to shopping so that customers can receive buying recommendations based on what their social group has purchased (and by extension people like them, people who live in their area, and people whose purchase history is similar) as we see in Figure 5.
Figure 5 Social recommendations and selling are simple with graphs
In the aggregate store, we now have to encode the notion of friends and even friends of friends into the store and into the business logic that reifies the graph. This is where things start to get tricky since now we have a deeper traversal from a customer to customers (friends) to customers (friends of friends) and then into purchases. What initially seemed simple is now starting to look dauntingly like a fully-fledged graph store, albeit one we have to build and maintain.
Conversely with Neo4j we simply use the FRIEND relationships between customers, and for recommendations we simply traverse the graph across all outgoing FRIEND relationships (limited to depth 1 for immediate friends, or depth 2 for friends-of-friends), and for outgoing PURCHASED relationships to see what they’ve bought. What’s important here is that it’s Neo4j that handles the hard work of traversing the graph, not the application code, and Neo4j is capable of traversing millions of relationships per second on commodity hardware.
But there’s much more value the e-commerce site can drive from this data. Not only can social recommendations be implemented by the activities of close friends, but the e-commerce site can also start to look for trends and base recommendations on them. This is precisely the kind of thing that supermarket loyalty schemes do with big iron and long-running SQL queries – yet Neo4j can achieve better results on commodity hardware (even down to the point of sale terminal itself).
For example, one set of customers that we might want to incentivize are those people who we think are young performers. These are customers that perhaps have told us something about their age, and we’ve noticed a particular buying pattern surrounding them – they buy DJ-quality headphones. Often those same customers buy DJ-quality decks too, but there’s a potentially profitable set of those customers that – shockingly – don’t yet own decks (much to the gratitude of their roommates and neighbors I suspect).
With an aggregate-oriented database, looking for this pattern by trawling through all customer documents and projecting a graph is laborious, even with map/reduce frameworks to automate the plumbing code and parallelize the work. But matching these patterns in a graph is quite straightforward and efficient – simply by specifying a prototype to match against and then by efficiently traversing the graph structure looking for matches as we see in Figure 6.
Figure 6 Sophisticated analytics patterns are made simple with graphs
This shows a wonderful emergent property of graphs – simply store all the data you like as nodes and relationships in Neo4j and later you’ll be able to extract useful business information that perhaps you can’t imagine today, without the performance penalties associated with joins on large datasets or latency associated with batch processing on external map/reduce frameworks.
In these kinds of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you’ve designed your graph topology far too early in the system lifecycle and lose the ability to evolve the structure and perform business intelligence on your data. While in some edge cases non-functional concerns to drive you towards such a solution (e.g. taking advantage of the write throughput of Apache Cassandra or operational scale of Riak), in general you’re better off with a graph database.
That’s why Neo4j is cool – it keeps graph and application concerns separate, and allows you to defer data modeling decisions to more responsible points throughout the lifetime of your application. So if you’re looking to drive value form your graph data then try Neo4j.