Creating and using a Neo4j Graph Model for specific use cases

Lately I’ve been busy talking at conferences to tell people about our way to create large Neo4j databases. Large means some tens of millions of nodes and hundreds of millions of relationships and billions of properties.

Although the technical description is already on the Xebia blog part 1 and part 2, I would like to give a more functional view on what we did and why we started doing it in the first place.

Our use case consisted of exploring our data to find interesting patterns. The data we want to explore is about financial transactions between people, so the Neo4j graph model is a good fit for us. Because we don’t know upfront what we are looking for we need to create a Neo4j database with some parts of the data and explore that. When there is nothing interesting to find we go enhance our data to contain new information and possibly new connections and create a new Neo4j database with the extra information.

This means it’s not about a one time load of the current data and keep that up to date by adding some more nodes and edges. It’s really about building a new database from the ground up everytime we think of some new way to look at the data.

First try without hadoop

Before we created our Hadoop based solution, we used the batchimport framework provided with Neo4j (the batch inserter API). This allows you to insert a large amount of nodes and edges without the transactional overhead (Neo4j is ACID compliant). The batch importer API is a very good fit for the medium sized graphs, or the one time imports of large datasets, but in our case recreating multiple databases a day, the running time was too long.

Scaling out

To speed the process we wanted to use our Hadoop cluster. If we could make the process of creating a Neo4j database work in a distributed way, we could make use of the total amount of cluster machines instead of the single machine batchimporter.

But how do you go about that? The batch import framework was built upon the idea of having a single place to store the data. Having a server running somewhere the cluster could connect to had multiple downsides:

  • How to handle downtime of the Neo4j server
  • You’re back to being transactional
  • You need to check if nodes are already existing

So the idea became to build the database really from the ground up. Would it be possible to build the underlying filestructure without having the need of Neo4j running somewhere? Would be cool right?
Read the full article.

Here is the accompanying video: